Package org.lionsoul.jcseg.segmenter
Class NLPSeg
- java.lang.Object
  - org.lionsoul.jcseg.segmenter.Segmenter
    - org.lionsoul.jcseg.segmenter.ComplexSeg
      - org.lionsoul.jcseg.segmenter.NLPSeg
- All Implemented Interfaces:
Serializable, ISegment
public class NLPSeg extends ComplexSeg
NLP segmentation implementation. It inherits all the behavior of the complex segmenter (ComplexSeg); the additional logic is built for NLP only.
- Author:
- chenxin
- See Also:
- Serialized Form
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.lionsoul.jcseg.ISegment
ISegment.Type
-
-
Field Summary
-
Fields inherited from class org.lionsoul.jcseg.segmenter.Segmenter
behindLatin, config, ctrlMask, dic, iaList, idx, isb, reader, subWordPool, wordPool
-
Fields inherited from interface org.lionsoul.jcseg.ISegment
CHECK_CE_MASk, CHECK_CF_MASK, CHECK_EC_MASK, COMPLEX, COMPLEX_MODE, DELIMITER, DELIMITER_MODE, DETECT, DETECT_MODE, MOST, MOST_MODE, NGRAM, NGRAM_MODE, NLP, NLP_MODE, SIMPLE, SIMPLE_MODE, START_SS_MASK
-
-
Constructor Summary
Constructors
Constructor | Description
NLPSeg(SegmenterConfig config, ADictionary dic) |
-
Method Summary
All Methods | Instance Methods | Concrete Methods
Modifier and Type | Method | Description
protected boolean | enSecondSegFilter(IWord w) | Hook to check and perform the English secondary segmentation.
protected IWord | getNextCJKWord(int c, int pos) | Gets the next CJK word from the current position of the input stream.
protected IWord | getNextDatetimeWord(IWord word, int entityIdx) | Gets and returns the next date-time word.
protected IWord | getNextTheWord(IWord word) | Gets the next the_xxx word, e.g. '第x个', '第x集' ...
protected IWord | getNextTimeMergedWord(IWord word, int eIdx) | Gets and returns the next time-merged date-time word.
IWord | getNumericUnitComposedWord(int numeric, IWord unitWord) |
IWord | next() | Overrides the next method to add date-time entity recognition; the parent's next method is still invoked to get the next token.
protected IWord | nextLatinWord(int c, int pos) | Finds the letter-or-digit word starting at the current position, reading until a character is whitespace or not a letter or digit.
-
Methods inherited from class org.lionsoul.jcseg.segmenter.ComplexSeg
getBestChunk, printChunks
-
Methods inherited from class org.lionsoul.jcseg.segmenter.Segmenter
appendCJKWordFeatures, appendLatinWordFeatures, enSecondSeg, enWordSeg, findCHName, getConfig, getDict, getNextLatinWord, getNextMatch, getNextMixedWord, getNextPunctuationPairWord, getPairPunctuationText, getStreamPosition, nextCJKSentence, nextCNNumeric, nextLatinString, nextLetterNumber, nextOtherNumber, pushBack, pushBack, readNext, reset, wordNewOrClone
-
-
-
-
Constructor Detail
-
NLPSeg
public NLPSeg(SegmenterConfig config, ADictionary dic)
-
-
Method Detail
-
next
public IWord next() throws IOException
Overrides the next method to add date-time entity recognition. The parent's next method is still invoked to get the next token.
- Specified by:
- next in interface ISegment
- Overrides:
- next in class Segmenter
- Throws:
- IOException
- See Also:
- Segmenter.next()
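The override pattern described above (call the parent's next method, then layer date-time recognition on top) can be sketched without Jcseg. The sketch below is not the NLPSeg implementation: `BaseTokenizer`, `DateTimeTokenizer`, the whitespace tokenization, and the date and time patterns are all hypothetical names and assumptions chosen for illustration. It only shows how a subclass might look one token ahead, merge a date with a following time, and push the token back when no merge applies.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical subclass: overrides next() to merge a date token with a following time token.
class DateTimeTokenizer extends BaseTokenizer {
    // Tokens read ahead but not consumed; served before asking the parent again.
    private final Deque<String> pushedBack = new ArrayDeque<>();

    DateTimeTokenizer(String text) {
        super(text);
    }

    @Override
    String next() {
        String token = fetch();
        // Anything that is not a yyyy-MM-dd date is passed through unchanged.
        if (token == null || !token.matches("\\d{4}-\\d{2}-\\d{2}")) {
            return token;
        }
        String following = fetch();
        if (following != null && following.matches("\\d{2}:\\d{2}")) {
            return token + " " + following; // merged date-time entity
        }
        if (following != null) {
            pushedBack.push(following); // not a time: return it on the next call
        }
        return token;
    }

    // Takes a pushed-back token first, otherwise delegates to the parent's next().
    private String fetch() {
        return pushedBack.isEmpty() ? super.next() : pushedBack.pop();
    }

    public static void main(String[] args) {
        DateTimeTokenizer t = new DateTimeTokenizer("meet 2024-01-05 09:30 here");
        for (String tok = t.next(); tok != null; tok = t.next()) {
            System.out.println(tok);
        }
    }
}

// Hypothetical stand-in for a parent segmenter: yields whitespace-separated tokens.
class BaseTokenizer {
    private final String[] parts;
    private int idx = 0;

    BaseTokenizer(String text) {
        String trimmed = text.trim();
        parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    // Returns the next token, or null at end of input.
    String next() {
        return idx < parts.length ? parts[idx++] : null;
    }
}
```

Keeping the lookahead in a small push-back queue means the parent's tokenization logic stays untouched, which is the main reason to override rather than reimplement.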
-
getNextTheWord
protected IWord getNextTheWord(IWord word) throws IOException
Gets the next the_xxx word, e.g. '第x个', '第x集' ...
- Parameters:
- word
- Returns:
- IWord
- Throws:
- IOException
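To make the '第x个' and '第x集' shapes concrete, here is a minimal regex sketch that finds such words in a string. It is an illustration only and does not reflect how NLPSeg does it (the real method works on IWord tokens from the segmenter). The class name `TheWordSketch`, the method `findTheWords`, and the small numeral and unit sets are assumptions. Characters are written as Unicode escapes so the source compiles under any file encoding: U+7B2C is 第, U+4E2A is 个, U+96C6 is 集.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: matches the ordinal prefix U+7B2C, a numeral run, and one unit character.
class TheWordSketch {
    // ASCII digits plus the Chinese numerals one to ten (illustrative subset only).
    private static final String NUMERALS =
        "0-9\u4e00\u4e8c\u4e09\u56db\u4e94\u516d\u4e03\u516b\u4e5d\u5341";

    // Unit characters U+4E2A and U+96C6, the two units shown in the description above.
    private static final Pattern THE_WORD =
        Pattern.compile("\u7b2c[" + NUMERALS + "]+[\u4e2a\u96c6]");

    // Returns every substring of text that has the prefix + numeral + unit shape.
    static List<String> findTheWords(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = THE_WORD.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Input contains two candidates: prefix + "3" + U+96C6, and prefix + twelve + U+4E2A.
        System.out.println(findTheWords("a\u7b2c3\u96c6b\u7b2c\u5341\u4e8c\u4e2a"));
    }
}
```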
-
getNextTimeMergedWord
protected IWord getNextTimeMergedWord(IWord word, int eIdx) throws IOException
Gets and returns the next time-merged date-time word.
- Parameters:
- word
- eIdx
- Returns:
- IWord
- Throws:
- IOException
-
getNextDatetimeWord
protected IWord getNextDatetimeWord(IWord word, int entityIdx) throws IOException
Gets and returns the next date-time word.
- Parameters:
- word
- entityIdx
- Returns:
- IWord
- Throws:
- IOException
-
getNextCJKWord
protected IWord getNextCJKWord(int c, int pos) throws IOException
Description copied from class: Segmenter
Gets the next CJK word from the current position of the input stream.
- Overrides:
- getNextCJKWord in class Segmenter
- Returns:
- IWord; may be null, which means a stop word was reached
- Throws:
- IOException
- See Also:
- Segmenter.getNextCJKWord(int, int)
-
enSecondSegFilter
protected boolean enSecondSegFilter(IWord w)
Description copied from class: Segmenter
Hook to check and perform the English secondary segmentation. Override this method to control the secondary segmentation logic.
- Overrides:
- enSecondSegFilter in class Segmenter
- Returns:
- boolean
- See Also:
- Segmenter.enSecondSegFilter(IWord)
-
nextLatinWord
protected IWord nextLatinWord(int c, int pos) throws IOException
Finds the letter-or-digit word starting at the current position, reading until a character is whitespace or is not a letter or digit.
- Overrides:
- nextLatinWord in class Segmenter
- Parameters:
- c
- pos
- Returns:
- IWord
- Throws:
- IOException
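The scanning rule described for this method can be sketched on a plain string. This is an assumption-laden illustration, not the Jcseg code: the real method reads from the segmenter's input stream and returns an IWord, whereas the hypothetical `scanLatinWord` below takes a String and a start index and returns the matched text. Treating "letter or digit" as ASCII letters and digits is also an assumption.

```java
// Hypothetical sketch of the "read until whitespace or non letter/digit" rule.
class LatinScanSketch {
    // Returns the run of ASCII letters/digits in text that starts at pos (possibly empty).
    static String scanLatinWord(String text, int pos) {
        int end = pos;
        while (end < text.length()) {
            char ch = text.charAt(end);
            // Stop at whitespace or at any character that is not a letter or digit.
            // The whitespace test is redundant here but mirrors the description.
            if (Character.isWhitespace(ch) || !isAsciiLetterOrDigit(ch)) {
                break;
            }
            end++;
        }
        return text.substring(pos, end);
    }

    // ASCII-only check, so CJK characters also end the word.
    static boolean isAsciiLetterOrDigit(char ch) {
        return (ch >= 'a' && ch <= 'z')
            || (ch >= 'A' && ch <= 'Z')
            || (ch >= '0' && ch <= '9');
    }

    public static void main(String[] args) {
        System.out.println(scanLatinWord("jcseg2 rocks", 0));
        System.out.println(scanLatinWord("abc,def", 4));
    }
}
```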
-
-