Package org.lionsoul.jcseg.segmenter
Class Segmenter
- java.lang.Object
  - org.lionsoul.jcseg.segmenter.Segmenter
- All Implemented Interfaces:
ISegment
- Direct Known Subclasses:
ComplexSeg, MostSeg, SimpleSeg
public abstract class Segmenter extends Object implements ISegment
Abstract segmentation superclass. It (1) implements the ISegment interface and (2) implements all the common functions shared by the simple, complex, and most segmentation algorithms.
- Author:
chenxin
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.lionsoul.jcseg.ISegment
ISegment.Type
-
-
Field Summary
- protected String behindLatin: global Latin word that follows the CJK word; added on 2016/11/22 for a better mixed-word implementation
- SegmenterConfig config
- protected int ctrlMask: segmentation runtime function control mask
- ADictionary dic: the dictionary and task configuration instance
- protected IntArrayList iaList
- protected int idx: the index of the current input stream, mainly used to track the start position of the token
- protected IStringBuffer isb
- protected IPushbackReader reader
- protected LinkedList<IWord> subWordPool
- protected LinkedList<IWord> wordPool: CJK word cache pool, reusable string buffer, and the array list for basic integers
Fields inherited from interface org.lionsoul.jcseg.ISegment
CHECK_CE_MASk, CHECK_CF_MASK, CHECK_EC_MASK, COMPLEX, COMPLEX_MODE, DELIMITER, DELIMITER_MODE, DETECT, DETECT_MODE, MOST, MOST_MODE, NGRAM, NGRAM_MODE, NLP, NLP_MODE, SIMPLE, SIMPLE_MODE, START_SS_MASK
-
-
Constructor Summary
- Segmenter(SegmenterConfig config, ADictionary dic): initializes the segmenter
-
Method Summary
- protected void appendCJKWordFeatures(IWord word): checks and appends the pinyin and synonym words of the specified word
- protected void appendLatinWordFeatures(IWord w): checks and appends the synonym and pinyin words of the specified word, including CJK and basic Latin words; all synonym words share the same position, part of speech, and word type as the primitive word
- protected LinkedList<IWord> enSecondSeg(IWord w, LinkedList<IWord> wList): performs the secondary split of the specified complex Latin word, splitting a word composed of English letters, Arabic numerals, and punctuation into multiple simple parts; for example, 'qq2013' is split into 'qq' and '2013'
- protected boolean enSecondSegFilter(IWord w): hook to check and do the English secondary segmentation
- protected LinkedList<IWord> enWordSeg(IWord w, LinkedList<IWord> wList): Latin word lexicon based English word segmentation
- protected String findCHName(char[] chars, int index, IChunk chunk): finds a Chinese name from the current position of the input chars
- protected IChunk getBestChunk(char[] chars, int index, int maxLen): an abstract method that gets the word at the current position using the MMSEG algorithm
- SegmenterConfig getConfig(): gets the current task configuration instance
- ADictionary getDict(): gets the current dictionary instance
- protected IWord getNextCJKWord(int c, int pos): gets the next CJK word from the current position of the input stream
- protected IWord getNextLatinWord(int c, int pos): gets the next Latin word from the current position of the input stream
- protected IWord[] getNextMatch(int maxLen, char[] chars, int index, List<IWord> wList): matches the next CJK word in the dictionary
- protected IWord getNextMixedWord(char[] chars, int cjkidx): gets the next mixed word, such as CJK-English or CJK-English-CJK
- protected IWord getNextPunctuationPairWord(int c, int pos): gets the next punctuation pair word from the current position of the input stream
- protected String getPairPunctuationText(int c): finds the paired punctuation of the given punctuation char in order to get the text between them
- int getStreamPosition(): gets the current length of the stream
- IWord next(): segments a word from a char array, starting from a specified position
- protected char[] nextCJKSentence(int c): loads a CJK char list from the stream, starting from the current position and continuing until a non-CJK char is reached
- protected String nextCNNumeric(char[] chars, int index): finds the Chinese number starting from the current position, continuing until a char that is whitespace or not an other number is reached
- protected String nextLatinString(int c): simple version of the basic Latin fetch logic that returns the next Latin string together with the kept punctuation after it
- protected IWord nextLatinWord(int c, int pos): finds the letter or digit word starting from the current position, continuing until a char that is whitespace or not a letter or digit is reached
- protected String nextLetterNumber(int c): finds the letter number starting from the current position, continuing until a char that is whitespace or not a letter number is reached
- protected String nextOtherNumber(int c): finds the other number starting from the current position, continuing until a char that is whitespace or not an other number is reached
- protected void pushBack(int data): pushes the data back to the stream
- protected void pushBack(String str): pushes a string back to the stream
- protected int readNext(): reads the next char from the current position
- void reset(Reader input): resets the input stream and reader
- IWord wordNewOrClone(int t, String str, int type): checks whether the specified word exists in the specified dictionary; if it does, clones it, otherwise creates a new one
-
-
-
Field Detail
-
idx
protected int idx
The index of the current input stream, mainly used to track the start position of the token.
-
reader
protected IPushbackReader reader
-
wordPool
protected final LinkedList<IWord> wordPool
CJK word cache pool. The fields that follow include a reusable string buffer and an array list of basic integers.
-
subWordPool
protected final LinkedList<IWord> subWordPool
-
isb
protected final IStringBuffer isb
-
iaList
protected final IntArrayList iaList
-
behindLatin
protected String behindLatin
Global Latin word that follows the CJK word. Added on 2016/11/22 for a better mixed-word implementation.
-
ctrlMask
protected int ctrlMask
segmentation runtime function control mask
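An int control mask such as ctrlMask is typically read and updated with bitwise operations. The following is a minimal sketch of that pattern only; the flag names and bit values are hypothetical and are not the values of the ISegment mask constants.

```java
// Hypothetical flags: the real ISegment mask values are not given on this page.
final class ControlMaskSketch {
    static final int FLAG_A = 1 << 0; // hypothetical feature bit
    static final int FLAG_B = 1 << 1; // hypothetical feature bit

    /** Returns the mask with the given flag switched on. */
    static int enable(int mask, int flag) {
        return mask | flag;
    }

    /** Returns the mask with the given flag switched off. */
    static int disable(int mask, int flag) {
        return mask & ~flag;
    }

    /** Tests whether the given flag is set in the mask. */
    static boolean isSet(int mask, int flag) {
        return (mask & flag) != 0;
    }

    public static void main(String[] args) {
        int mask = 0;
        mask = enable(mask, FLAG_A);
        mask = enable(mask, FLAG_B);
        mask = disable(mask, FLAG_A);
        System.out.println(isSet(mask, FLAG_A)); // false
        System.out.println(isSet(mask, FLAG_B)); // true
    }
}
```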
-
dic
public final ADictionary dic
the dictionary and task configuration instance
-
config
public final SegmenterConfig config
-
-
Constructor Detail
-
Segmenter
public Segmenter(SegmenterConfig config, ADictionary dic)
Initializes the segmenter.
- Parameters:
config - Jcseg task configuration instance
dic - Jcseg dictionary instance
-
-
Method Detail
-
reset
public void reset(Reader input) throws IOException
Resets the input stream and reader.
- Specified by:
reset in interface ISegment
- Parameters:
input
- Throws:
IOException
-
readNext
protected int readNext() throws IOException
Reads the next char from the current position.
- Throws:
IOException
-
pushBack
protected void pushBack(int data)
Pushes the data back to the stream.
- Parameters:
data
-
pushBack
protected void pushBack(String str)
Pushes a string back to the stream.
- Parameters:
str
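The read-then-push-back pattern behind readNext and the two pushBack overloads can be sketched with the JDK's java.io.PushbackReader. This is only a stand-in for Jcseg's IPushbackReader, whose internals are not described on this page; the class and method names below are hypothetical.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Uses java.io.PushbackReader as a stand-in for Jcseg's IPushbackReader.
final class PushBackSketch {
    /** Peeks at up to n leading chars, pushes them back, then reads the whole input. */
    static String peekThenReadAll(String text, int n) {
        try (PushbackReader reader = new PushbackReader(new StringReader(text), 64)) {
            StringBuilder peeked = new StringBuilder();
            int c;
            while (peeked.length() < n && (c = reader.read()) != -1) {
                peeked.append((char) c);
            }
            // Push the whole string back; the next read() returns its first char again.
            reader.unread(peeked.toString().toCharArray());

            StringBuilder all = new StringBuilder();
            while ((c = reader.read()) != -1) {
                all.append((char) c);
            }
            return peeked + "|" + all;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(peekThenReadAll("jcseg2013", 5)); // jcseg|jcseg2013
    }
}
```

Because the pushed-back chars are returned again by later reads, a lookahead that turns out not to match leaves the stream unchanged for the next caller.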
-
getStreamPosition
public int getStreamPosition()
Description copied from interface: ISegment
Gets the current length of the stream.
- Specified by:
getStreamPosition in interface ISegment
-
getDict
public ADictionary getDict()
Gets the current dictionary instance.
- Returns:
ADictionary
-
getConfig
public SegmenterConfig getConfig()
Gets the current task configuration instance.
-
next
public IWord next() throws IOException
Description copied from interface: ISegment
Segments a word from a char array, starting from a specified position.
- Specified by:
next in interface ISegment
- Throws:
IOException
- See Also:
ISegment.next()
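The reset and next methods suggest a pull-style loop: attach an input, then call next repeatedly. The sketch below shows that calling pattern with a toy whitespace tokenizer. It is not Jcseg, the class is hypothetical, and it assumes that next returns null at end of input, which this page does not state.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// A toy whitespace tokenizer, NOT Jcseg: it only mirrors the reset/next calling pattern.
final class PullLoopSketch {
    private Reader reader;

    /** Attaches a new input, in the spirit of reset(Reader). */
    void reset(Reader input) {
        this.reader = input;
    }

    /** Returns the next whitespace-delimited token, or null at end of input (assumed contract). */
    String next() throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (Character.isWhitespace(c)) {
                if (sb.length() > 0) {
                    break; // token finished
                }
            } else {
                sb.append((char) c);
            }
        }
        return sb.length() == 0 ? null : sb.toString();
    }

    /** Drains every token from the text using the reset/next loop. */
    static List<String> tokens(String text) {
        PullLoopSketch seg = new PullLoopSketch();
        List<String> out = new ArrayList<>();
        try {
            seg.reset(new StringReader(text));
            String word;
            while ((word = seg.next()) != null) {
                out.add(word);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokens("jcseg is a segmenter")); // [jcseg, is, a, segmenter]
    }
}
```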
-
getNextCJKWord
protected IWord getNextCJKWord(int c, int pos) throws IOException
Gets the next CJK word from the current position of the input stream.
- Parameters:
c
pos
- Returns:
IWord; may be null, which means a stop word was reached
- Throws:
IOException
-
getNextLatinWord
protected IWord getNextLatinWord(int c, int pos) throws IOException
Gets the next Latin word from the current position of the input stream.
- Parameters:
c
pos
- Returns:
IWord; may be null, which means a stop word was reached
- Throws:
IOException
-
getNextMixedWord
protected IWord getNextMixedWord(char[] chars, int cjkidx) throws IOException
Gets the next mixed word, such as CJK-English or CJK-English-CJK.
- Parameters:
chars
cjkidx
- Returns:
IWord, or null if nothing was found
- Throws:
IOException
-
getNextPunctuationPairWord
protected IWord getNextPunctuationPairWord(int c, int pos) throws IOException
Gets the next punctuation pair word from the current position of the input stream.
- Parameters:
c
pos
- Returns:
IWord; may be null, which means a stop word was reached
- Throws:
IOException
-
appendCJKWordFeatures
protected void appendCJKWordFeatures(IWord word)
Checks and appends the pinyin and synonym words of the specified word.
- Parameters:
word
-
appendLatinWordFeatures
protected void appendLatinWordFeatures(IWord w)
Checks and appends the synonym and pinyin words of the specified word, including CJK and basic Latin words. All synonym words share the same position, part of speech, and word type as the primitive word.
- Parameters:
w
-
enSecondSegFilter
protected boolean enSecondSegFilter(IWord w)
Hook to check and do the English secondary segmentation. Override this method to control the secondary segmentation logic.
- Parameters:
w
- Returns:
boolean
-
enSecondSeg
protected LinkedList<IWord> enSecondSeg(IWord w, LinkedList<IWord> wList)
Performs the secondary split of the specified complex Latin word. This splits a complex word composed of English letters, Arabic numerals, and punctuation into multiple simple parts; for example, 'qq2013' is split into 'qq' and '2013'.
All the sub-words share the same type and part of speech as the primitive word. Check config.EN_SECOND_SEG before invoking this method.
- Parameters:
w
wList
- Returns:
LinkedList containing all the sub-word tokens
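The 'qq2013' example can be reproduced with a short sketch that splits a token wherever the character class changes between letters, digits, and everything else. This illustrates the idea only; it works on plain strings rather than IWord, and the class and method names are hypothetical rather than Jcseg's code.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: splits on character-class changes, not Jcseg's actual implementation.
final class SecondarySplitSketch {
    private static int kind(char ch) {
        if (Character.isLetter(ch)) return 0; // letters
        if (Character.isDigit(ch)) return 1;  // digits
        return 2;                             // punctuation and everything else
    }

    /** Splits a mixed token into maximal runs of letters, digits, or other chars. */
    static List<String> split(String token) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= token.length(); i++) {
            // Close the current run at the end of the token or when the class changes.
            if (i == token.length() || kind(token.charAt(i)) != kind(token.charAt(start))) {
                parts.add(token.substring(start, i));
                start = i;
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("qq2013")); // [qq, 2013]
        System.out.println(split("c++11"));  // [c, ++, 11]
    }
}
```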
-
enWordSeg
protected LinkedList<IWord> enWordSeg(IWord w, LinkedList<IWord> wList)
Latin word lexicon based English word segmentation.
- Parameters:
w
wList
- Returns:
LinkedList containing all the sub-word tokens
-
getNextMatch
protected IWord[] getNextMatch(int maxLen, char[] chars, int index, List<IWord> wList)
Matches the next CJK word in the dictionary.
- Parameters:
maxLen
chars
index
wList
- Returns:
IWord[]
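A dictionary match of this kind can be pictured as collecting every dictionary word that starts at index and is no longer than maxLen. The sketch below does this with a plain Set<String>. The names are hypothetical, and the single-char fallback is an assumption made so that a caller can always advance, not documented Jcseg behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative prefix matching against a toy dictionary, not Jcseg's implementation.
final class NextMatchSketch {
    /** Collects every dictionary word that starts at index and is at most maxLen chars long. */
    static List<String> matches(Set<String> dict, char[] chars, int index, int maxLen) {
        List<String> found = new ArrayList<>();
        int limit = Math.min(maxLen, chars.length - index);
        for (int len = 1; len <= limit; len++) {
            String candidate = new String(chars, index, len);
            if (dict.contains(candidate)) {
                found.add(candidate);
            }
        }
        // Assumed fallback: return the single char so the caller can always advance.
        if (found.isEmpty() && limit > 0) {
            found.add(String.valueOf(chars[index]));
        }
        return found;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("研究", "研究生", "生命"));
        char[] chars = "研究生命起源".toCharArray();
        System.out.println(matches(dict, chars, 0, 4)); // [研究, 研究生]
        System.out.println(matches(dict, chars, 2, 4)); // [生命]
    }
}
```

Returning all candidates rather than only the longest one lets a chunk-based chooser, such as the one getBestChunk stands for, compare alternatives.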
-
findCHName
protected String findCHName(char[] chars, int index, IChunk chunk)
Finds a Chinese name from the current position of the input chars.
- Parameters:
chars
index
chunk
- Returns:
String
-
nextCJKSentence
protected char[] nextCJKSentence(int c) throws IOException
Loads a CJK char list from the stream, starting from the current position and continuing until a non-CJK char is reached.
- Parameters:
c
- Returns:
char[]
- Throws:
IOException
-
nextLatinWord
protected IWord nextLatinWord(int c, int pos) throws IOException
Finds the letter or digit word starting from the current position, continuing until a char that is whitespace or not a letter or digit is reached.
- Parameters:
c
pos
- Returns:
IWord
- Throws:
IOException
-
nextLatinString
protected String nextLatinString(int c) throws IOException
Simple version of the basic Latin fetch logic. It just returns the next Latin string together with the kept punctuation after it.
- Parameters:
c
- Returns:
String
- Throws:
IOException
-
nextLetterNumber
protected String nextLetterNumber(int c) throws IOException
Finds the letter number starting from the current position, continuing until a char that is whitespace or not a letter number is reached.
- Parameters:
c
- Returns:
String
- Throws:
IOException
-
nextOtherNumber
protected String nextOtherNumber(int c) throws IOException
Finds the other number starting from the current position, continuing until a char that is whitespace or not an other number is reached.
- Parameters:
c
- Returns:
String
- Throws:
IOException
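"Letter number" and "other number" read like the Unicode general categories Nl and No, which Java exposes as Character.LETTER_NUMBER and Character.OTHER_NUMBER. Assuming that interpretation, which this page does not confirm, scanning a run of such chars can be sketched on a plain string; the class and method names are hypothetical.

```java
// Illustrative scan by Unicode general category, not Jcseg's implementation.
final class NumberCategorySketch {
    /** Returns the longest run, starting at index, of chars that have the given general category. */
    static String runOf(String text, int index, int category) {
        int end = index;
        while (end < text.length() && Character.getType(text.charAt(end)) == category) {
            end++;
        }
        return text.substring(index, end);
    }

    public static void main(String[] args) {
        // U+2160..U+2162 are Roman numerals (category Nl, LETTER_NUMBER).
        System.out.println(runOf("\u2160\u2161\u2162 abc", 0, Character.LETTER_NUMBER).length()); // 3
        // U+2460..U+2461 are circled digits (category No, OTHER_NUMBER).
        System.out.println(runOf("\u2460\u2461 abc", 0, Character.OTHER_NUMBER).length()); // 2
    }
}
```

Whitespace belongs to neither category, so the scan also stops at whitespace, which matches the stopping condition described above.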
-
nextCNNumeric
protected String nextCNNumeric(char[] chars, int index) throws IOException
Finds the Chinese number starting from the current position, continuing until a char that is whitespace or not an other number is reached.
- Parameters:
chars - char array of CJK items
index
- Returns:
String
- Throws:
IOException
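Scanning a Chinese number out of a char array can be sketched as collecting consecutive numeral chars from index. The numeral set below is a small illustrative sample, and returning null when nothing matches is an assumption; neither is taken from Jcseg.

```java
// Illustrative Chinese numeral scan, not Jcseg's implementation.
final class ChineseNumeralSketch {
    // A small sample of Chinese numeral chars; Jcseg's own table may differ.
    private static final String NUMERALS = "零一二三四五六七八九十百千万亿两";

    /** Returns the run of Chinese numeral chars starting at index, or null if there is none. */
    static String nextNumeric(char[] chars, int index) {
        int end = index;
        while (end < chars.length && NUMERALS.indexOf(chars[end]) >= 0) {
            end++;
        }
        return end == index ? null : new String(chars, index, end - index);
    }

    public static void main(String[] args) {
        System.out.println(nextNumeric("三百五十人".toCharArray(), 0)); // 三百五十
    }
}
```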
-
getPairPunctuationText
protected String getPairPunctuationText(int c) throws IOException
Finds the paired punctuation of the given punctuation char in order to get the text between them.
- Parameters:
c
- Throws:
IOException
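Extracting the text between paired punctuation can be sketched as: look up the closing char for the opening char, read until the closer appears, and push everything back if it does not. The pair table, the length cap, the null result, and the use of java.io.PushbackReader are all assumptions for illustration; the names are hypothetical.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative paired-punctuation reader, not Jcseg's implementation.
final class PairPunctuationSketch {
    private static final Map<Character, Character> PAIRS = new HashMap<>();
    static {
        PAIRS.put('《', '》');
        PAIRS.put('(', ')');
    }
    private static final int MAX_LEN = 32; // hypothetical cap on the enclosed text

    /** c is the opening char already read; returns the text before the closer, or null. */
    static String pairText(PushbackReader reader, int c) throws IOException {
        Character closer = PAIRS.get((char) c);
        if (closer == null) {
            return null; // not an opening punctuation char
        }
        char close = closer;
        StringBuilder sb = new StringBuilder();
        int ch;
        while (sb.length() < MAX_LEN && (ch = reader.read()) != -1) {
            if (ch == close) {
                return sb.toString(); // closer consumed, enclosed text returned
            }
            sb.append((char) ch);
        }
        // No closer within the cap: restore what was read so normal processing can continue.
        reader.unread(sb.toString().toCharArray());
        return null;
    }

    /** Convenience wrapper: treats the first char of text as the opening punctuation. */
    static String textAfterOpening(String text) {
        try (PushbackReader reader = new PushbackReader(new StringReader(text), MAX_LEN)) {
            return pairText(reader, reader.read());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textAfterOpening("《红楼梦》是小说")); // 红楼梦
        System.out.println(textAfterOpening("(unclosed"));       // null
    }
}
```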
-
wordNewOrClone
public IWord wordNewOrClone(int t, String str, int type)
Checks whether the specified word exists in the specified dictionary; if it does, clones it, otherwise creates a new one. Note: cloning is needed because the clone inherits all the features of the original word item, including part of speech, pinyin, and synonyms.
- Parameters:
t
str
type
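The clone-or-create idea can be sketched with a plain Map as the dictionary. The Word class below is a hypothetical stand-in for IWord with only a few fields, and the dictionary selector parameter t is left out; none of this is Jcseg's actual code.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative clone-or-create lookup, not Jcseg's implementation.
final class CloneOrNewSketch {
    /** Hypothetical stand-in for IWord with just enough fields for the example. */
    static final class Word implements Cloneable {
        final String value;
        final int type;
        String partOfSpeech; // feature carried over by clone()
        int position;        // per-token state, set by the caller after cloning

        Word(String value, int type) {
            this.value = value;
            this.type = type;
        }

        @Override
        public Word clone() {
            try {
                return (Word) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // cannot happen: Word is Cloneable
            }
        }
    }

    /** Clones the dictionary entry when present, otherwise creates a bare word. */
    static Word newOrClone(Map<String, Word> dict, String str, int type) {
        Word entry = dict.get(str);
        return entry != null ? entry.clone() : new Word(str, type);
    }

    public static void main(String[] args) {
        Map<String, Word> dict = new HashMap<>();
        Word entry = new Word("jcseg", 1);
        entry.partOfSpeech = "n";
        dict.put("jcseg", entry);

        Word token = newOrClone(dict, "jcseg", 1);
        token.position = 42; // safe: the shared dictionary entry is untouched
        System.out.println(token.partOfSpeech); // n
        System.out.println(entry.position);     // 0
    }
}
```

Cloning keeps the shared dictionary entry immutable from the token's point of view while still carrying over its features.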
-
getBestChunk
protected IChunk getBestChunk(char[] chars, int index, int maxLen)
An abstract method that gets the word at the current position using the MMSEG algorithm. SimpleSeg and ComplexSeg handle this differently, so it is declared as an abstract method here.
- Parameters:
chars
index
maxLen
- Returns:
IChunk
-
-