Package org.lionsoul.jcseg.segmenter
Class NGramSeg
- java.lang.Object
-
- org.lionsoul.jcseg.segmenter.NGramSeg
-
-
Nested Class Summary
-
Nested classes/interfaces inherited from interface org.lionsoul.jcseg.ISegment
ISegment.Type
-
-
Field Summary
Fields
Modifier and Type              Field      Description
SegmenterConfig                config
ADictionary                    dic        the dictionary and task configuration
protected int                  idx        the index value of the current input stream, mainly used to track the start position of the token
protected IStringBuffer        isb
protected byte                 N          the N for n-gram; defaults to 1, i.e. uni-gram
protected IPushbackReader      reader
protected LinkedList<IWord>    wordPool   CJK word cache pool, reusable string buffer
Fields inherited from interface org.lionsoul.jcseg.ISegment
CHECK_CE_MASk, CHECK_CF_MASK, CHECK_EC_MASK, COMPLEX, COMPLEX_MODE, DELIMITER, DELIMITER_MODE, DETECT, DETECT_MODE, MOST, MOST_MODE, NGRAM, NGRAM_MODE, NLP, NLP_MODE, SIMPLE, SIMPLE_MODE, START_SS_MASK
-
-
Constructor Summary
Constructors
Constructor                                          Description
NGramSeg(SegmenterConfig config, ADictionary dic)    method to create a new ISegment
-
Method Summary
All Methods  Instance Methods  Concrete Methods
Modifier and Type    Method                                                     Description
SegmenterConfig      getConfig()
ADictionary          getDic()
byte                 getN()
protected String     getNextType(int c, int type, CharTypeFunction checker)     common interface to get the next n-gram word for the specified char type.
int                  getStreamPosition()                                        get the current length of the stream
IWord                next()                                                     segment a word from a char array from a specified position.
protected void       pushBack(int data)                                         push back the data to the stream
protected int        readNext()                                                 read the next char from the current position
void                 reset(Reader input)                                        reset the reader
void                 setN(byte n)
protected void       streamResetTo(String str, int start)                       reset the data back from the specified position
IWord                wordNewOrClone(int t, String str, int type)                check whether the specified word exists in the specified dictionary; if it does, clone it, otherwise create a new one.
-
-
-
Field Detail
-
idx
protected int idx
the index value of the current input stream, mainly used to track the start position of the token
-
reader
protected IPushbackReader reader
-
wordPool
protected final LinkedList<IWord> wordPool
CJK word cache pool, Reusable string buffer
-
isb
protected final IStringBuffer isb
-
dic
public final ADictionary dic
the dictionary and task configuration
-
config
public final SegmenterConfig config
-
N
protected byte N
The N for n-gram; defaults to 1, i.e. uni-gram
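
To make the role of N concrete, here is a minimal sketch of plain n-gram splitting over a string. It is not the NGramSeg implementation: the class name NGramSketch and the method ngrams are hypothetical, and the sketch assumes the simplest reading of "n-gram", a sliding window of N consecutive chars.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of jcseg.
class NGramSketch {
    /** Returns every run of n consecutive chars in text (a sliding window of width n). */
    static List<String> ngrams(String text, int n) {
        if (n < 1) {
            throw new IllegalArgumentException("n must be >= 1");
        }
        List<String> out = new ArrayList<>();
        // Works on UTF-16 chars, which is enough for characters in the Basic Multilingual Plane.
        for (int i = 0; i + n <= text.length(); i++) {
            out.add(text.substring(i, i + n));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("jcseg", 1)); // uni-grams: [j, c, s, e, g]
        System.out.println(ngrams("jcseg", 2)); // bi-grams:  [jc, cs, se, eg]
    }
}
```

With N = 1 every char becomes its own token; larger values produce overlapping windows, so the token count shrinks by one for each increment of N.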
-
-
Constructor Detail
-
NGramSeg
public NGramSeg(SegmenterConfig config, ADictionary dic)
method to create a new ISegment
Parameters:
config -
dic -
-
-
Method Detail
-
reset
public void reset(Reader input) throws IOException
Description copied from interface: ISegment
reset the reader
Specified by:
reset in interface ISegment
Throws:
IOException
-
getStreamPosition
public int getStreamPosition()
Description copied from interface: ISegment
get the current length of the stream
Specified by:
getStreamPosition in interface ISegment
-
readNext
protected int readNext() throws IOException
read the next char from the current position
Returns:
int
Throws:
IOException
-
pushBack
protected void pushBack(int data)
push back the data to the stream
Parameters:
data -
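
The readNext/pushBack pair follows the usual look-ahead pattern: read one char past the end of a token, then return it to the stream so the next token can start with it. The sketch below shows that pattern with the standard java.io.PushbackReader as a stand-in for jcseg's IPushbackReader, whose exact API is not documented on this page. The class PushbackSketch and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; uses the JDK's PushbackReader, not jcseg's IPushbackReader.
class PushbackSketch {
    /** Reads a run of ASCII digits and pushes back the first non-digit char. */
    static String readDigits(PushbackReader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (c < '0' || c > '9') {
                reader.unread(c); // the look-ahead char belongs to the next token
                break;
            }
            sb.append((char) c);
        }
        return sb.toString();
    }

    /** Returns the leading digits of input and the char that follows them ("" at end of input). */
    static String[] demo(String input) {
        try (PushbackReader reader = new PushbackReader(new StringReader(input))) {
            String digits = readDigits(reader);
            int next = reader.read(); // sees the pushed-back char again
            return new String[] { digits, next == -1 ? "" : String.valueOf((char) next) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] parts = demo("2024abc");
        System.out.println(parts[0]); // 2024
        System.out.println(parts[1]); // a
    }
}
```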
-
streamResetTo
protected void streamResetTo(String str, int start)
reset the data back from the specified position
-
next
public IWord next() throws IOException
Description copied from interface: ISegment
segment a word from a char array from a specified position.
Specified by:
next in interface ISegment
Throws:
IOException
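
The reset, next and getStreamPosition methods together suggest a pull-style loop: bind a Reader, then ask for tokens one at a time. The toy class below sketches that shape with whitespace-delimited tokens. It is entirely hypothetical: it does not use jcseg, it returns String rather than IWord, and it assumes that the end of input is signalled by next() returning null, which this page does not state.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy segmenter; mirrors the method names on this page but not their implementation.
class SegmentLoopSketch {
    private Reader reader;
    private int idx; // number of chars consumed from the current input

    void reset(Reader input) {
        reader = input;
        idx = 0;
    }

    int getStreamPosition() {
        return idx;
    }

    private int readNext() throws IOException {
        int c = reader.read();
        if (c != -1) {
            idx++;
        }
        return c;
    }

    /** Returns the next whitespace-delimited token, or null at end of input (an assumption). */
    String next() throws IOException {
        int c = readNext();
        while (c != -1 && Character.isWhitespace(c)) {
            c = readNext(); // skip leading whitespace
        }
        if (c == -1) {
            return null;
        }
        StringBuilder sb = new StringBuilder();
        while (c != -1 && !Character.isWhitespace(c)) {
            sb.append((char) c);
            c = readNext();
        }
        return sb.toString();
    }

    /** Drains the segmenter over text and collects every token. */
    static List<String> segmentAll(String text) {
        SegmentLoopSketch seg = new SegmentLoopSketch();
        seg.reset(new StringReader(text));
        List<String> out = new ArrayList<>();
        try {
            String word;
            while ((word = seg.next()) != null) {
                out.add(word);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(segmentAll("a  bc d")); // [a, bc, d]
    }
}
```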
-
getNextType
protected String getNextType(int c, int type, CharTypeFunction checker) throws IOException
common interface to get the next n-gram word for the specified char type. For basic Latin chars this automatically applies the full-width to half-width and uppercase to lowercase conversions.
Parameters:
c -
type -
checker -
Returns:
String (the next n-gram word)
Throws:
IOException
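
The full-width to half-width and uppercase to lowercase conversions mentioned for getNextType can be sketched as follows. This is an illustration of the general technique, not jcseg's code: the class LatinNormalizeSketch is hypothetical, and it assumes the common mapping in which the full-width forms U+FF01 to U+FF5E sit exactly 0xFEE0 above ASCII U+0021 to U+007E, with the ideographic space U+3000 mapping to the ASCII space.

```java
// Hypothetical illustration only; the exact rules jcseg applies are not shown on this page.
class LatinNormalizeSketch {
    /** Maps one code point to its half-width, lowercase form when it has one. */
    static int normalize(int c) {
        if (c == 0x3000) {
            c = 0x20;                       // ideographic space -> ASCII space
        } else if (c >= 0xFF01 && c <= 0xFF5E) {
            c -= 0xFEE0;                    // full-width form -> ASCII counterpart
        }
        if (c >= 'A' && c <= 'Z') {
            c += 32;                        // ASCII uppercase -> lowercase
        }
        return c;
    }

    /** Applies the per-code-point mapping to a whole string. */
    static String normalize(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> sb.appendCodePoint(normalize(cp)));
        return sb.toString();
    }

    public static void main(String[] args) {
        // The literal below spells JCSEG in full-width capital letters.
        System.out.println(normalize("\uFF2A\uFF23\uFF33\uFF25\uFF27")); // prints "jcseg"
    }
}
```

Doing the width conversion first means a full-width capital letter passes through both steps, so full-width and half-width spellings of the same Latin word end up as the same token text.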
-
wordNewOrClone
public IWord wordNewOrClone(int t, String str, int type)
check whether the specified word exists in the specified dictionary; if it does, clone it, otherwise create a new one.
Parameters:
t -
str -
type -
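
The clone-or-create behaviour described for wordNewOrClone can be sketched with a plain map standing in for the dictionary. Everything here is hypothetical: Word is a stand-in for IWord, the map is a stand-in for ADictionary, and the extra t argument of the real method is omitted because its meaning is not documented on this page. The likely motivation, stated here as an assumption, is that each returned token carries its own mutable state (such as a position) that must not leak into the shared dictionary entry.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; Word and the Map are stand-ins for IWord and ADictionary.
class WordCloneSketch {
    static final class Word implements Cloneable {
        final String value;
        final int type;
        int position = -1; // per-token state; must stay out of the shared entry

        Word(String value, int type) {
            this.value = value;
            this.type = type;
        }

        @Override
        public Word clone() {
            try {
                return (Word) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // cannot happen: Word is Cloneable
            }
        }
    }

    /** Returns a copy of the dictionary entry for str if present, else a fresh Word. */
    static Word newOrClone(Map<String, Word> dict, String str, int type) {
        Word entry = dict.get(str);
        return entry != null ? entry.clone() : new Word(str, type);
    }

    public static void main(String[] args) {
        Map<String, Word> dict = new HashMap<>();
        dict.put("jcseg", new Word("jcseg", 7));

        Word hit = newOrClone(dict, "jcseg", 1);  // cloned: keeps the entry's type 7
        Word miss = newOrClone(dict, "other", 1); // created: takes the requested type 1
        hit.position = 5;                         // does not touch the dictionary entry
        System.out.println(hit.type + " " + miss.type + " " + dict.get("jcseg").position);
    }
}
```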
-
getDic
public ADictionary getDic()
-
getConfig
public SegmenterConfig getConfig()
-
getN
public byte getN()
-
setN
public void setN(byte n)
-
-