Class Segmenter

  • All Implemented Interfaces:
    ISegment
    Direct Known Subclasses:
    ComplexSeg, MostSeg, SimpleSeg

    public abstract class Segmenter
    extends Object
    implements ISegment
    Abstract segmentation superclass. It (1) implements the ISegment interface and (2) implements the common functions shared by the simple, complex, and most segmentation algorithms.
    Author:
    chenxin
    • Field Detail

      • idx

        protected int idx
        The index of the current position in the input stream, mainly used to track the start position of each token.
      • wordPool

        protected final LinkedList<IWord> wordPool
        CJK word cache pool, reusable string buffer, and array list for basic integers.
      • behindLatin

        protected String behindLatin
        The Latin word that follows the CJK word, kept globally. Added on 2016/11/22 for better mixed-word handling.
      • ctrlMask

        protected int ctrlMask
        Bit mask that controls segmentation functions at runtime.
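
A control mask of this kind is usually an int whose bits each switch one feature on or off. The sketch below shows the general pattern only; the class name and the two flag names are hypothetical and are not Jcseg's actual constants.

```java
// Hypothetical illustration only; this class is not part of Jcseg.
class CtrlMaskSketch {
    // Hypothetical flag names; Jcseg defines its own mask constants.
    static final int SKIP_STOP_WORDS = 1 << 0;
    static final int APPEND_PINYIN = 1 << 1;

    /** Returns the mask with the given flag switched on. */
    static int enable(int mask, int flag) {
        return mask | flag;
    }

    /** Returns the mask with the given flag switched off. */
    static int disable(int mask, int flag) {
        return mask & ~flag;
    }

    /** Tells whether the given flag is set in the mask. */
    static boolean isEnabled(int mask, int flag) {
        return (mask & flag) != 0;
    }

    public static void main(String[] args) {
        int mask = 0;
        mask = enable(mask, APPEND_PINYIN);
        System.out.println(isEnabled(mask, APPEND_PINYIN));   // true
        System.out.println(isEnabled(mask, SKIP_STOP_WORDS)); // false
    }
}
```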
      • dic

        public final ADictionary dic
        the dictionary and task configuration instance
    • Constructor Detail

      • Segmenter

        public Segmenter​(SegmenterConfig config,
                         ADictionary dic)
        Initializes the segmenter.
        Parameters:
        config - Jcseg task configuration instance
        dic - Jcseg dictionary instance
    • Method Detail

      • readNext

        protected int readNext()
                        throws IOException
        read the next char from the current position
        Throws:
        IOException
      • pushBack

        protected void pushBack​(int data)
        push back the data to the stream.
        Parameters:
        data - the char to push back
      • pushBack

        protected void pushBack​(String str)
        push back a string to the stream
        Parameters:
        str - the string to push back
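
The read-then-push-back pattern behind readNext and pushBack can be illustrated with the standard java.io.PushbackReader. This is a minimal sketch of the idea, not Jcseg's implementation; the class PushbackSketch and its methods are hypothetical.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical illustration only; this class is not part of Jcseg.
class PushbackSketch {

    /** Reads leading ASCII digits; the first non-digit is pushed back, not consumed. */
    static String readDigits(PushbackReader reader) throws IOException {
        StringBuilder digits = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) {
            if (c < '0' || c > '9') {
                reader.unread(c); // return the lookahead char to the stream
                break;
            }
            digits.append((char) c);
        }
        return digits.toString();
    }

    /** Returns { leading digits, the char that the next read sees } for the input. */
    static String[] demo(String input) {
        try (PushbackReader reader = new PushbackReader(new StringReader(input), 8)) {
            String digits = readDigits(reader);
            int next = reader.read();
            return new String[] { digits, next == -1 ? "" : String.valueOf((char) next) };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String[] result = demo("2013qq");
        System.out.println(result[0] + " | " + result[1]); // prints: 2013 | q
    }
}
```

Pushing the lookahead char back lets the next token reader start from a clean position without tracking a separate "pending char" variable.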
      • getStreamPosition

        public int getStreamPosition()
        Description copied from interface: ISegment
        get the current length of the stream
        Specified by:
        getStreamPosition in interface ISegment
      • getDict

        public ADictionary getDict()
        get the current dictionary instance.
        Returns:
        ADictionary
      • getConfig

        public SegmenterConfig getConfig()
        get the current task configuration instance.
      • getNextCJKWord

        protected IWord getNextCJKWord​(int c,
                                       int pos)
                                throws IOException
        get the next CJK word from the current position of the input stream
        Parameters:
        c -
        pos -
        Returns:
        an IWord, or null, which means a stop word was reached
        Throws:
        IOException
      • getNextLatinWord

        protected IWord getNextLatinWord​(int c,
                                         int pos)
                                  throws IOException
        get the next Latin word from the current position of the input stream
        Parameters:
        c -
        pos -
        Returns:
        an IWord, or null, which means a stop word was reached
        Throws:
        IOException
      • getNextMixedWord

        protected IWord getNextMixedWord​(char[] chars,
                                         int cjkidx)
                                  throws IOException
        Gets the next mixed word, such as a CJK-English or CJK-English-CJK combination.
        Parameters:
        chars -
        cjkidx -
        Returns:
        an IWord, or null if nothing was found
        Throws:
        IOException
      • getNextPunctuationPairWord

        protected IWord getNextPunctuationPairWord​(int c,
                                                   int pos)
                                            throws IOException
        get the next punctuation pair word from the current position of the input stream.
        Parameters:
        c -
        pos -
        Returns:
        an IWord, or null, which means a stop word was reached
        Throws:
        IOException
      • appendCJKWordFeatures

        protected void appendCJKWordFeatures​(IWord word)
        Checks for and appends the pinyin and synonym words of the specified word.
        Parameters:
        word -
      • appendLatinWordFeatures

        protected void appendLatinWordFeatures​(IWord w)
        Checks for and appends the synonym and pinyin words of the specified word, including CJK and basic Latin words. All synonym words share the same position, part of speech, and word type as the original word.
        Parameters:
        w -
      • enSecondSegFilter

        protected boolean enSecondSegFilter​(IWord w)
        Hook for checking and performing English secondary segmentation. Override this method to control the secondary segmentation logic.
        Parameters:
        w -
        Returns:
        boolean
      • enSecondSeg

        protected LinkedList<IWord> enSecondSeg​(IWord w,
                                                LinkedList<IWord> wList)

        Performs the secondary split for the specified complex Latin word. This splits a word composed of English letters, Arabic numerals, and punctuation into multiple simple parts; for example, 'qq2013' is split into 'qq' and '2013'.

        All the sub-words share the same type and part of speech as the original word. Check config.EN_SECOND_SEG before invoking this method.

        Parameters:
        w -
        wList -
        Returns:
        a LinkedList of all the sub-word tokens
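
The 'qq2013' example can be reproduced with a simple split on character-class boundaries. The sketch below is an assumption about the general technique, not Jcseg's code: it works on plain strings rather than IWord objects, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this class is not part of Jcseg.
class SecondarySplitSketch {

    /** Classifies a char as digit (0), letter (1), or anything else (2). */
    private static int kind(char c) {
        if (Character.isDigit(c)) {
            return 0;
        }
        if (Character.isLetter(c)) {
            return 1;
        }
        return 2;
    }

    /** Splits a word wherever the character class changes. */
    static List<String> split(String word) {
        List<String> parts = new ArrayList<>();
        int start = 0;
        for (int i = 1; i <= word.length(); i++) {
            // Close the current part at the end of the word or at a class boundary.
            if (i == word.length() || kind(word.charAt(i)) != kind(word.charAt(start))) {
                parts.add(word.substring(start, i));
                start = i;
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(split("qq2013")); // prints: [qq, 2013]
    }
}
```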
      • enWordSeg

        protected LinkedList<IWord> enWordSeg​(IWord w,
                                              LinkedList<IWord> wList)
        Lexicon-based English word segmentation for Latin words.
        Parameters:
        w -
        wList -
        Returns:
        a LinkedList of all the sub-word tokens
      • getNextMatch

        protected IWord[] getNextMatch​(int maxLen,
                                       char[] chars,
                                       int index,
                                       List<IWord> wList)
        match the next CJK word in the dictionary
        Parameters:
        maxLen -
        chars -
        index -
        wList -
        Returns:
        IWord[]
      • findCHName

        protected String findCHName​(char[] chars,
                                    int index,
                                    IChunk chunk)
        Finds a Chinese name starting at the current position of the input chars.
        Parameters:
        chars -
        index -
        chunk -
        Returns:
        String
      • nextCJKSentence

        protected char[] nextCJKSentence​(int c)
                                  throws IOException
        Loads a run of CJK chars from the stream, starting at the current position and stopping at the first char that is not a CJK char.
        Parameters:
        c -
        Returns:
        char[]
        Throws:
        IOException
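
Collecting a run of CJK chars comes down to scanning forward while a "is this CJK?" test holds. The sketch below assumes that test is "belongs to the Unicode Han script", using the standard Character.UnicodeScript API; Jcseg's own CJK check may differ, and the class and method names are hypothetical.

```java
// Hypothetical illustration only; this class is not part of Jcseg.
class CjkRunSketch {

    /** Assumed CJK test: the char belongs to the Unicode Han script. */
    static boolean isHan(char c) {
        return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
    }

    /** Counts consecutive Han chars in text, starting at index start. */
    static int cjkRunLength(CharSequence text, int start) {
        int i = start;
        while (i < text.length() && isHan(text.charAt(i))) {
            i++;
        }
        return i - start;
    }

    public static void main(String[] args) {
        // Two Han chars (zhong, wen) followed by Latin letters.
        String text = "\u4E2D\u6587abc";
        System.out.println(cjkRunLength(text, 0)); // prints: 2
    }
}
```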
      • nextLatinWord

        protected IWord nextLatinWord​(int c,
                                      int pos)
                               throws IOException
        Finds the letter-or-digit word starting at the current position, reading until the char is whitespace or is not a letter or digit.
        Parameters:
        c -
        pos -
        Returns:
        IWord
        Throws:
        IOException
      • nextLatinString

        protected String nextLatinString​(int c)
                                  throws IOException
        A simple version of the basic Latin fetch logic: returns the next Latin string together with the kept punctuation that follows it.
        Parameters:
        c -
        Returns:
        String
        Throws:
        IOException
      • nextLetterNumber

        protected String nextLetterNumber​(int c)
                                   throws IOException
        Finds the next letter number starting at the current position, reading until the char is whitespace or is not a letter number.
        Parameters:
        c -
        Returns:
        String
        Throws:
        IOException
      • nextOtherNumber

        protected String nextOtherNumber​(int c)
                                  throws IOException
        Finds the other number starting at the current position, reading until the char is whitespace or is not an other number.
        Parameters:
        c -
        Returns:
        String
        Throws:
        IOException
      • nextCNNumeric

        protected String nextCNNumeric​(char[] chars,
                                       int index)
                                throws IOException
        Finds the Chinese number starting at the current position, reading until the char is whitespace or is not an other number.
        Parameters:
        chars - char array of CJK items
        index -
        Returns:
        String
        Throws:
        IOException
      • getPairPunctuationText

        protected String getPairPunctuationText​(int c)
                                         throws IOException
        Finds the matching pair punctuation for the given punctuation char in order to get the text between them.
        Parameters:
        c -
        Throws:
        IOException
      • wordNewOrClone

        public IWord wordNewOrClone​(int t,
                                    String str,
                                    int type)
        Checks whether the specified word exists in the specified dictionary; if it does, clones it, otherwise creates a new one. Cloning matters because the clone inherits all the features of the original word item, including part of speech, pinyin, and synonyms.
        Parameters:
        t -
        str -
        type -
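
The clone-or-create idea can be sketched with a plain map standing in for the dictionary. Everything here is hypothetical (the Entry class, its single partOfSpeech feature, and the method name); it only illustrates why a clone carries over features that a freshly created word would lack.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; these types are not part of Jcseg.
class NewOrCloneSketch {

    /** Minimal stand-in for a dictionary word with one feature. */
    static final class Entry {
        final String value;
        final String partOfSpeech; // null when unknown

        Entry(String value, String partOfSpeech) {
            this.value = value;
            this.partOfSpeech = partOfSpeech;
        }

        /** Returns an independent copy that keeps the features. */
        Entry copy() {
            return new Entry(value, partOfSpeech);
        }
    }

    /** Copies the dictionary entry when present, otherwise creates a bare one. */
    static Entry newOrClone(Map<String, Entry> dict, String str) {
        Entry found = dict.get(str);
        return found != null ? found.copy() : new Entry(str, null);
    }

    public static void main(String[] args) {
        Map<String, Entry> dict = new HashMap<>();
        dict.put("jcseg", new Entry("jcseg", "n"));
        System.out.println(newOrClone(dict, "jcseg").partOfSpeech);   // prints: n
        System.out.println(newOrClone(dict, "unknown").partOfSpeech); // prints: null
    }
}
```

Returning a copy rather than the dictionary's own object means the caller can set per-token data such as the position without altering the shared entry.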
      • getBestChunk

        protected IChunk getBestChunk​(char[] chars,
                                      int index,
                                      int maxLen)
        Abstract method that gets the word chunk at the current position using the MMSEG algorithm. SimpleSeg and ComplexSeg handle this differently, so it is declared abstract here.
        Parameters:
        chars -
        index -
        maxLen -
        Returns:
        IChunk
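
The simplest form of MMSEG is forward maximum matching: at each position, take the longest dictionary word of at most maxLen chars. The sketch below illustrates only that simple rule, with a plain Set as the lexicon and a String result instead of IChunk; the class and method names are hypothetical, and it does not reproduce how SimpleSeg or ComplexSeg are actually written.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical illustration only; this class is not part of Jcseg.
class MaxMatchSketch {

    /**
     * Returns the longest lexicon word that starts at index and spans at most
     * maxLen chars, or the single char at index when nothing longer matches.
     * Assumes 0 <= index < chars.length.
     */
    static String longestMatch(Set<String> lexicon, char[] chars, int index, int maxLen) {
        int end = Math.min(chars.length, index + maxLen);
        // Try the longest candidate first and shrink one char at a time.
        for (int e = end; e > index + 1; e--) {
            String candidate = new String(chars, index, e - index);
            if (lexicon.contains(candidate)) {
                return candidate;
            }
        }
        return String.valueOf(chars[index]);
    }

    public static void main(String[] args) {
        // Lexicon: yanjiu, yanjiusheng, shengming (written as Unicode escapes).
        Set<String> lexicon = new HashSet<>(Arrays.asList(
                "\u7814\u7A76", "\u7814\u7A76\u751F", "\u751F\u547D"));
        char[] text = "\u7814\u7A76\u751F\u547D".toCharArray();
        // Picks the three-char word over the two-char word at position 0.
        System.out.println(longestMatch(lexicon, text, 0, 4));
    }
}
```

Simple maximum matching can pick the wrong word when a longer match steals a char from the next word, which is the kind of ambiguity that chunk-based rules are meant to resolve.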