Apache Lucene: How to use TokenStream to manually accept or reject a token when indexing

Question

I am looking for a way to write a custom index with Apache Lucene (PyLucene to be precise, but a Java answer is fine).

What I would like to do is the following : When adding a document to the index, Lucene will tokenize it, remove stop words, etc. This is usually done with the Analyzer if I am not mistaken.

What I would like to implement is the following : Before Lucene stores a given term, I would like to perform a lookup (say, in a dictionary) to check whether to keep the term or discard it (if the term is present in my dictionary, I keep it, otherwise I discard it).

How should I proceed ?

Here is (in Python) my custom implementation of the Analyzer :

class CustomAnalyzer(PythonAnalyzer):

    def createComponents(self, fieldName, reader):

        source = StandardTokenizer(Version.LUCENE_4_10_1, reader)
        filter = StandardFilter(Version.LUCENE_4_10_1, source)
        filter = LowerCaseFilter(Version.LUCENE_4_10_1, filter)
        filter = StopFilter(Version.LUCENE_4_10_1, filter,
                            StopAnalyzer.ENGLISH_STOP_WORDS_SET)

        ts = tokenStream.getTokenStream()
        token = ts.addAttribute(CharTermAttribute.class_)
        offset = ts.addAttribute(OffsetAttribute.class_)

        ts.reset()

         while ts.incrementToken():
           startOffset = offset.startOffset()
           endOffset = offset.endOffset()
           term = token.toString()
           # accept or reject term 

         ts.end()
         ts.close()

           # How to store the terms in the index now ?

         return ????

Thank you for your guidance in advance !

EDIT 1 : After digging into Lucene's documentation, I figured it had something to do with the TokenStreamComponents. It returns a TokenStream with which you can iterate through the Token list of the field you are indexing.

Now there is something to do with the Attributes that I do not understand. Or more precisely, I can read the tokens, but have no idea how should I proceed afterward.

EDIT 2 : I found this post where they mention the use of CharTermAttribute. However (in Python though) I cannot access or get a CharTermAttribute. Any thoughts ?

EDIT3 : I can now access each term, see update code snippet. Now what is left to be done is actually storing the desired terms...

score 0 · Answer 1 · edited May 23 '17 at 11:51

The way I was trying to solve the problem was wrong. This post and femtoRgon's answer were the solution.

By defining a filter extending PythonFilteringTokenFilter, I can make use of the function accept() (as the one used in the StopFilter for instance).

Here is the corresponding code snippet :

class MyFilter(PythonFilteringTokenFilter):

  def __init__(self, version, tokenStream):
    super(MyFilter, self).__init__(version, tokenStream)
    self.termAtt = self.addAttribute(CharTermAttribute.class_)


  def accept(self):
    term = self.termAtt.toString()
    accepted = False
    # Do whatever is needed with the term
    # accepted = ... (True/False)
    return accepted

Then just append the filter to the other filters (as in the code snipped of the question) :

filter = MyFilter(Version.LUCENE_4_10_1, filter)

Apache Lucene: How to use TokenStream to manually accept or reject a token when indexing

1 Answers1

Linked