How to tokenize only certain words in Lucene

Question

I'm using Lucene for my project and I need a custom Analyzer.

Here is the code so far:

public class MyCommentAnalyzer extends Analyzer {

    @Override
    protected TokenStreamComponents createComponents( String fieldName, Reader reader ) {

        Tokenizer source = new StandardTokenizer( Version.LUCENE_48, reader );
        TokenStream filter = new StandardFilter( Version.LUCENE_48, source );

        filter = new StopFilter( Version.LUCENE_48, filter, StandardAnalyzer.STOP_WORDS_SET );

        return new TokenStreamComponents( source, filter );
    }

}

I've built it, but now I'm stuck. I need the filter to keep only certain words. This is the opposite of stop-word filtering: instead of removing the terms that appear in a word list, it should keep only the terms that appear in the list, like a prebuilt dictionary. So StopFilter doesn't do what I need, and none of the other filters Lucene provides seems to fit. I think I need to write my own filter, but I don't know how.

Any suggestions?

score 3 · Accepted Answer · answered Jun 10 '14 at 23:41

You're right to look to StopFilter for a starting point, so read the source!

Most of StopFilter's source consists of convenience methods for building the stop set. You can safely ignore all of that (unless you want to keep it around for building your keep set).

Cut all that, and StopFilter boils down to:

public final class StopFilter extends FilteringTokenFilter {

    private final CharArraySet stopWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public StopFilter(Version matchVersion, TokenStream in, CharArraySet stopWords) {
        super(matchVersion, in);
        this.stopWords = stopWords;
    }

    @Override
    protected boolean accept() {
        return !stopWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}

FilteringTokenFilter is a pretty simple class to implement. The key is just the accept method. When it's called for the current term, if it returns true, the term is added to the output stream. If it returns false, the current term is discarded.
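
To make the accept contract concrete, here is a minimal plain-Java model of it. This is a sketch only, not Lucene's actual FilteringTokenFilter: the names FilteringModel, StopModel and currentTerm are hypothetical, and a String field stands in for CharTermAttribute.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Plain-Java model of the accept() contract; NOT Lucene's real class.
abstract class FilteringModel {
    // The term currently under inspection (stands in for CharTermAttribute).
    protected String currentTerm;

    // Return true to pass the current term through, false to discard it.
    protected abstract boolean accept();

    // Walk the input once and emit only the terms that accept() approves.
    final List<String> filter(List<String> input) {
        List<String> output = new ArrayList<>();
        for (String term : input) {
            currentTerm = term;
            if (accept()) {
                output.add(term);
            }
        }
        return output;
    }
}

// Mirrors the stripped-down StopFilter: a term passes if it is NOT in the set.
class StopModel extends FilteringModel {
    private final Set<String> stopWords;

    StopModel(Set<String> stopWords) {
        this.stopWords = stopWords;
    }

    @Override
    protected boolean accept() {
        return !stopWords.contains(currentTerm);
    }
}
```

The subclass only decides yes or no for one term at a time; the base class owns the iteration. That split is why a new filter needs little more than its own accept.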

So the only thing you really need to change in StopFilter is to delete a single character (the `!`), so that accept returns the opposite of what it currently does. It wouldn't hurt to change a few names here and there as well.

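import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.util.CharArraySet;
import org.apache.lucene.analysis.util.FilteringTokenFilter;
import org.apache.lucene.util.Version;

// Inverse of StopFilter: passes through only the terms found in keepWords.
// Package names above are those of the Lucene 4.x line used in the question.
public final class KeepOnlyFilter extends FilteringTokenFilter {

    private final CharArraySet keepWords;
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public KeepOnlyFilter(Version matchVersion, TokenStream in, CharArraySet keepWords) {
        super(matchVersion, in);
        this.keepWords = keepWords;
    }

    @Override
    protected boolean accept() {
        // Keep the current term only if it is in the keep set (no leading '!').
        return keepWords.contains(termAtt.buffer(), 0, termAtt.length());
    }
}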

femtoRgon

Wow, this is awesome!!! This resolves exactly my issue, man. You saved me. Now suppose I want to select not only the keepWords but also their neighborhood, I mean having terms like n-grams. In particular, I need 2-grams and 3-grams. The purpose is to catch adjectives, like "awesome full body" or "great sharpness". Do I have to pass the filter through the NGramFilter class after the KeepOnlyFilter one? – PatrickBateman1981 Jun 11 '14 at 17:01
If you want to *search* for 2-grams and 3-grams, then yes, pass it through `NGramFilter` before this. If your final analyzed representation should be full words, and you just want this filter to match as 2-3 grams, then you'll just want to implement the appropriate matching logic in the `accept` method. – femtoRgon Jun 11 '14 at 17:15
Hi, your answer is nice and **almost** solves my problem. I need to filter out tokens not according to a list of stopWords, but by querying another (external) lexicon. Could you maybe help me with this? For a quick answer, should I just override the `accept` method and do whatever I need inside of it? http://stackoverflow.com/questions/39591094/apache-lucene-how-to-use-tokenstream-to-manually-accept-or-reject-a-token-when – Floran Gmehlin Sep 20 '16 at 13:30
@FloranGmehlin Yes, the accept method is the meat of it. The `charTermAttribute` is the current term, so compare that to whatever you like (keep in mind that you usually want token filters to be pretty fast: any time you parse a query or index a document, this method will often be called *many* times). You should definitely do this, or something like it, instead of resetting and iterating the stream in the `createComponents` method of your analyzer. – femtoRgon Sep 20 '16 at 18:38
@femtoRgon Awesome, thanks for your help and for the quick reply! – Floran Gmehlin Sep 21 '16 at 08:56
@femtoRgon the Lucene module is an awesome beast. And finding your way around its labyrinth is non-trivial: so many classes and methods turn out to be `final`, and it almost seems like it's engineered to scare off mere mortals. But your solution has not only solved my problem, it has also thrown a lot of light on the way some of Lucene's core classes fit together. Tops. – mike rodent Mar 06 '17 at 22:35
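
On the point from the comments about keeping accept fast when it consults an external lexicon: one possible approach, sketched here under assumptions, is to memoize the lexicon's answers so that each distinct term triggers at most one external call. CachedLexicon and its externalLookup parameter are hypothetical names, not part of Lucene, and the cache shown is unbounded and not thread-safe.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;

// Hypothetical helper: remembers answers from a slow external lexicon so that
// repeated terms cost one map lookup instead of one external call.
class CachedLexicon {
    // Assumed to be slow (database, web service, file scan, ...).
    private final Predicate<String> externalLookup;
    private final Map<String, Boolean> cache = new HashMap<>();
    private int externalCalls = 0;

    CachedLexicon(Predicate<String> externalLookup) {
        this.externalLookup = externalLookup;
    }

    // A filter's accept() would call this with the current term.
    boolean contains(String term) {
        Boolean cached = cache.get(term);
        if (cached == null) {
            externalCalls++;
            cached = externalLookup.test(term);
            cache.put(term, cached);
        }
        return cached;
    }

    // Number of times the external lexicon was actually consulted.
    int externalCalls() {
        return externalCalls;
    }
}
```

Inside a filter like KeepOnlyFilter, accept could then return something like `lexicon.contains(termAtt.toString())`. Whether a cache pays off depends on how often terms repeat and how expensive the lexicon is; if the whole lexicon fits in memory, loading it into a set up front is simpler.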
