Improve Lucene.Net analyzer

Question

I'm using Lucene.Net and the Snowball analyzer in an ASP.NET application.

With the specific language I'm using, I have the following issue: two specific words with different meanings produce the same result after stemming, so a search for either of them returns results for both.

How can I teach the analyzer either not to stem these two words or, while still stemming them, to know that they have different meanings?

score 0 · Answer 1 · answered Feb 17 '14 at 13:28

I am working from memory here, but as I recall, one of the constructors lets you pass an array of stopwords, which will stop the passed-in words from being stemmed.

Lord Darth Vader

As far as I understand it, stop words are ignored during search. That's not what I want. I want to be able to search for these two words. What I'm missing is the analyzer's ability to differentiate between them, because after stemming they are equal. – Gnomo Feb 17 '14 at 14:44

score 0 · Answer 2 · answered Feb 19 '14 at 00:09

With Lucene 4.0, EnglishAnalyzer now has this ability, since it has a constructor which takes a stemExclusionSet.

Of course, Lucene.Net isn't up to Lucene 4 yet, so fat lot of good that does.

However, EnglishAnalyzer does this by using a KeywordMarkerFilter. So you can create your own Analyzer, overriding the tokenStream method, and adding into the chain a KeywordMarkerFilter just before the SnowballFilter.

Something like:

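// Sketch only: stopSet, stemExclusionSet and name are assumed to be fields of
// your custom Analyzer, and constructor arguments differ between versions.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    if (stopSet != null)
        result = new StopFilter(result, stopSet);
    // Mark the excluded terms as keywords so the stemmer leaves them alone.
    result = new KeywordMarkerFilter(result, stemExclusionSet);
    // name is the Snowball stemmer name for your language.
    result = new SnowballFilter(result, name);
    return result;
}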

You'll need to construct your own stemExclusionSet (see CharArraySet).
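If it helps to see the idea in isolation, here is a minimal, Lucene-free sketch of what the keyword marker step buys you. Everything in it is made up for illustration: `StemExclusionDemo`, `toyStem` and `analyze` are hypothetical names, and the toy stemmer only strips a trailing "s", so it is a stand-in for Snowball rather than its actual algorithm. It assumes Java 9+ for `Set.of`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StemExclusionDemo {

    // Hypothetical toy stemmer: strips one trailing "s" from words longer
    // than three characters. A stand-in for Snowball, not its algorithm.
    static String toyStem(String token) {
        return token.length() > 3 && token.endsWith("s")
                ? token.substring(0, token.length() - 1)
                : token;
    }

    // Lowercases each whitespace-separated token, then stems it unless it
    // appears in the exclusion set (the role the keyword marker plays).
    static List<String> analyze(String text, Set<String> stemExclusionSet) {
        List<String> terms = new ArrayList<>();
        for (String raw : text.split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // leading whitespace yields an empty first piece
            }
            String token = raw.toLowerCase(Locale.ROOT);
            terms.add(stemExclusionSet.contains(token) ? token : toyStem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        // Without exclusions both words collapse to the same term.
        System.out.println(analyze("News new", Set.of()));       // prints [new, new]
        // With "news" excluded, the two words stay distinct.
        System.out.println(analyze("News new", Set.of("news"))); // prints [news, new]
    }
}
```

Note that the exclusion set is matched against the lowercased token, which mirrors placing the marker after the lowercase filter in the chain above. Whatever analyzer you end up with, use the same one for indexing and for searching, otherwise the indexed terms and the query terms won't line up.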
