I'm using Lucene.Net and the Snowball analyzer in an ASP.NET application.

With one particular language I have the following issue: two specific words with different meanings stem to the same result, so a search for either of them returns results for both.

How can I teach the analyzer either not to stem these two words or, while still stemming them, to know that they have different meanings?

Gnomo

2 Answers

I am working from memory here, but as I recall, one of the constructors lets you pass an array of stop words, which will stop the passed-in words from being stemmed.

Lord Darth Vader
  • As far as I understand it, stop words are ignored during search. That's not what I want. I want to be able to search for these two words. What I'm missing is the analyzer's ability to differentiate between them, because after stemming they are equal. – Gnomo Feb 17 '14 at 14:44

With Lucene 4.0, EnglishAnalyzer now has this ability, since it has a constructor that takes a stemExclusionSet.

Of course, Lucene.Net isn't up to Lucene 4 yet, so fat lot of good that does.

However, EnglishAnalyzer does this by using a KeywordMarkerFilter. So you can create your own Analyzer, overriding the tokenStream method, and adding into the chain a KeywordMarkerFilter just before the SnowballFilter.

Something like:

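// stopSet, stemExclusionSet and name are assumed to be fields of your Analyzer subclass.
public TokenStream tokenStream(String fieldName, Reader reader) {
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    if (stopSet != null)
        result = new StopFilter(result, stopSet);
    // Mark the protected words as keywords so the stemmer leaves them alone.
    result = new KeywordMarkerFilter(result, stemExclusionSet);
    // name is the Snowball stemmer name for your language.
    result = new SnowballFilter(result, name);
    return result;
}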

You'll need to construct your own stemExclusionSet (see CharArraySet).
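
To make the idea concrete without depending on any particular Lucene version, here is a minimal, self-contained sketch of what the keyword-marker step does. It is not Lucene's API: the class StemExclusionSketch, the methods toyStem and analyze, and the "strip a trailing s" rule are all made up for illustration, and the word pair news/new merely stands in for your two colliding words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy model of the keyword-marker idea. Nothing here is Lucene's API.
class StemExclusionSketch {

    // Placeholder stemmer (an assumption for illustration): strips one
    // trailing "s" from words longer than three characters.
    static String toyStem(String word) {
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Lowercases and splits on whitespace, then stems every token that is
    // not listed in stemExclusionSet. Excluded tokens pass through unchanged.
    static List<String> analyze(String text, Set<String> stemExclusionSet) {
        List<String> terms = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue; // empty input yields one empty token; skip it
            }
            boolean isKeyword = stemExclusionSet.contains(token);
            terms.add(isKeyword ? token : toyStem(token));
        }
        return terms;
    }

    public static void main(String[] args) {
        Set<String> none = new HashSet<>();
        Set<String> protectedWords = new HashSet<>(Arrays.asList("news"));

        System.out.println(analyze("News new", none));           // both terms collide
        System.out.println(analyze("News new", protectedWords)); // terms stay distinct
    }
}
```

Two things carry over to the real filter chain: the exclusion check runs after lowercasing, so the entries in the set should be lowercase, and the same analysis has to be applied at both index time and query time, otherwise the protected terms will not match.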

femtoRgon