
I'm trying to tokenize and stem a Portuguese sentence using Lucene 4.

Based on this thread (How to use a Lucene Analyzer to tokenize a String?), I was able to tokenize a Portuguese sentence correctly. However, no stemming was being applied. Reading the Lucene 4 documentation, I then found the class [BrazilianStemmer](https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/br/BrazilianStemmer.html).

I altered my code to use this BrazilianStemmer class (through BrazilianStemFilter):

    public static StringBuffer tokenizeString(StringBuffer text) {
        StringBuffer result = new StringBuffer();

        try {
            Analyzer analyzer = new PortugueseAnalyzer();

            TokenStream stream = analyzer.tokenStream(null, new StringReader(text.toString()));
            stream.reset();

            BrazilianStemFilter filter = new BrazilianStemFilter(stream);

            while (filter.incrementToken()) {
                result.append(filter.getAttribute(CharTermAttribute.class).toString());
                result.append(" ");
            }

            filter.close();
            analyzer.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return result;
    }

However, I'm not sure that it is working properly. Is this the right (and best) way to apply stemming with Lucene for languages other than English?

Community

1 Answer


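That is not the right way to do it: you are applying a stemmer twice, because PortugueseAnalyzer already uses a PortugueseLightStemFilter internally, as its source code shows. Wrapping its output in a BrazilianStemFilter therefore stems tokens that have already been stemmed.

It is better to create your own custom analyzer, like this: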
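To make the double-stemming problem concrete, here is a minimal sketch in plain Java with no Lucene dependency. The two stemmers are hypothetical toy rules invented for this illustration (they are not the algorithms behind PortugueseLightStemFilter or BrazilianStemFilter), and `DoubleStemmingSketch`, `LIGHT`, `AGGRESSIVE`, and `analyze` are made-up names. It only shows that chaining two stemming stages yields terms that neither stage produces on its own.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

class DoubleStemmingSketch {

    // Hypothetical "light" stage: strips a single plural "s". Toy rule, not Lucene's.
    static final UnaryOperator<String> LIGHT = t ->
            t.length() > 3 && t.endsWith("s") ? t.substring(0, t.length() - 1) : t;

    // Hypothetical second stage: strips one trailing vowel (a, e, o). Toy rule, not Lucene's.
    static final UnaryOperator<String> AGGRESSIVE = t ->
            t.length() > 3 && "aeo".indexOf(t.charAt(t.length() - 1)) >= 0
                    ? t.substring(0, t.length() - 1) : t;

    // Lower-cases the text, splits it on whitespace, and runs every token
    // through the given stages in order, like a chain of token filters.
    @SafeVarargs
    static List<String> analyze(String text, UnaryOperator<String>... stages) {
        List<String> out = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            for (UnaryOperator<String> stage : stages) {
                token = stage.apply(token);
            }
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        // One stemming stage only
        System.out.println(analyze("Meninos casas", LIGHT));
        // Two stages chained: the second one stems already-stemmed tokens
        System.out.println(analyze("Meninos casas", LIGHT, AGGRESSIVE));
    }
}
```

With these toy rules, the single-stage chain produces `menino` and `casa`, while the two-stage chain reduces them further to `menin` and `cas`. The real filters behave differently in detail, but the effect of stacking them is of the same kind.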

    Analyzer analyzer = new Analyzer() {
        @Override
        protected Analyzer.TokenStreamComponents createComponents(String fieldName, Reader reader) {
            // Tokenize, lower-case, and drop Portuguese stop words
            final Tokenizer source = new StandardTokenizer(reader);
            TokenStream result = new LowerCaseFilter(source);
            result = new StopFilter(result, PortugueseAnalyzer.getDefaultStopSet());
            // Apply exactly one stemmer to the filtered tokens
            result = new BrazilianStemFilter(result);
            return new TokenStreamComponents(source, result);
        }
    };
Alejandro