I'm trying to tokenize and stem a Portuguese sentence using Lucene 4.
Based on the thread "How to use a Lucene Analyzer to tokenize a String?", I was able to tokenize a Portuguese sentence correctly. However, no stemming was applied. Reading the Lucene 4 documentation, I found the class [BrazilianStemmer](https://lucene.apache.org/core/4_0_0/analyzers-common/org/apache/lucene/analysis/br/BrazilianStemmer.html).
I changed my code to use this stemmer through `BrazilianStemFilter`, the token filter that applies `BrazilianStemmer`:
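
    import java.io.IOException;
    import java.io.StringReader;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.br.BrazilianStemFilter;
    import org.apache.lucene.analysis.pt.PortugueseAnalyzer;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public static StringBuffer tokenizeString(StringBuffer text) {
        StringBuffer result = new StringBuffer();
        try {
            Analyzer analyzer = new PortugueseAnalyzer();
            TokenStream stream = analyzer.tokenStream(null, new StringReader(text.toString()));
            // Wrap the analyzer's output in the Brazilian stemming filter.
            BrazilianStemFilter filter = new BrazilianStemFilter(stream);
            // Fetch the term attribute once instead of on every token.
            CharTermAttribute term = filter.addAttribute(CharTermAttribute.class);
            // Reset the outermost stream; it resets the wrapped stream as well.
            filter.reset();
            while (filter.incrementToken()) {
                result.append(term.toString());
                result.append(" ");
            }
            filter.end();
            filter.close();
            analyzer.close();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
        return result;
    }
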
However, I'm not sure that it is working properly. Is this the right, and the best, way to apply stemming with Lucene for languages other than English?