Difference between Lucene stemmers: EnglishStemmer, PorterStemmer, LovinsStemmer

Question

Has anybody compared these stemmers from Lucene (package org.tartarus.snowball.ext): EnglishStemmer, PorterStemmer, LovinsStemmer? What are the strong and weak points of the algorithms behind them? When should each of them be used? Or are there other algorithms available for stemming English words?

Thanks.

Fred Foo · Accepted Answer · 2011-11-11T11:55:22.310

18

The Lovins stemmer is a very old algorithm that is not of much practical use, since the Porter stemmer is much stronger. Based on some quick skimming of the source code, it seems PorterStemmer implements Porter's original (1980) algorithm, while EnglishStemmer implements his updated version, which should be better.

A stronger stemming algorithm (actually a lemmatizer) is available in the Stanford NLP tools. A Lucene-Stanford NLP bridge by yours truly is available here (API docs).

See also Manning, Raghavan & Schütze for general info about stemming and lemmatization.

edited Nov 11 '11 at 11:55

answered Feb 21 '11 at 17:20


Your stemmer is GPL, so it doesn't plug into the Apache Lucene ecosystem very well. Would you consider changing the license of your stemmer from GPL to Apache 2 so that more people can use it? – RonC Apr 05 '21 at 13:31

Will High · Answer 2 · 2013-10-23T21:31:15.727

I've tested the 3 Lucene stemmers available from org.apache.lucene.analysis.en version 4.4.0, which are EnglishMinimalStemFilter, KStemFilter and PorterStemFilter, in a document classification problem I'm working on. My results corroborate the claims made by the authors of Introduction to Information Retrieval that for small training corpora in document classification settings stemming is harmful, and for large corpora stemming makes no difference.

For search and indexing, stemming can be more useful (see, e.g., Jenkins & Smith), but even there the answer to your question depends on the details of what you're doing. There is no free lunch!

At the end of the day, nothing beats empirical tests of real code on real data. The only way you'll really know which is better is by running the stemmers for yourself in your application.
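As a starting point for that kind of test, here is a minimal sketch of a harness that runs several stemmers over the same word list and reports only the words on which they disagree. The two stemmers in it (`stripPluralS`, `stripCommonSuffixes`) are hypothetical toy stand-ins written for this example, not real stemming algorithms and not part of Lucene. With Lucene on the classpath you would presumably replace each one with a small wrapper around a Snowball stemmer (call `setCurrent(word)`, then `stem()`, then `getCurrent()`).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.UnaryOperator;

public class StemmerComparison {

    // Hypothetical stand-in: strips a trailing "s". NOT a real stemming algorithm.
    public static String stripPluralS(String word) {
        return (word.length() > 3 && word.endsWith("s"))
                ? word.substring(0, word.length() - 1)
                : word;
    }

    // Hypothetical stand-in: strips the first matching suffix from a short list.
    public static String stripCommonSuffixes(String word) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (word.length() > suffix.length() + 2 && word.endsWith(suffix)) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // For every word on which the stemmers do not all agree,
    // maps the word to the stem produced by each named stemmer.
    public static Map<String, Map<String, String>> disagreements(
            Map<String, UnaryOperator<String>> stemmers, List<String> words) {
        Map<String, Map<String, String>> result = new LinkedHashMap<>();
        for (String word : words) {
            Map<String, String> stems = new LinkedHashMap<>();
            for (Map.Entry<String, UnaryOperator<String>> e : stemmers.entrySet()) {
                stems.put(e.getKey(), e.getValue().apply(word));
            }
            // More than one distinct stem means the stemmers disagree on this word.
            if (new HashSet<>(stems.values()).size() > 1) {
                result.put(word, stems);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, UnaryOperator<String>> stemmers = new LinkedHashMap<>();
        stemmers.put("pluralS", StemmerComparison::stripPluralS);
        stemmers.put("suffixes", StemmerComparison::stripCommonSuffixes);

        // Replace this list with vocabulary drawn from your own corpus.
        List<String> words = Arrays.asList("cats", "running", "tested", "index");

        disagreements(stemmers, words)
                .forEach((word, stems) -> System.out.println(word + " -> " + stems));
    }
}
```

Looking only at the disagreements keeps the output small enough to read by hand. For a classification or retrieval task, the more telling comparison is still the end-to-end metric (accuracy, precision/recall) obtained with each stemmer plugged into the full pipeline.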
