Should I perform both lemmatization and stemming?

Question

I'm writing a text classification system in Python. This is what I'm doing to canonicalize each token:

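from nltk.stem import PorterStemmer, WordNetLemmatizer

lem, stem = WordNetLemmatizer(), PorterStemmer()
for doc in corpus:  # corpus: an iterable of tokenized documents
    for word in doc:
        # lemmatize first, then stem the resulting lemma
        lemma = stem.stem(lem.lemmatize(word))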

The reason I don't want to just lemmatize is that I noticed WordNetLemmatizer wasn't handling some common inflections. In the case of -ing forms, for example, lem.lemmatize('walking') returns 'walking'.

Is it wise to perform both stemming and lemmatization? Or is it redundant? Do researchers typically do one or the other, and not both?
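
One way to check the redundancy question empirically is to run both pipelines over your own vocabulary and list the words where they disagree. This is a minimal sketch, not a definitive test: compare_pipelines and nltk_demo are hypothetical helper names, the stemmer and lemmatizer are passed in as plain callables, and nltk_demo assumes NLTK and its WordNet data are installed.

```python
def compare_pipelines(words, stem, lemmatize):
    """Return {word: (stem_only, lemmatize_then_stem)} for words where the two differ."""
    differing = {}
    for word in words:
        stem_only = stem(word)
        both = stem(lemmatize(word))
        if stem_only != both:
            differing[word] = (stem_only, both)
    return differing


def nltk_demo(words):
    """Hypothetical helper: run the comparison with NLTK's Porter stemmer and WordNet lemmatizer."""
    # Imported lazily so the comparison function itself has no NLTK dependency.
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    return compare_pipelines(words, PorterStemmer().stem, WordNetLemmatizer().lemmatize)
```

If the returned dictionary is empty for your vocabulary, the extra lemmatization step is not changing anything. Any differences would be expected mainly for irregular forms that the lemmatizer maps to a different word before stemming.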

Take a look at https://www.kaggle.com/alvations/basic-nlp-with-nltk ("Stemming and Lemmatization" section) — alvas, Mar 19 '18 at 05:27
Duplicate of https://stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers. — alvas, Mar 19 '18 at 05:34

score 6 · Answer 1 · answered Oct 02 '18 at 06:41

From my point of view, doing both stemming and lemmatization rather than only one will make only slight differences, but I recommend using just stemming, because lemmatization sometimes needs the 'pos' argument to work more precisely.

For example, if you want to lemmatize "better", you should explicitly indicate pos: print(lemmatizer.lemmatize("better", pos="a"))

If pos is not supplied, the default is "n" (noun).
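
If you already have part-of-speech tags for your tokens, you can pass them through to the lemmatizer instead of relying on the noun default. The following is a rough sketch under some assumptions: the tags are Penn Treebank style (what nltk.pos_tag produces for English by default), and penn_to_wordnet and lemmatize_tagged are hypothetical helper names, not part of NLTK.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter, falling back to noun."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"  # noun, which is also the lemmatizer's own default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, tag) pairs with any object exposing lemmatize(word, pos=...)."""
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]
```

With NLTK installed, lemmatize_tagged(nltk.pos_tag(tokens), WordNetLemmatizer()) would be the intended call. The tagger and the WordNet data both need to be downloaded first.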

score 2 · Answer 2 · answered Mar 19 '18 at 10:10 (edited 10:17) · KonstantinosKokos

The lemmatization of walking is ambiguous. Walking, when used as an adjective, is its own base form (rather than walk).

Correction: Research has shown that stemming generally outperforms lemmatization in information retrieval (IR) tasks. A qualitative comparison between the two and an explanation can be found here.


May I ask what "IR" tasks are? – theEconCsEngineer Aug 24 '21 at 03:37
@theEconCsEngineer [Information retrieval](https://en.wikipedia.org/wiki/Information_retrieval). – Mew Mar 03 '22 at 13:06

score 1 · Answer 3 · answered Mar 21 '18 at 03:22

I think stemming a lemmatized word is redundant if you get the same result as just stemming it (which is the result I expect). Nevertheless, the decision between a stemmer and a lemmatizer depends on your needs. My intuition is that stemming increases recall and lowers precision, and the opposite holds for lemmatization. Consider these scores: which one matters for your specific problem? Another option is to calculate the F1 score, which is the harmonic mean of precision and recall.
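
To make the trade-off concrete, here is a small sketch of how the three scores relate, computed from raw counts for one class. The function name precision_recall_f1 is just an illustration, and returning 0.0 when a denominator is zero is an assumption on my part (libraries differ on how they treat that case).

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Compute precision, recall and their harmonic mean (F1) from raw counts."""
    predicted_positive = true_positives + false_positives
    actual_positive = true_positives + false_negatives

    # Assumption: report 0.0 instead of raising when a denominator is zero.
    precision = true_positives / predicted_positive if predicted_positive else 0.0
    recall = true_positives / actual_positive if actual_positive else 0.0

    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

You could compute these for the same classifier trained once on stemmed tokens and once on lemmatized tokens, then compare whichever score matters most for your problem.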
