I'm writing a text classification system in Python. This is what I'm doing to canonicalize each token:
from nltk.stem import WordNetLemmatizer, PorterStemmer

lem, stem = WordNetLemmatizer(), PorterStemmer()
for doc in corpus:
    for word in doc:
        lemma = stem.stem(lem.lemmatize(word))
The reason I don't want to just lemmatize is that WordNetLemmatizer wasn't handling some common inflections. For example, lem.lemmatize('walking') returns walking, because the lemmatizer treats every word as a noun unless told otherwise, so verb forms like this one pass through unchanged.
Is it wise to perform both stemming and lemmatization? Or is it redundant? Do researchers typically do one or the other, and not both?