Why do NLTK's stemmers identify different stems for Identification and Identifier? For Identification, both the Snowball and Porter stemmers return identif, but for Identifier, I get identifi. Are there any other stemmers that would be more inclusive of the different forms of a word?
-
See https://stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers/28954002 – alvas Jun 19 '18 at 16:19
-
Also, the output that a stemmer/lemmatizer should give is usually subjective w.r.t. the task you're performing. In fact, if you use a proper WordNet lemmatizer, it'll recognize `identifier` as one lemma and `identification` as a different lemma, because both appear in WordNet and belong to different synsets. – alvas Jun 19 '18 at 16:21
1 Answer
The output that a stemmer/lemmatizer should give is usually subjective w.r.t. the task you're performing. In fact, if you use a proper WordNet lemmatizer, it'll recognize `identifier` as one lemma and `identification` as a different lemma, because both appear in WordNet and belong to different synsets.
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet as wn
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('identifier')
'identifier'
>>> wnl.lemmatize('identification')
'identification'
>>> wn.synsets('identification')
[Synset('designation.n.03'), Synset('identification.n.02'), Synset('identification.n.03'), Synset('recognition.n.02'), Synset('identification.n.05')]
>>> wn.synsets('identifier')
[Synset('identifier.n.01')]
The questions to ask are:
What are you going to use the stems/lemmas for? Which task?
Do you have an inventory of senses/concepts with the lemmas that you should follow in your task?
What is your expected behavior for the stemmer/lemmatizer and why do you think it is so? Would it really matter if stemming or lemmatization is an intermediate task?
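To check the expected behavior on your own vocabulary before committing to a tool, it may help to run several stemmers side by side. The following is only a minimal sketch: the helper name `compare_stems` is made up for illustration, and `LancasterStemmer` is included simply as another stemmer that ships with NLTK, not as a recommendation that it will merge the two forms.

```python
from nltk.stem import LancasterStemmer, PorterStemmer, SnowballStemmer


def compare_stems(words):
    """Return {word: {stemmer_name: stem}} for a few NLTK stemmers.

    `compare_stems` is a hypothetical helper, not part of NLTK.
    """
    stemmers = {
        "porter": PorterStemmer(),
        "snowball": SnowballStemmer("english"),
        "lancaster": LancasterStemmer(),
    }
    return {
        word: {name: stemmer.stem(word) for name, stemmer in stemmers.items()}
        for word in words
    }


if __name__ == "__main__":
    # Print each word with the stem every stemmer produces for it.
    for word, stems in compare_stems(["identification", "identifier"]).items():
        print(word, stems)
```

If the forms that matter for your task end up with different stems under every stemmer, that is a hint that a task-specific mapping or a lemma inventory may serve you better than swapping stemmers.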

alvas