How to get stemmers to recognize Identification and Identifier similarly?

Question

Why do NLTK's stemmers produce different stems for Identification and Identifier? For Identification, both the Snowball and Porter stemmers return identif, but for Identifier I get identifi. Are there any other stemmers that are more inclusive of the different forms of a word?

See https://stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers/28954002 — alvas, Jun 19 '18 at 16:19
Also, the output that a stemmer/lemmatizer should give is usually subjective with respect to the task you're performing. In fact, if you use a proper WordNet lemmatizer, it'll recognize `identifier` as a lemma and `identification` as a different lemma because both appear in WordNet and belong to different synsets. — alvas, Jun 19 '18 at 16:21

score 0 · Answer 1 · answered Jun 19 '18 at 16:25

The output that a stemmer/lemmatizer should give is usually subjective with respect to the task you're performing. In fact, if you use a proper WordNet lemmatizer, it'll recognize identifier as a lemma and identification as a different lemma because both appear in WordNet and belong to different synsets.

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet as wn

>>> wnl = WordNetLemmatizer()


>>> wnl.lemmatize('identifier')
'identifier'
>>> wnl.lemmatize('identification')
'identification'

>>> wn.synsets('identification')
[Synset('designation.n.03'), Synset('identification.n.02'), Synset('identification.n.03'), Synset('recognition.n.02'), Synset('identification.n.05')]

>>> wn.synsets('identifier')
[Synset('identifier.n.01')]

The questions to ask are:

What are you going to use the stems/lemmas for? Which task?
Do you have an inventory of senses/concepts, with their lemmas, that you should follow in your task?
What behavior do you expect from the stemmer/lemmatizer, and why do you expect it? Would it really matter if stemming or lemmatization is only an intermediate step?
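
If the answer to the last question is that your task really does need both words to land on the same key, one option is to post-process whatever the stemmer returns with a small, task-specific override table. The following is only a minimal sketch under stated assumptions: `make_normalizer`, `toy_stem`, and the override table are hypothetical names invented for illustration, not part of NLTK, and `toy_stem` is a stand-in that reproduces only the two outputs reported in the question.

```python
from typing import Callable, Dict


def make_normalizer(
    stem: Callable[[str], str],
    overrides: Dict[str, str],
) -> Callable[[str], str]:
    """Wrap a stemming function so that selected stems map to one canonical key.

    `stem` can be any callable that takes a word and returns its stem.
    `overrides` maps a stem to the canonical form it should be merged into.
    """
    def normalize(word: str) -> str:
        stemmed = stem(word.lower())
        # Fall back to the stemmer's own output when no override is defined.
        return overrides.get(stemmed, stemmed)

    return normalize


def toy_stem(word: str) -> str:
    """Hypothetical stand-in stemmer.

    It only reproduces the two outputs reported in the question and
    returns every other word unchanged.
    """
    known = {"identification": "identif", "identifier": "identifi"}
    return known.get(word, word)


# Task-specific decision (an assumption): treat "identifi" as "identif".
normalize = make_normalizer(toy_stem, {"identifi": "identif"})

print(normalize("Identification"))  # identif
print(normalize("Identifier"))      # identif
```

With NLTK, you would pass a real stemmer's method instead of the stand-in, for example `PorterStemmer().stem`. The override table encodes a decision about your task, so it only makes sense once the questions above have been answered.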
