
Why do NLTK's stemmers return different stems for "identification" and "identifier"? For "identification", both the Snowball and Porter stemmers return identif, but for "identifier" I get identifi. Are there any other stemmers that are more inclusive of different forms of a word?

  • See https://stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers/28954002 – alvas Jun 19 '18 at 16:19
  • Also, the output that a stemmer/lemmatizer should give is usually subjective with respect to the task you're performing. In fact, if you use a proper WordNet lemmatizer, it'll recognize `identifier` as a lemma and `identification` as a different lemma because they appear in WordNet and belong to different synsets. – alvas Jun 19 '18 at 16:21

1 Answer


The output that a stemmer/lemmatizer should give is usually subjective with respect to the task you're performing. In fact, if you use a proper WordNet lemmatizer, it'll recognize identifier as one lemma and identification as a different lemma, because both appear in WordNet and belong to different synsets.

>>> from nltk.stem import WordNetLemmatizer
>>> from nltk.corpus import wordnet as wn

>>> wnl = WordNetLemmatizer()


>>> wnl.lemmatize('identifier')
'identifier'
>>> wnl.lemmatize('identification')
'identification'

>>> wn.synsets('identification')
[Synset('designation.n.03'), Synset('identification.n.02'), Synset('identification.n.03'), Synset('recognition.n.02'), Synset('identification.n.05')]

>>> wn.synsets('identifier')
[Synset('identifier.n.01')]
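
For comparison, here is a minimal sketch that reproduces the stemmer behavior described in the question. It assumes NLTK is installed; the variable names (`porter`, `snowball`, `stems`) are only illustrative, and the stems shown in the comments are the ones reported in the question, so they may differ across NLTK versions.

```python
from nltk.stem import PorterStemmer, SnowballStemmer

# Both stemmers are rule-based suffix strippers; neither needs corpus data.
porter = PorterStemmer()
snowball = SnowballStemmer("english")

words = ["identification", "identifier"]

# Map each word to its (Porter stem, Snowball stem) pair.
stems = {w: (porter.stem(w), snowball.stem(w)) for w in words}

for word, (p, s) in stems.items():
    print(word, p, s)
# As reported in the question:
# identification identif identif
# identifier identifi identifi
```

The two words lose different suffixes (roughly -ication versus -er), so the stems differ by the trailing i. A stemmer only applies suffix rules and has no notion that the two words are related, which is why the lemmatizer above also keeps them apart.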

The questions to ask are:

  • What are you going to use the stems/lemmas for? Which task?

  • Do you have an inventory of senses/concepts, with their lemmas, that you should follow in your task?

  • What behavior do you expect from the stemmer/lemmatizer, and why? Would it really matter if stemming or lemmatization is only an intermediate step?

alvas