I've tried the WordNet lemmatizer, but I found that some common words like 'studying' or 'waiting' are not processed appropriately.
Am I missing something?
As you can see on the online wordnet, studying and waiting are also nouns (as well as gerunds of verbs) and so it's not surprising that they can get lemmatized as themselves.
If that's unsatisfactory, you need to find a more "aggressive" lemmatizer (one that deliberately ignores perfectly correct but "less likely" interpretations of a word), or, if you can first perform part-of-speech tagging based on whole sentences, use a lemmatizer that can be told whether, e.g., a given instance of studying is a verb rather than a noun.
By default, the WordNetLemmatizer in NLTK assumes that the word is a noun. See http://nltk.org/_modules/nltk/stem/wordnet.html
To correctly lemmatize verbs, you have to specify the pos (part of speech):
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('studying','v')
'study'
>>> wnl.lemmatize('studying','n')
'studying'
>>> wnl.lemmatize('studying')
'studying'
>>> wnl.lemmatize('waiting','n')
'waiting'
>>> wnl.lemmatize('waiting','v')
'wait'
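
If you don't know the part of speech in advance, one option is to tag the whole sentence first and pass each word's tag to the lemmatizer. The following is only a rough sketch: penn_to_wordnet and lemmatize_sentence are hypothetical helper names, not part of NLTK. It assumes that NLTK's default tagger (nltk.pos_tag) returns Penn Treebank tags, and that the tokenizer, tagger and WordNet data have already been fetched with nltk.download().

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to the single-letter pos that
    WordNetLemmatizer.lemmatize() accepts (hypothetical helper)."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # fall back to noun, the lemmatizer's own default


def lemmatize_sentence(sentence):
    """Tokenize, tag, then lemmatize each word with its tagged pos."""
    # Imported here so the mapping above can be used without NLTK installed.
    import nltk
    from nltk.stem.wordnet import WordNetLemmatizer

    wnl = WordNetLemmatizer()
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    return [wnl.lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged]
```

Calling lemmatize_sentence('He is studying and waiting') should then lemmatize any word the tagger marks as a verb using pos 'v'. The result depends on the tagger, so a gerund that gets tagged as a noun will still come back unchanged.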