0

I've tried the WordNet lemmatizer, but I found that some common words like 'studying' or 'waiting' are not processed appropriately.

Am I missing something?

goh

2 Answers

3

As you can see in the online WordNet, 'studying' and 'waiting' are also nouns (as well as gerunds of verbs), so it's not surprising that they get lemmatized as themselves.

If that's unsatisfactory, you need to either find a more "aggressive" lemmatizer (one that deliberately ignores perfectly correct but "less likely" interpretations of a word), or, if you can first perform part-of-speech tagging on whole sentences, use a lemmatizer that can be told whether, e.g., a given instance of 'studying' is a verb rather than a noun.
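
A minimal sketch of the second option, assuming the tagger emits Penn Treebank tags (as NLTK's `nltk.pos_tag` does). The helper names `penn_to_wordnet` and `lemmatize_tagged` are made up for illustration, and the lemmatizer is passed in as a plain callable so the mapping logic does not depend on any particular library:

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith('JJ'):
        return 'a'  # adjective
    if tag.startswith('VB'):
        return 'v'  # verb
    if tag.startswith('RB'):
        return 'r'  # adverb
    return 'n'      # everything else falls back to noun


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs.

    `lemmatize` is any callable taking (word, pos); this is an assumption
    made here so the sketch can be tried without a specific lemmatizer.
    """
    return [lemmatize(word, penn_to_wordnet(tag)) for word, tag in tagged_tokens]
```

With NLTK and its tagger and WordNet data installed, the intended call would be something like `lemmatize_tagged(nltk.pos_tag(tokens), WordNetLemmatizer().lemmatize)`. Falling back to noun for unknown tags is a design choice, not a requirement.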

Alex Martelli
  • Hmm, is it more sensible to use a more aggressive one, as you mentioned, like the Porter stemmer, or to do POS tagging first? I'm worried about performance because there are quite a number of chunks of text I need to handle. – goh Jun 09 '10 at 01:29
  • @goh, POS-tagging is not fast, but it IS going to be more accurate -- you probably don't want to see the stem "awn" for "awning", I suspect. But will you always have the words in the context of a well-formed sentence, or do you need to deal with them in isolation sometimes? If the latter, then the aggressive stemmer is what's left...:-(. – Alex Martelli Jun 09 '10 at 03:28
  • Actually, I'm doing classification on blogs. I need to infer from the blog content whether the authors are from my school. I have a couple of blogs that I can start crawling from; the rest would be classified. I'm doing HTML stripping, then word tokenising, followed by POS tagging, filtering out everything except the nouns, and lemmatising them. The features for the classifier would be the nouns, I guess. Is that a good approach? – goh Jun 09 '10 at 06:40
  • @goh, worth a try (the problem is, after all, **extremely** hard) -- but if you're POS tagging anyway to get the nouns, then -- why is keeping e.g. the noun `waiting` as its own stem (which as a noun it is) at all a problem?! – Alex Martelli Jun 09 '10 at 14:34
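
The noun-only step discussed in these comments could look roughly like the sketch below. It assumes the tokens are already tagged with Penn Treebank tags (noun tags start with `NN`); `noun_lemmas` is a hypothetical helper name, and the lemmatizer is again passed in as a callable taking (word, pos):

```python
def noun_lemmas(tagged_tokens, lemmatize):
    """Keep only tokens tagged as nouns (NN, NNS, NNP, NNPS) and lemmatize them as nouns."""
    return [
        lemmatize(word, 'n')
        for word, tag in tagged_tokens
        if tag.startswith('NN')
    ]
```

Because everything that survives the filter is lemmatized as a noun, a token like 'waiting' tagged `NN` staying as 'waiting' is the expected result rather than an error.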
2

By default, the WordNetLemmatizer in NLTK assumes that the word is a noun. See http://nltk.org/_modules/nltk/stem/wordnet.html

To lemmatize verbs correctly, you have to specify the pos (part of speech):

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> wnl.lemmatize('studying','v')
'study'
>>> wnl.lemmatize('studying','n')
'studying'
>>> wnl.lemmatize('studying')
'studying'
>>> wnl.lemmatize('waiting','n')
'waiting'
>>> wnl.lemmatize('waiting','v')
'wait'
alvas