
I'm trying to lemmatize a string according to each word's part of speech, but at the final stage I'm getting an error. My code:

import nltk
from nltk.stem import *
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import wordnet
wordnet_lemmatizer = WordNetLemmatizer()
text = word_tokenize('People who help the blinging lights are the way of the future and are heading properly to their goals')
tagged = nltk.pos_tag(text)

def get_wordnet_pos(treebank_tag):

    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    elif treebank_tag.startswith('V'):
        return wordnet.VERB
    elif treebank_tag.startswith('N'):
        return wordnet.NOUN
    elif treebank_tag.startswith('R'):
        return wordnet.ADV
    else:
        return ''

for word in tagged: print(wordnet_lemmatizer.lemmatize(word,pos='v'), end=" ")
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-40-afb22c78f770> in <module>()
----> 1 for word in tagged: print(wordnet_lemmatizer.lemmatize(word,pos='v'), end=" ")

E:\Miniconda3\envs\uol1\lib\site-packages\nltk\stem\wordnet.py in lemmatize(self, word, pos)
     38 
     39     def lemmatize(self, word, pos=NOUN):
---> 40         lemmas = wordnet._morphy(word, pos)
     41         return min(lemmas, key=len) if lemmas else word
     42 

E:\Miniconda3\envs\uol1\lib\site-packages\nltk\corpus\reader\wordnet.py in _morphy(self, form, pos)
   1710 
   1711         # 1. Apply rules once to the input to get y1, y2, y3, etc.
-> 1712         forms = apply_rules([form])
   1713 
   1714         # 2. Return all that are in the database (and check the original too)

E:\Miniconda3\envs\uol1\lib\site-packages\nltk\corpus\reader\wordnet.py in apply_rules(forms)
   1690         def apply_rules(forms):
   1691             return [form[:-len(old)] + new
-> 1692                     for form in forms
   1693                     for old, new in substitutions
   1694                     if form.endswith(old)]

E:\Miniconda3\envs\uol1\lib\site-packages\nltk\corpus\reader\wordnet.py in <listcomp>(.0)
   1692                     for form in forms
   1693                     for old, new in substitutions
-> 1694                     if form.endswith(old)]
   1695 
   1696         def filter_forms(forms):

I want to be able to lemmatize that string based on each word's part of speech all at once. Please help.

  • I don't quite understand your approach: you want to lemmatize words, after checking for their POS to make sure you get the right lemma, is that it? If so, can you give an expected input & output? Also, what is the point of `get_wordnet_pos()` - I don't see it used anywhere? – patrick Jan 27 '17 at 14:40
  • Take a look at https://gist.github.com/alvations/07758d02412d928414bb – alvas Feb 02 '17 at 21:12

1 Answer

Firstly, try not to mix a top-level import, a wildcard import and named imports from the same package like this:

import nltk
from nltk.stem import *
from nltk import pos_tag, word_tokenize

This would be better:

from nltk import sent_tokenize, word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet as wn

(See Absolute vs. explicit relative import of Python module)

The error you're getting is most probably because you are passing the output of pos_tag, which is a (word, tag) tuple, straight into WordNetLemmatizer.lemmatize(), i.e.:

>>> from nltk import pos_tag
>>> from nltk.stem import WordNetLemmatizer

>>> wnl = WordNetLemmatizer()
>>> sent = 'People who help the blinging lights are the way of the future and are heading properly to their goals'.split()

>>> pos_tag(sent)
[('People', 'NNS'), ('who', 'WP'), ('help', 'VBP'), ('the', 'DT'), ('blinging', 'NN'), ('lights', 'NNS'), ('are', 'VBP'), ('the', 'DT'), ('way', 'NN'), ('of', 'IN'), ('the', 'DT'), ('future', 'NN'), ('and', 'CC'), ('are', 'VBP'), ('heading', 'VBG'), ('properly', 'RB'), ('to', 'TO'), ('their', 'PRP$'), ('goals', 'NNS')]
>>> pos_tag(sent)[0]
('People', 'NNS')

>>> first_word = pos_tag(sent)[0]
>>> wnl.lemmatize(first_word)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/nltk/stem/wordnet.py", line 40, in lemmatize
    lemmas = wordnet._morphy(word, pos)
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1712, in _morphy
    forms = apply_rules([form])
  File "/usr/local/lib/python2.7/dist-packages/nltk/corpus/reader/wordnet.py", line 1694, in apply_rules
    if form.endswith(old)]
AttributeError: 'tuple' object has no attribute 'endswith'
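
The same failure can be reproduced without NLTK at all, which may make the cause easier to see. The snippet below is only a minimal sketch: the `tagged` list is hand-written to mimic what pos_tag returns, and the `.endswith()` call stands in for the string method that the traceback shows being called inside `apply_rules`.

```python
# Hand-written stand-in for the output of pos_tag (an assumption for illustration).
tagged = [('People', 'NNS'), ('lights', 'NNS')]

pair = tagged[0]

# Calling a str method on the whole (word, tag) tuple fails,
# just like the call inside the lemmatizer does.
try:
    pair.endswith('s')
    message = None
except AttributeError as exc:
    message = str(exc)

print(message)

# Unpacking the pair gives a plain str, which does have .endswith().
word, tag = pair
print(word, tag, word.endswith('e'))
```

So the fix is to unpack each pair and hand only the word to the lemmatizer, using the tag to choose the `pos` argument.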

The input to WordNetLemmatizer.lemmatize() should be a str, not a tuple, so unpack each (word, tag) pair and pass only the word:

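>>> from nltk.corpus import wordnet as wn
>>> tagged_sent = pos_tag(sent)

>>> def penn2morphy(penntag, returnNone=False):
...     # Map the first two characters of a Penn Treebank tag to a WordNet POS.
...     morphy_tag = {'NN':wn.NOUN, 'JJ':wn.ADJ,
...                   'VB':wn.VERB, 'RB':wn.ADV}
...     try:
...         return morphy_tag[penntag[:2]]
...     except KeyError:
...         # Tags with no WordNet equivalent (DT, IN, PRP$, ...) end up here.
...         return None if returnNone else ''
... 

>>> for word, tag in tagged_sent:
...     wntag = penn2morphy(tag)
...     if wntag:
...         print(wnl.lemmatize(word, pos=wntag))
...     else:
...         print(word)
... 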
People
who
help
the
blinging
light
be
the
way
of
the
future
and
be
head
properly
to
their
goal
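
To lemmatize the whole sentence "all at once", the loop above could be wrapped in a small helper. This is only a sketch under a few assumptions: `penn_to_wordnet` and `lemmatize_tagged` are hypothetical names, the one-letter codes are the values of wn.NOUN, wn.ADJ, wn.VERB and wn.ADV, and the lemmatizer is passed in as a callable so the helper itself does not need NLTK or the WordNet data to be importable.

```python
# One-letter WordNet POS codes, keyed by the first two characters of a Penn tag.
# These are the values of wn.NOUN, wn.ADJ, wn.VERB and wn.ADV in nltk.corpus.wordnet.
PENN_TO_WORDNET = {'NN': 'n', 'JJ': 'a', 'VB': 'v', 'RB': 'r'}


def penn_to_wordnet(penn_tag):
    """Return the WordNet POS code for a Penn Treebank tag, or None if there is none."""
    return PENN_TO_WORDNET.get(penn_tag[:2])


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs using any callable of the form lemmatize(word, pos=...)."""
    lemmas = []
    for word, tag in tagged_tokens:
        wn_pos = penn_to_wordnet(tag)
        # Words whose tag has no WordNet equivalent are kept unchanged.
        lemmas.append(lemmatize(word, pos=wn_pos) if wn_pos else word)
    return lemmas


# With NLTK and its WordNet data installed, this would be wired up roughly as:
#   from nltk import pos_tag, word_tokenize
#   from nltk.stem import WordNetLemmatizer
#   wnl = WordNetLemmatizer()
#   lemmas = lemmatize_tagged(pos_tag(word_tokenize(text)), wnl.lemmatize)
#   print(" ".join(lemmas))
```

Passing the lemmatizer in as an argument also makes it easy to swap in a different lemmatizer or a stub when testing.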

Or if you like an easy way out:

pip install pywsd

Then:

>>> from pywsd.utils import lemmatize, lemmatize_sentence
>>> sent = 'People who help the blinging lights are the way of the future and are heading properly to their goals'
>>> lemmatize_sentence(sent)
['people', 'who', 'help', 'the', u'bling', u'light', u'be', 'the', 'way', 'of', 'the', 'future', 'and', u'be', u'head', 'properly', 'to', 'their', u'goal']
alvas