
I've built a plaintext corpus, and the next step is to lemmatize all my texts. I'm using the WordNetLemmatizer and need the POS tag for each token to avoid the problem that, e.g., loving -> lemma = loving and love -> lemma = love...


The default WordNetLemmatizer POS tag is n (= noun), I think, but how can I use the tags from pos_tag? I think the POS tags that WordNetLemmatizer expects are different from the ones pos_tag returns. Is there a function or something else that can help me?

I think word_pos is wrong in this line, and that this is the cause of the error:

lemma = wordnet_lemmatizer.lemmatize(word,word_pos)

import nltk
from nltk.corpus import PlaintextCorpusReader
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

corpus_root = 'C:\\Users\\myname\\Desktop\\TestCorpus'
lyrics = PlaintextCorpusReader(corpus_root,'.*')

for fileid in lyrics.fileids():
    tokens = word_tokenize(lyrics.raw(fileid))
    tagged_tokens = pos_tag(tokens)
    for tagged_token in tagged_tokens:
        word = tagged_token[0]
        word_pos = tagged_token[1]
        print(tagged_token[0])
        print(tagged_token[1])
        lemma = wordnet_lemmatizer.lemmatize(word,pos=word_pos)
        print(lemma)

Additional question: Is pos_tag good enough for my lemmatization, or do I need another tagger? My texts are lyrics...

ma-jo-ne
  • I think your diagnosis is right; the nltk has gotten a new POS tagger, but the solution shown [here](http://stackoverflow.com/q/15586721/699305) should still work. If you're ok with the accuracy of `nltk.pos_tag()` (and if your text isn't so strange that it needs a custom tagger), you don't need to mess with installing another tagger. – alexis Mar 08 '16 at 19:44
  • (Shameless plug ;P) Try this: https://gist.github.com/alvations/07758d02412d928414bb ? It's a code snippet from https://github.com/alvations/pywsd – alvas Mar 08 '16 at 20:23
  • Maybe this is appropriate too https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L100 – alvas Mar 08 '16 at 20:25
  • This answer has a nice one-liner: https://stackoverflow.com/a/25544239/4549682 ; it is `wnpos = lambda e: ('a' if e[0].lower() == 'j' else e[0].lower()) if e[0].lower() in ['n', 'r', 'v', 'j'] else 'n'` (note that 'j' must be in the list, otherwise adjectives fall through to 'n'). Then use wnpos(nltk_pos) to get the pos you can feed to .lemmatize() – wordsforthewise Dec 10 '17 at 00:12

1 Answer


You need to convert the tag from pos_tag to one of the four "syntactic categories" that WordNet recognizes, then pass that to the lemmatizer as word_pos. pos_tag returns Penn Treebank tags such as VBG or NNS, which are not among the values WordNet expects.

From the docs:

Syntactic category: n for noun files, v for verb files, a for adjective files, r for adverb files.
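
A minimal sketch of that conversion, assuming the tags come from the default nltk.pos_tag (Penn Treebank tagset). The helper names penn_to_wordnet and lemmatize_tagged are made up for illustration and are not part of NLTK; falling back to 'n' for every other tag is also an assumption, chosen because it matches the lemmatizer's default.

```python
def penn_to_wordnet(tag, default="n"):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):   # JJ, JJR, JJS -> adjective
        return "a"
    if tag.startswith("VB"):   # VB, VBD, VBG, VBN, VBP, VBZ -> verb
        return "v"
    if tag.startswith("NN"):   # NN, NNS, NNP, NNPS -> noun
        return "n"
    if tag.startswith("RB"):   # RB, RBR, RBS -> adverb
        return "r"
    # Everything else (determiners, pronouns, punctuation, ...) has no
    # WordNet category; fall back to the default (assumption: noun).
    return default


def lemmatize_tagged(tagged_tokens, lemmatizer):
    """Lemmatize (word, penn_tag) pairs with any object exposing
    lemmatize(word, pos=...), e.g. nltk.stem.WordNetLemmatizer()."""
    return [
        lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag))
        for word, tag in tagged_tokens
    ]
```

In the loop from the question this would be used as lemmas = lemmatize_tagged(pos_tag(tokens), wordnet_lemmatizer), so that the lemmatizer only ever receives n, v, a or r.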

Brendon McKeon