
I am using NLTK's default tagger to get the POS tag of a word, but I am not getting the expected results:

>>> nltk.pos_tag(nltk.tokenize.word_tokenize("I want a watch"))
[('I', 'PRP'), ('want', 'VBP'), ('a', 'DT'), ('watch', 'NN')]
>>> nltk.pos_tag(nltk.tokenize.word_tokenize("Lets watch a movie"))
[('Lets', 'NNS'), ('watch', 'VBP'), ('a', 'DT'), ('movie', 'NN')]

As you can see above, the pos_tag function correctly tags the word watch as a noun in the first sentence and as a verb in the second. But in the case below:

>>> nltk.pos_tag(nltk.tokenize.word_tokenize("I want to read a book"))
[('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('read', 'VB'), ('a', 'DT'), ('book', 'NN')]

>>> nltk.pos_tag(nltk.tokenize.word_tokenize("I want to book a ticket"))
[('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('book', 'NN'), ('a', 'DT'), ('ticket', 'NN')]

It incorrectly tags the word book as a noun in the second sentence, where it is a verb. I know we can build a custom tagger, but I would prefer not to build a tagger from scratch just for one word. I am looking to improve the accuracy of the tagger for the word book. I referred to this answer, but the latest version doesn't seem to have nltk.tag._POS_TAGGER.

Is there any possible workaround for this?

TerminalWitchcraft
  • I have it correctly tagged as `'VB'` on my machine (NLTK3). Check this [Python NLTK pos_tag not returning the correct part-of-speech tag](http://stackoverflow.com/questions/30821188/python-nltk-pos-tag-not-returning-the-correct-part-of-speech-tag) – Moses Koledoye Jul 15 '16 at 13:25
  • @MosesKoledoye I agree with your comment that NLTK is not perfect. But I want to modify the existing algorithm's weights so that it assigns the correct tag to the word "book" – TerminalWitchcraft Jul 15 '16 at 13:29
  • FWIW, the Stanford POS tagger (while slower) has provided me with much much better results. The default tagger isn't even able to process "The quick brown fox jumped over the lazy dog" correctly. – Athena Jul 15 '16 at 15:44
  • @Ares Thanks for your insight. I will give a try to Stanford tagger as well!! – TerminalWitchcraft Jul 16 '16 at 04:18

1 Answer


nltk.pos_tag uses the PerceptronTagger by default, but you can use other taggers that were trained on different datasets.

In the following case, the treebank pos tagger was used:

import nltk

# Assumes the model data is installed, e.g. via nltk.download('maxent_treebank_pos_tagger')
tagger = nltk.data.load('taggers/maxent_treebank_pos_tagger/english.pickle')
t = tagger.tag(nltk.tokenize.word_tokenize("I want to book a ticket"))
print(t)
# [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('book', 'VB'), ('a', 'DT'), ('ticket', 'NN')]
#                                                         ^^ rightly tagged as verb

If this tagger still doesn't give the desired results, you can switch to a different one.
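If switching taggers is more than you want for a single word, another option is to post-correct the tagger's output with a small rule. The sketch below is only an illustration, not part of NLTK: `retag_after_to` is a hypothetical helper, and the rule (retag a listed word from NN to VB when it directly follows a TO token) is an assumption that fits the example sentence from the question.

```python
def retag_after_to(tagged, words=("book",)):
    """Return a copy of `tagged` in which any listed word tagged NN
    directly after a TO token is retagged as VB.

    Hypothetical helper for illustration; not an NLTK function.
    """
    fixed = []
    prev_tag = None
    for word, tag in tagged:
        if word.lower() in words and tag == "NN" and prev_tag == "TO":
            tag = "VB"
        fixed.append((word, tag))
        prev_tag = tag
    return fixed


# Output of the default tagger, copied from the question.
tagged = [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'),
          ('book', 'NN'), ('a', 'DT'), ('ticket', 'NN')]

print(retag_after_to(tagged))
# [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('book', 'VB'), ('a', 'DT'), ('ticket', 'NN')]
```

The rule is deliberately narrow. Penn Treebank tags the preposition "to" as TO as well, so a phrase like "went to book fairs" would be retagged wrongly, and it does nothing for sentences where the verb does not follow "to".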

You can also evaluate a tagger on a tagged corpus to get an idea of its expected accuracy:

>>> corpus = nltk.corpus.treebank.tagged_sents()
>>> tagger.evaluate(corpus)
0.9956891414041082
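A score this high is probably optimistic, since the treebank sample is likely part of the data this tagger was trained on. Held-out sentences give a fairer estimate. In recent NLTK releases the same method is, I believe, also available as `accuracy`. To make clear what the number means, here is a minimal sketch of token-level accuracy in plain Python; `token_accuracy` and `always_nn` are hypothetical names, and the two gold sentences are toy data taken from the question, not a real evaluation set.

```python
def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag.

    tag_fn takes a list of words and returns a list of (word, tag) pairs.
    gold_sents is a list of sentences, each a list of (word, tag) pairs.
    """
    correct = 0
    total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0


# Toy gold data (assumed correct tags for the sentences in the question).
gold = [
    [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'),
     ('book', 'VB'), ('a', 'DT'), ('ticket', 'NN')],
    [('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'),
     ('read', 'VB'), ('a', 'DT'), ('book', 'NN')],
]


def always_nn(words):
    """Trivial baseline that tags every word as NN."""
    return [(word, 'NN') for word in words]


print(token_accuracy(always_nn, gold))
```

With an NLTK tagger you could pass `tagger.tag` as `tag_fn` and a slice of `nltk.corpus.treebank.tagged_sents()` that the tagger has not seen during training.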
Moses Koledoye
  • Your answer seems fair enough. Is there any way to train an existing trained tagger? In your answer, can we further train the pre-trained treebank pos_tagger on another corpus? – TerminalWitchcraft Jul 16 '16 at 04:16
  • @HiteshPaul If this answer solved your problem, you may consider accepting it. There should be workarounds to improve the accuracy of taggers but that isn't part of the original question – Moses Koledoye Jul 16 '16 at 11:25
  • I tried the sentence "Hey, will you book a show for me please?" but it gives incorrect tags. Is there a way to further train a pre-trained tagger? – TerminalWitchcraft Jul 18 '16 at 06:12
  • @HiteshPaul I haven't found a way to retrain taggers, but you can keep trying other taggers until you get good results. Check out the [Stanford POS tagger](http://www.nltk.org/_modules/nltk/tag/stanford.html) – Moses Koledoye Jul 18 '16 at 09:24