27

I am new to NLTK in Python and I am looking for a sample application that can do word sense disambiguation. Search results turn up plenty of algorithms but no sample application. I just want to pass in a sentence and get the sense of each word by referring to the WordNet library. Thanks

I have found a similar module in Perl: http://marimba.d.umn.edu/allwords/allwords.html. Is there such a module in NLTK for Python?

thesensemakers

6 Answers

19

Recently, part of the pywsd code has been ported into the bleeding-edge version of NLTK, in the wsd.py module. Try:

>>> from nltk.wsd import lesk
>>> sent = 'I went to the bank to deposit my money'
>>> ambiguous = 'bank'
>>> lesk(sent, ambiguous)
Synset('bank.v.04')
>>> lesk(sent, ambiguous).definition()
u'act as the banker in a game or in gambling'

For better WSD performance, use the pywsd library instead of the NLTK module. In general, simple_lesk() from pywsd does better than lesk from NLTK. I'll try to update the NLTK module as much as possible when I'm free.
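Here's a minimal sketch of calling pywsd directly (assuming pywsd is installed and importable; the same call and output appear in the fuller example further down this page, and in recent NLTK versions definition() is a method):

>>> from pywsd.lesk import simple_lesk
>>> sent = 'I went to the bank to deposit my money'
>>> answer = simple_lesk(sent, 'bank')  # returns an NLTK Synset
>>> answer
Synset('depository_financial_institution.n.01')
>>> answer.definition()
'a financial institution that accepts deposits and channels the money into lending activities'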


In response to Chris Spencer's comment, please note the limitations of Lesk algorithms. I'm simply giving an accurate implementation of the algorithms. It's not a silver bullet: http://en.wikipedia.org/wiki/Lesk_algorithm
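For intuition, here is a rough sketch of the simplified Lesk idea (not the exact NLTK or pywsd code): score each candidate synset by the word overlap between the context and the synset's gloss, then return the best-scoring sense.

from nltk.corpus import wordnet as wn

def simplified_lesk(context_sentence, ambiguous_word, pos=None):
    # Pick the synset whose definition shares the most words with the context.
    context = set(context_sentence.lower().split())
    best_sense, best_overlap = None, -1
    for synset in wn.synsets(ambiguous_word, pos=pos):
        gloss = set(synset.definition().lower().split())
        overlap = len(context & gloss)
        if overlap > best_overlap:
            best_sense, best_overlap = synset, overlap
    return best_sense

With so little signal in a short sentence, ties and near-ties are common, which is exactly why the plain algorithm is fragile.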

Also please note that, although:

lesk("My cat likes to eat mice.", "cat", "n")

doesn't give you the right answer, you can use pywsd's implementation of max_similarity():

>>> from pywsd.similarity import max_similarity
>>> max_similarity('my cat likes to eat mice', 'cat', 'wup', pos='n').definition()
'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats'
>>> max_similarity('my cat likes to eat mice', 'cat', 'lin', pos='n').definition()
'feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats'

@Chris, if you want a python setup.py, just make a polite request and I'll write it...

alvas
  • Unfortunately, the accuracy is pretty god-awful. `lesk("My cat likes to eat mice.", "cat", "n")` => `Synset('computerized_tomography.n.01')`. And pywsd doesn't even have an install script... – Cerin Aug 23 '14 at 02:47
  • Dear Chris, have you tried other variants of Lesk? Esp. `simple_lesk()` or `adapted_lesk()`? The original Lesk is known to have problems, hence the other solutions available in the package: http://en.wikipedia.org/wiki/Lesk_algorithm. Also, I'm maintaining this during my free time and it's not what I do for a living... – alvas Aug 23 '14 at 16:52
  • Yes, I tried every variant of Lesk in your package, and none worked on my sample corpus. I had to create a variant that also used glosses from all hyponyms and meronyms linked to the word just to get a handful of positive results, but even then it was only 15% accurate. It's not your code; it's Lesk that's the problem. It's simply not a reliable heuristic. – Cerin Aug 24 '14 at 23:32
  • Try maximizing the similarity; it might do better. Also, I'm coding more algorithms, but that's left for the code sprint in Sept. Also, take a look at more state-of-the-art methods. Lastly, Most Frequent Sense usually does pretty well, and state-of-the-art methods manage to beat it by 1-2%, at most 5%, when they back off with MFS... – alvas Aug 25 '14 at 06:00
  • Guys, would it make sense to take a manually labelled corpus (with synsets disambiguated by a human) and train some kind of ML classifier on it? The trained classifier could then be included in your package as yet another disambiguation algorithm, if we see that its accuracy on texts unseen during training is high ) – Anatoly Alekseev Jan 18 '18 at 15:48
  • @AnatolyAlekseev see http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=83E3821F0E2E401317B4912AD0073804?doi=10.1.1.300.8098&rep=rep1&type=pdf – alvas Jan 18 '18 at 23:49
  • @alvas While importing the module, why does it say it needs to warm up? Takes about 25 seconds (and makes every other app unresponsive) every single time I import. – John Strood Sep 24 '18 at 13:53
  • It only loads once. It will save you a lot of time later on after it's loaded =) – alvas Sep 24 '18 at 22:17
  • @alvas Thanks. No offense, but loading the ~80 MB pickle makes everything unresponsive, makes me wanna throw my laptop :P – John Strood Sep 26 '18 at 12:49
  • There's a way for it to not load, but the time to run the functions will increase tremendously. And 80MB is really nothing today; neural anything these days takes up 100s of MB to GBs. – alvas Sep 26 '18 at 13:54
8

Yes, in fact, there is a book that the NLTK team wrote; it has multiple chapters on classification, and they explicitly cover how to use WordNet. You can also buy a physical copy of the book from Safari.

FYI: NLTK is written by natural language processing academics for use in their introductory programming courses.

Indolering
3

As a practical answer to the OP's request, here's a Python implementation of several WSD methods that returns senses in the form of NLTK synsets: https://github.com/alvations/pywsd

It includes:

  • Lesk algorithms (original Lesk, adapted Lesk, and simple Lesk)
  • Baseline algorithms (random sense, first sense, Most Frequent Sense)

It can be used as such:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Assumes pywsd is installed and importable (e.g. pip install pywsd)

bank_sents = ['I went to the bank to deposit my money',
              'The river bank was full of dead fishes']

plant_sents = ['The workers at the industrial plant were overworked',
               'The plant was no longer bearing flowers']

print("======== TESTING simple_lesk ===========\n")
from pywsd.lesk import simple_lesk
print("#TESTING simple_lesk() ...")
print("Context:", bank_sents[0])
answer = simple_lesk(bank_sents[0], 'bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING simple_lesk() with POS ...")
print("Context:", bank_sents[1])
answer = simple_lesk(bank_sents[1], 'bank', 'n')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING simple_lesk() with POS and stems ...")
print("Context:", plant_sents[0])
answer = simple_lesk(plant_sents[0], 'plant', 'n', True)  # trailing True enables stemming
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("======== TESTING baseline ===========\n")
from pywsd.baseline import random_sense, first_sense
from pywsd.baseline import max_lemma_count as most_frequent_sense

print("#TESTING random_sense() ...")
print("Context:", bank_sents[0])
answer = random_sense('bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING first_sense() ...")
print("Context:", bank_sents[0])
answer = first_sense('bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

print("#TESTING most_frequent_sense() ...")
print("Context:", bank_sents[0])
answer = most_frequent_sense('bank')
print("Sense:", answer)
print("Definition:", answer.definition())
print()

[out]:

======== TESTING simple_lesk ===========

#TESTING simple_lesk() ...
Context: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities

#TESTING simple_lesk() with POS ...
Context: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)

#TESTING simple_lesk() with POS and stems ...
Context: The workers at the industrial plant were overworked
Sense: Synset('plant.n.01')
Definition: buildings for carrying on industrial labor

======== TESTING baseline ===========

#TESTING random_sense() ...
Context: I went to the bank to deposit my money
Sense: Synset('deposit.v.02')
Definition: put into a bank account

#TESTING first_sense() ...
Context: I went to the bank to deposit my money
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)

#TESTING most_frequent_sense() ...
Context: I went to the bank to deposit my money
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
alvas
  • There is no benefit in associating the `context` with the `random_sense`, `first_sense` and `most_frequent_sense` functions in the above sample. Sorry, but it is a bit confusing: those implementations are context independent and have a different meaning. `simple_lesk` is the only one that takes context as an input. – Dan M Jul 10 '22 at 15:34
  • The other implementations are very simple baselines =) – alvas Jul 14 '22 at 10:21
0

NLTK has APIs to access WordNet. WordNet organizes words into synsets. This gives you information on a word, its hypernyms, hyponyms, root word, etc.
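For instance, a quick sketch of that API (assuming the WordNet corpus has been downloaded via nltk.download('wordnet'); 'bank.n.01' is just an illustrative sense):

from nltk.corpus import wordnet as wn

# All senses (synsets) WordNet lists for 'bank'
for synset in wn.synsets('bank'):
    print(synset.name(), '-', synset.definition())

# Navigate the hierarchy of one sense
bank = wn.synset('bank.n.01')
print(bank.hypernyms())       # more general concepts
print(bank.hyponyms())        # more specific concepts
print(bank.root_hypernyms())  # the root of the hierarchy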

"Python Text Processing with NLTK 2.0 Cookbook" is a good book to get you started on various features of NLTK. It is easy to read, understand and implement.

Also, you can look at other papers (outside the realm of NLTK) which talk about using Wikipedia for word sense disambiguation.

sprezzatura
-1

Yes, it is possible with the WordNet module in NLTK. The similarity measures used in the tool mentioned in your post exist in the NLTK WordNet module too.
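For example, a minimal sketch of those measures through NLTK's WordNet interface (assuming the wordnet and wordnet_ic corpora have been downloaded via nltk.download(); the dog/cat synsets are just illustrative):

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')

# Path-based measures
print(dog.path_similarity(cat))  # shortest-path similarity in (0, 1]
print(dog.wup_similarity(cat))   # Wu-Palmer
print(dog.lch_similarity(cat))   # Leacock-Chodorow

# Information-content measures need an IC dictionary, e.g. from the Brown corpus
brown_ic = wordnet_ic.ic('ic-brown.dat')
print(dog.res_similarity(cat, brown_ic))  # Resnik
print(dog.lin_similarity(cat, brown_ic))  # Lin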

Jaggu