I'm developing a simple NLP project in which, given a text and a word, I need to find the most likely sense of that word in the text.

Is there any implementation of WSD algorithms in Python? It's not quite clear to me whether there is anything in NLTK that can help. I'd be happy even with a naive implementation like the Lesk algorithm.

I've read similar questions like Word sense disambiguation in NLTK Python, but they give nothing but a reference to the NLTK book, which doesn't go very deep into the WSD problem.

  • sorry for the -1, reposting the question is good to get better answers but still it's `first comes first serves` in questions on SO =). I'll +1 for your other question/answer in return. – alvas Jan 03 '14 at 10:22
  • Well, I suppose both questions are almost the same, but the problem I have is not within the context of a sentence, but of a whole text. I'm not an expert in NLP, but it might be slightly different. – finiteautomata Jan 03 '14 at 18:29
  • no worries, a longer context should technically give you better WSD results, since num_overlaps increases or more information (syntactic/semantic) becomes available... but no one has theoretically tested that out. BTW, I'm no expert in NLP either ;), I write some hacky scripts. – alvas Jan 03 '14 at 19:17
  • Just in case you want something that could plug-and-play, i've coded several lesk algorithms here: https://github.com/alvations/pywsd – alvas Jan 05 '14 at 11:31
  • Oh, thanks! I was just doing the same now that I understood lesk algorithm :) – finiteautomata Jan 05 '14 at 17:34
  • I hope the concept is modular enough, given the basic idea of counting overlaps; people have tried different weights and different "signatures", but the better algorithms are non-Lesk-like. I'll be coding more of it in the coming months. – alvas Jan 05 '14 at 17:43
  • Yep, I've read about this. But for the prototype I'm working on, this should be enough. https://github.com/geekazoid/wisdom => this is the repo. For the time being, I've only adapted your code using Lesk, and removed some noise (such as stopwords). It can be installed through pip with: pip install -e git+https://github.com/geekazoid/wisdom#egg=wisdom – finiteautomata Jan 06 '14 at 04:03

3 Answers


In short: https://github.com/alvations/pywsd
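
If you just want something plug-and-play, that repo already packages several Lesk variants. A rough usage sketch, assuming the `simple_lesk` entry point shown in the repo's README (names and signatures may have changed since):

from pywsd.lesk import simple_lesk

sent = 'I went to the bank to deposit my money'
answer = simple_lesk(sent, 'bank')  # returns an nltk Synset object
print answer
print answer.definition  # definition() in newer NLTK versions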

In long: there are endless techniques used for WSD, ranging from mind-blasting machine-learning techniques that require lots of GPU power, to simply using the information in WordNet, or even just using word frequencies; see http://dl.acm.org/citation.cfm?id=1459355 .

Let's start with a simple Lesk algorithm that allows optional stemming, see http://en.wikipedia.org/wiki/Lesk_algorithm:

from nltk.corpus import wordnet as wn
from nltk.stem import PorterStemmer
from itertools import chain

bank_sents = ['I went to the bank to deposit my money',
'The river bank was full of dead fishes']

plant_sents = ['The workers at the industrial plant were overworked',
'The plant was no longer bearing flowers']

ps = PorterStemmer()

def lesk(context_sentence, ambiguous_word, pos=None, stem=True, hyperhypo=True):
    max_overlaps = 0
    lesk_sense = None
    context_sentence = context_sentence.split()
    for ss in wn.synsets(ambiguous_word):
        # If POS is specified.
        if pos and ss.pos != pos:
            continue

        lesk_dictionary = []

        # Includes definition.
        lesk_dictionary+= ss.definition.split()
        # Includes lemma_names.
        lesk_dictionary+= ss.lemma_names

        # Optional: includes lemma_names of hypernyms and hyponyms.
        if hyperhypo:
            lesk_dictionary+= list(chain(*[i.lemma_names for i in ss.hypernyms()+ss.hyponyms()]))       

        if stem: # Matching exact words causes sparsity, so let's match stems.
            lesk_dictionary = [ps.stem(i) for i in lesk_dictionary]
            context_sentence = [ps.stem(i) for i in context_sentence] 

        overlaps = set(lesk_dictionary).intersection(context_sentence)

        if len(overlaps) > max_overlaps:
            lesk_sense = ss
            max_overlaps = len(overlaps)
    return lesk_sense

print "Context:", bank_sents[0]
answer = lesk(bank_sents[0],'bank')
print "Sense:", answer
print "Definition:",answer.definition
print

print "Context:", bank_sents[1]
answer = lesk(bank_sents[1],'bank','n')
print "Sense:", answer
print "Definition:",answer.definition
print

print "Context:", plant_sents[0]
answer = lesk(plant_sents[0],'plant','n', True)
print "Sense:", answer
print "Definition:",answer.definition
print

Other than Lesk-like algorithms, there are various similarity measures that people have tried; a good but outdated yet still useful survey: http://acl.ldc.upenn.edu/P/P97/P97-1008.pdf
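
If you want to play with those similarity measures, NLTK's WordNet interface exposes several of them directly on synsets (path_similarity, wup_similarity, lch_similarity, ...). A minimal sketch with hand-picked example synsets; how you turn pairwise similarities into a sense decision (e.g. picking the sense closest to the senses of the context words) is up to you:

from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
rock = wn.synset('rock.n.01')

# Path-based similarity in (0, 1]; higher means closer in the hypernym hierarchy.
print "dog-cat path:", dog.path_similarity(cat)
print "dog-rock path:", dog.path_similarity(rock)

# Wu-Palmer similarity, based on the depth of the least common subsumer.
print "dog-cat wup:", dog.wup_similarity(cat)
print "dog-rock wup:", dog.wup_similarity(rock)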

alvas

You can try getting the first sense for each word using the WordNet corpus bundled with NLTK, with this short piece of code:

from nltk.corpus import wordnet as wn

def get_first_sense(word, pos=None):
    if pos:
        synsets = wn.synsets(word,pos)
    else:
        synsets = wn.synsets(word)
    return synsets[0]

best_synset = get_first_sense('bank')
print '%s: %s' % (best_synset.name, best_synset.definition)
best_synset = get_first_sense('set','n')
print '%s: %s' % (best_synset.name, best_synset.definition)
best_synset = get_first_sense('set','v')
print '%s: %s' % (best_synset.name, best_synset.definition)

Will print:

bank.n.01: sloping land (especially the slope beside a body of water)
set.n.01: a group of things of the same kind that belong together and are so used
put.v.01: put into a certain place or abstract location

Surprisingly, this works quite well, as the first sense often dominates the other senses.
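
The reason is that WordNet lists senses roughly in order of their frequency in its sense-tagged corpora, and you can inspect those counts through Lemma.count(). A small sketch (attribute-style accessors as in the answers above; newer NLTK versions use synset(), name() and definition() methods instead):

from nltk.corpus import wordnet as wn

# How often each noun sense of 'bank' was seen in WordNet's sense-tagged corpora.
for lemma in wn.lemmas('bank', 'n'):
    ss = lemma.synset
    print ss.name, lemma.count(), ss.definition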

justhalf

For WSD in Python you can try the WordNet bindings in NLTK or the Gensim library. The building blocks are there, but developing the complete algorithm is probably up to you.

For instance, using WordNet you can implement a simplified Lesk algorithm, as described in the Wikipedia entry.
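
A minimal sketch of such a simplified Lesk, counting only overlaps between the context and each sense's gloss (my own illustration, not code taken from either library; no stemming or stopword removal):

from nltk.corpus import wordnet as wn

def simplified_lesk(word, context):
    """Pick the sense of `word` whose gloss shares the most words with `context`."""
    context_words = set(context.lower().split())
    best_sense, best_overlap = None, 0
    for ss in wn.synsets(word):
        gloss_words = set(ss.definition.lower().split())  # ss.definition() in newer NLTK
        overlap = len(gloss_words & context_words)
        if overlap > best_overlap:
            best_sense, best_overlap = ss, overlap
    return best_sense

print simplified_lesk('bank', 'I went to the bank to deposit my money')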

Vsevolod Dyomkin