NLTK. Lesk returns different result for the same input

Question

I'm using LESK algorithm for getting SynSets from the text. But I'm getting different results with the same inputs. Is it Lesk algorithm "feature" or am I doing something wrong? Next is the code I'm using:

    self.SynSets =[]
    sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
        Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
        The language provides constructs intended to enable clear programs on both a small and large scale.\
        Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
        ")
    stopwordsList =  stopwords.words('english')
    self.sentNum=0;
    for sentence in sentences:
        raw_tokens =  word_tokenize(sentence)
        final_tokens = [token.lower() for token in raw_tokens 
                    if(not token in stopwordsList) 
                    #and (len(token) > 3) 
                    and not token.isdigit()]
        for token in final_tokens:
            synset = wsd.lesk(sentence, token)
            if not synset is None:
                self.SynSets.append(synset)

    self.SynSets = set(self.SynSets)
    self.WriteSynSets()
    return self

At the output I'm having results (first 3 results from 2 different launching):

Synset('allow.v.09')   Synset('code.n.03')   Synset('coffee.n.01') 
------------
Synset('allow.v.09')   Synset('argumentation.n.02')   Synset('boastfully.r.01')

If there is some another (more stable) way to get synsets, I will be thankfull for yours help.

Thanks in advance.

Edited

For additional example here is full script that I've ran for 2 times:

import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import wsd
from nltk.corpus import stopwords

SynSets =[]
sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
    Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
    The language provides constructs intended to enable clear programs on both a small and large scale.\
    Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
    ")
stopwordsList =  stopwords.words('english')

for sentence in sentences:
    raw_tokens =  word_tokenize(sentence)#WordPunctTokenizer().tokenize(sentence)
    #removing stopwords and words, smaller than 3 characters
    final_tokens = [token.lower() for token in raw_tokens 
                if(not token in stopwordsList) 
                #and (len(token) > 3) 
                and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if not synset is None:
            SynSets.append(synset)


SynSets = set(SynSets)

SynSets = sorted(SynSets)
with open("synsets.txt", "a") as file:
    file.write("\n-------------------\n")
    for synset in SynSets:
        file.write("{}   ".format(str(synset.__str__())))
file.close()

and I had these results (first 4 resulted synsets that was written in the file for each of 2 times tat i ran program):

Synset('allow.v.04') Synset('boastfully.r.01') Synset('clear.v.11') Synset('code.n.02')
Synset('boastfully.r.01') Synset('clear.v.19') Synset('code.n.01') Synset('design.n.04')

SOLUTION: I've got what was the problem. After re-installing python 2.7 all problems has gone. So, don't use python 3.x with lesk algorithm.

alvas · Answer 1 · 2015-02-11T07:43:55.623

There is a wsd function for lesk algorithm in the latest version of NLTK:

>>> from nltk.wsd import lesk
>>> from nltk import sent_tokenize
>>> text = "Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles."
>>> for sent in sent_tokenize(text):
...     for word in word_tokenize(sent):
...             print word, lesk(sent, word), sent

[out]:

Python Synset('python.n.02') Python is a widely used general-purpose, high-level programming language.
is Synset('be.v.08') Python is a widely used general-purpose, high-level programming language.
a Synset('angstrom.n.01') Python is a widely used general-purpose, high-level programming language.
widely Synset('wide.r.04') Python is a widely used general-purpose, high-level programming language.
used Synset('use.v.01') Python is a widely used general-purpose, high-level programming language.
general-purpose None Python is a widely used general-purpose, high-level programming language.
, None Python is a widely used general-purpose, high-level programming language.

Also, try the disambiguate() from pywsd (https://github.com/alvations/pywsd):

>>> from pywsd import disambiguate>>> from nltk import sent_tokenize
>>> text = "Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles."
>>> for sent in sent_tokenize(text):
...     print disambiguate(sent, prefersNone=True)
...

[out]:

[('Python', Synset('python.n.02')), ('is', None), ('a', None), ('widely', Synset('widely.r.03')), ('used', Synset('used.a.01')), ('general-purpose', None), (',', None), ('high-level', None), ('programming', Synset('scheduling.n.01')), ('language', Synset('terminology.n.01')), ('.', None)]
[('Its', None), ('design', Synset('purpose.n.01')), ('philosophy', Synset('philosophy.n.03')), ('emphasizes', Synset('stress.v.01')), ('code', Synset('code.n.03')), ('readability', Synset('readability.n.01')), (',', None), ('and', None), ('its', None), ('syntax', Synset('syntax.n.03')), ('allows', Synset('let.v.01')), ('programmers', Synset('programmer.n.01')), ('to', None), ('express', Synset('express.n.03')), ('concepts', Synset('concept.n.01')), ('in', None), ('fewer', None), ('lines', Synset('wrinkle.n.01')), ('of', None), ('code', Synset('code.n.03')), ('than', None), ('would', None), ('be', None), ('possible', Synset('potential.a.01')), ('in', None), ('languages', Synset('linguistic_process.n.02')), ('such', None), ('as', None), ('C++', None), ('or', None), ('Java', Synset('java.n.03')), ('.', None)]
[('The', None), ('language', Synset('language.n.01')), ('provides', Synset('provide.v.06')), ('constructs', Synset('concept.n.01')), ('intended', Synset('mean.v.03')), ('to', None), ('enable', None), ('clear', Synset('open.n.01')), ('programs', Synset('program.n.08')), ('on', None), ('both', None), ('a', None), ('small', Synset('small.a.01')), ('and', None), ('large', Synset('large.a.01')), ('scale', Synset('scale.n.10')), ('.', None)]
[('Python', Synset('python.n.02')), ('supports', Synset('support.n.11')), ('multiple', None), ('programming', Synset('program.v.02')), ('paradigms', Synset('substitution_class.n.01')), (',', None), ('including', Synset('include.v.03')), ('object-oriented', None), (',', None), ('imperative', Synset('imperative.a.02')), ('and', None), ('functional', Synset('functional.a.01')), ('programming', Synset('scheduling.n.01')), ('or', None), ('procedural', Synset('procedural.a.01')), ('styles', Synset('vogue.n.01')), ('.', None)]

They're not perfect but they are close to the accurate implementation of lesk.

EDITED

To verify that it's the same results everytime you run, there should be no STDOUT when you do this:

from nltk.wsd import lesk
from nltk import sent_tokenize, word_tokenize
text = "Python is a widely used general-purpose, high-level programming language. Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. The language provides constructs intended to enable clear programs on both a small and large scale. Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles."

lst = []
for sent in sent_tokenize(text):
    lst = []
    for word in word_tokenize(sent):
        lst.append(lesk(sent, word))
    for i in range(10):
        lst2 = []
        for word in word_tokenize(sent):
            lst2.append(lesk(sent, word))
        assert lst2 == lst

I ran the OP's code 10 times but it yields the same result:

import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import wsd
from nltk.corpus import stopwords

def run():
    SynSets =[]
    sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language.\
        Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java.\
        The language provides constructs intended to enable clear programs on both a small and large scale.\
        Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.\
        ")
    stopwordsList =  stopwords.words('english')

    for sentence in sentences:
        raw_tokens =  word_tokenize(sentence)#WordPunctTokenizer().tokenize(sentence)
        #removing stopwords and words, smaller than 3 characters
        final_tokens = [token.lower() for token in raw_tokens 
                    if(not token in stopwordsList) 
                    #and (len(token) > 3) 
                    and not token.isdigit()]
        for token in final_tokens:
            synset = wsd.lesk(sentence, token)
            if not synset is None:
                SynSets.append(synset)
    return sorted(set(SynSets))

run1 = run()

for i in range(10):
    assert run1 == run()

I have NLTK 3.0.1 installed. And I'm using wsd.lesk from NLTK. Probem is different result in output. Did you have similar issue? For pywsd - thanks. I'll try it — MisterMe, Feb 10 '15 at 21:35
Are you calling it correctly? There's no different when i run it 10 times. How did you call lesk from NLTK? — alvas, Feb 10 '15 at 21:52
I have difference when i relaunch the program. You can see how i'm usin lesk from example. — MisterMe, Feb 10 '15 at 22:19
It's not that different, using your code, i ran your code 10 times and it's still the same. — alvas, Feb 11 '15 at 07:44
I Ran your code as well. It finished without any exceptions. But then i ran my code(relaunching. with finishing program to the end) and ir did have different results. What is wrong with this?( — MisterMe, Feb 11 '15 at 22:13
i ran your code without changing anything and i got this output: http://pastebin.com/PurJLhCH . There's still no difference, is there any other part of your code changing the synsets list? — alvas, Feb 12 '15 at 06:31
If all else fails, try the pywsd disambiguate module. I'm quite sure it's stable after several round of testing and also the author of the module is the same nltk.wsd.lesk ;) — alvas, Feb 12 '15 at 06:36
Thanks for responce. I've tried to run my code from the different machine. As same issue - different results. Tried to install pywsd (downloaded from github and copied pywsd folder into the c:/python34. Than - when trying to import - lesk module not found error occures) Also i've tried to run code with nltk.lesk - just for the 1 sentence - having diffrent result. — MisterMe, Feb 13 '15 at 18:02
Also I've tried to update nltk and python - didn't help. I'm running my code from VS 2013. Also I've uploaded all my code here: https://drive.google.com/folderview?id=0B8AKmvfIdZnRREE3U21PeDlveW8&usp=sharing (code in SynSetTable file). Thanks — MisterMe, Feb 13 '15 at 18:09
I gona reinstall Windows if it help) I just cannot understand why is so determinic code behaves in so unpredictible way. — MisterMe, Feb 13 '15 at 18:16
sorry i went for a long holiday for cny. I was using the nltk v3.0. But i guess @dimazest should have it fixed. — alvas, Feb 25 '15 at 18:11

NLTK. Lesk returns different result for the same input

1 Answers1