I'm using the Lesk algorithm to get Synsets from text, but I'm getting different results with the same input. Is this a "feature" of the Lesk algorithm, or am I doing something wrong? Here is the code I'm using:
self.SynSets = []
sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language. \
    Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. \
    The language provides constructs intended to enable clear programs on both a small and large scale. \
    Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.")
stopwordsList = stopwords.words('english')
self.sentNum = 0
for sentence in sentences:
    raw_tokens = word_tokenize(sentence)
    # drop stopwords and purely numeric tokens
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    # and len(token) > 3
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            self.SynSets.append(synset)
self.SynSets = set(self.SynSets)
self.WriteSynSets()
return self
At the output I get these results (the first 3 results from 2 different runs):
Synset('allow.v.09') Synset('code.n.03') Synset('coffee.n.01')
------------
Synset('allow.v.09') Synset('argumentation.n.02') Synset('boastfully.r.01')
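One thing I'm not sure about: as far as I can tell, wsd.lesk treats its first argument as an iterable of context words, while I'm passing it the raw sentence string, so the context becomes a set of single characters, nearly every overlap score is zero, and the choice degenerates to an arbitrary tie-break. A minimal sketch of the difference (using the same NLTK functions as above):

from nltk import wsd
from nltk.tokenize import word_tokenize

sentence = "Python supports multiple programming paradigms."
# Passing the raw string: the context is a set of single characters.
print(wsd.lesk(sentence, "python"))
# Passing the token list: the context is a set of words, as intended.
print(wsd.lesk(word_tokenize(sentence), "python"))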
If there is another (more stable) way to get synsets, I would be thankful for your help.
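For example, a simplified Lesk variant that scores the definition overlap itself and breaks ties deterministically (by sorting the candidate synsets by name) would be one way; stable_lesk below is just a hypothetical helper I sketched, not an NLTK function, and it assumes NLTK 3.x, where name() and definition() are methods:

from nltk.corpus import wordnet as wn
from nltk.tokenize import word_tokenize

def stable_lesk(sentence, word):
    # Simplified Lesk with a deterministic tie-break: candidates are
    # scanned in alphabetical order of synset name, so equal overlap
    # scores always resolve to the same synset.
    context = set(word_tokenize(sentence.lower()))
    best, best_overlap = None, -1
    for ss in sorted(wn.synsets(word), key=lambda s: s.name()):
        overlap = len(context.intersection(ss.definition().lower().split()))
        if overlap > best_overlap:
            best, best_overlap = ss, overlap
    return best

print(stable_lesk("Python is a widely used programming language.", "python"))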
Thanks in advance.
Edit:
As an additional example, here is the full script that I ran twice:
import nltk
from nltk.tokenize import sent_tokenize
from nltk import word_tokenize
from nltk import wsd
from nltk.corpus import stopwords

SynSets = []
sentences = sent_tokenize("Python is a widely used general-purpose, high-level programming language. \
    Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C++ or Java. \
    The language provides constructs intended to enable clear programs on both a small and large scale. \
    Python supports multiple programming paradigms, including object-oriented, imperative and functional programming or procedural styles.")
stopwordsList = stopwords.words('english')

for sentence in sentences:
    raw_tokens = word_tokenize(sentence)  # or WordPunctTokenizer().tokenize(sentence)
    # removing stopwords and digits (the minimum-length filter is commented out)
    final_tokens = [token.lower() for token in raw_tokens
                    if token not in stopwordsList
                    # and len(token) > 3
                    and not token.isdigit()]
    for token in final_tokens:
        synset = wsd.lesk(sentence, token)
        if synset is not None:
            SynSets.append(synset)

SynSets = sorted(set(SynSets))
with open("synsets.txt", "a") as file:
    file.write("\n-------------------\n")
    for synset in SynSets:
        file.write("{} ".format(synset))
and I got these results (the first 4 synsets written to the file on each of the 2 times I ran the program):
Synset('allow.v.04') Synset('boastfully.r.01') Synset('clear.v.11') Synset('code.n.02')
Synset('boastfully.r.01') Synset('clear.v.19') Synset('code.n.01') Synset('design.n.04')
SOLUTION: I found out what the problem was. After re-installing Python 2.7, all the problems were gone. So, don't use Python 3.x with the Lesk algorithm.
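A follow-up thought: if the root cause on Python 3 is per-run hash randomization (set iteration order changes between runs, and with it the way Lesk's overlap ties are broken), then pinning the hash seed should also make Python 3 runs repeatable. A minimal sketch, where script.py is a placeholder for the full script above:

import os
import subprocess
import sys

# Run the script twice with a fixed hash seed; if hash randomization is
# the culprit, both runs should append identical synset lists to synsets.txt.
env = dict(os.environ, PYTHONHASHSEED="0")
for _ in range(2):
    subprocess.call([sys.executable, "script.py"], env=env)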