I am applying the WordNet lemmatizer to my corpus, and I need to specify the POS tag for the lemmatizer:
import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()

def lemmitize(document):
    # Lemmatize as a verb first, then stem the result.
    return stemmer.stem(WordNetLemmatizer().lemmatize(document, pos='v'))

def preprocess(document):
    output = []
    for token in gensim.utils.simple_preprocess(document):
        # Skip stop words and tokens of three characters or fewer.
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            print("lemmitize: ", lemmitize(token))
            output.append(lemmitize(token))
    return output
As you can see, I am setting pos to verb (I know the WordNet default pos is noun). However, when I lemmatize this document:
the left door closed at the night
I get this output:
output: ['leav', 'door', 'close', 'night']
which is not what I was expecting. In the sentence above, 'left' says which door is meant (e.g. right or left), so it should not be reduced to 'leav'. If I choose pos='n', this particular problem may go away, but then the lemmatizer behaves like the WordNet default and has no effect on words like 'taken'.
I found a similar issue and modified the exception list in nltk_data/corpora/wordnet/verb.exc, changing the entry 'left leave' to 'left left', but I still get the same result, 'leav'.
Is there any solution to this problem? Or, in the best case, is there a way to add a custom dictionary of words (limited to my document) that WordNet should not lemmatize, something like:
my_dict_list = ['left', ...]
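
To make the request concrete, here is a minimal sketch of the behavior I have in mind, not something NLTK provides as far as I know. The names PROTECTED_WORDS, lemmatize_unless_protected, preprocess_tokens and fake_lemmitize are hypothetical and made up for illustration. fake_lemmitize is only a stand-in that mimics the outputs shown above, so the sketch runs without NLTK or its data files.

```python
from typing import Callable, Collection, Iterable, List

# Hypothetical per-document list of words that must pass through unchanged.
PROTECTED_WORDS = frozenset({"left"})


def lemmatize_unless_protected(
    token: str,
    lemmatize_fn: Callable[[str], str],
    protected: Collection[str] = PROTECTED_WORDS,
) -> str:
    """Return the token untouched if it is protected, else apply lemmatize_fn."""
    if token in protected:
        return token
    return lemmatize_fn(token)


def preprocess_tokens(
    tokens: Iterable[str],
    lemmatize_fn: Callable[[str], str],
    protected: Collection[str] = PROTECTED_WORDS,
) -> List[str]:
    """Apply the protected-word check to every token in order."""
    return [lemmatize_unless_protected(t, lemmatize_fn, protected) for t in tokens]


def fake_lemmitize(token: str) -> str:
    """Stand-in for the real lemmitize(); it only mimics the outputs in the question."""
    return {"left": "leav", "closed": "close"}.get(token, token)


print(preprocess_tokens(["left", "door", "closed", "night"], fake_lemmitize))
# prints ['left', 'door', 'close', 'night']
```

In my real code I would pass my own lemmitize function as lemmatize_fn. Because the check runs before both the lemmatizer and the stemmer, neither of them gets a chance to alter a protected word.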