
I am applying the WordNet lemmatizer to my corpus, and I need to define the POS tag for the lemmatizer:

import gensim
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()

def lemmitize(document):
    return stemmer.stem(WordNetLemmatizer().lemmatize(document, pos='v'))

def preprocess(document):
    output = []
    for token in gensim.utils.simple_preprocess(document):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            print("lemmitize: ", lemmitize(token))
            output.append(lemmitize(token))
    return output

Now, as you can see, I am defining the POS as verb (and I know WordNet's default POS is noun). However, when I lemmatized my document:

the left door closed at the night  

I am getting the output:

output:  ['leav', 'door', 'close', 'night']

which is not what I was expecting. In my sentence above, left indicates which door (e.g. right or left). If I choose pos='n', this particular problem may be solved, but then the lemmatizer acts as the WordNet default and there is no effect on words like taken.

I found a similar issue here, so I modified the exception list in nltk_data/corpora/wordnet/verb.exc, changing left leave to left left, but I am still getting the same result, leav.
Now I am wondering if there is any solution to this problem or, in the best case, whether there is any way I can add a custom dictionary of words (limited to my document) that WordNet should not lemmatize, like:

my_dict_list = ['left', ...]
Bilgin
  • Is your use of the stemmer intentional? Also, are you aware that you can find POS based on context and not hardcode it? – Tiago Duque Aug 22 '19 at 15:22
  • @TiagoDuque Thanks for your comment. I think lemmatizing can normalize my text with no need for stemming, since I need the words in their root format. Can you please give additional details on how to find POS based on context? Thanks – Bilgin Sep 09 '19 at 14:37
  • Sure, give me some time. – Tiago Duque Sep 09 '19 at 14:41

2 Answers


You have made a common mistake: confusing lemmatizing with stemming.

Stemming

Stemming means reducing a word to its stem. It is not related to grammar; the result depends on your own data and the algorithm used.

The most commonly used stemmer, the Porter Stemmer, for example, removes "morphological and inflexional endings from words" (Porter Stemmer Website)

Therefore, words like cook, cooking, cooked and cookie have their morphological/inflexional endings removed, all ending up at (or near) "cook". However, notice that you are bundling together a noun, a verb in the present continuous, a verb in the past tense and another noun (and cookie, for example, even though it is a cooked food, doesn't actually share a "hierarchy" with the word "cook" or "to cook").
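A small sketch of this bundling, assuming NLTK's PorterStemmer:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
for word in ['cook', 'cooking', 'cooked', 'cookie']:
    print(word, '->', stemmer.stem(word))

# cook, cooking and cooked all reduce to 'cook';
# cookie ends up as 'cooki' -- close in spelling, but no grammatical analysis happened
```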

When you do:

stemmer.stem(WordNetLemmatizer().lemmatize(document))

you are stemming the WordNet output: first you lemmatize the word, then you stem it, removing the morphological/inflexional endings. In fact, you don't even need to lemmatize if you do stemming (it will only change the result for irregular verbs).

Lemmatizing

Lemmatizing, on the other hand, uses lexical information to reduce a word to its "default", non-inflected form. For it to work, it is very important to give the POS (since, as you've seen, leaves is a lexeme that represents both a verb and a noun).

But how to find the part of speech?

There are several techniques today, but the most widely used relies both on a lookup table and on the surrounding words: these are fed into a pre-trained machine learning model that returns the most probable tag. Read more at: https://medium.com/greyatom/learning-pos-tagging-chunking-in-nlp-85f7f811a8cb

Using a popular Python NLP package called NLTK, you can do the following (you have to download the pertinent packages first):

import nltk

sentence = "I want to tag these!"
token = nltk.word_tokenize(sentence)
nltk.pos_tag(token)

Result:

[('I', 'PRP'), ('want', 'VBP'), ('to', 'TO'), ('tag', 'VB'), ('these', 'DT'), ('!', '.')]

Another popular tool is spaCy, which goes as follows (you have to download the language model with the trained machine learning model first):

import spacy

nlp = spacy.load('en')
doc = nlp('I want to tag these!')
print([(token.text, token.pos_) for token in doc])

Result:

[('I', 'PRON'), ('want', 'VERB'), ('to', 'PART'), ('tag', 'VERB'), ('these', 'DET'), ('!', 'PUNCT')]

You can read more about Spacy's POS tagging here: https://spacy.io/usage/linguistic-features/

I would then recommend you to stick to lemmatization, since this will give you more fine-grained options to work with.

Tiago Duque
  • @Tiago Duque Thanks for your comment and very good explanation. I checked the link you provided above, but I am looking for answers like @Coding4pho suggested, or any dynamic solution to this. – Bilgin Sep 09 '19 at 22:13

You can add a custom dictionary for certain words, like pos_dict = {'breakfasted':'v', 'left':'a', 'taken':'v'}

By passing this customized pos_dict along with each token into the lemmitize function, you can lemmatize each token with a POS tag that you specify.

lemmatize(token, pos_dict.get(token, 'n')) will pass 'n' for its second argument as a default value, unless the token is in the pos_dict keys. You can change this default value to whatever you want.
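The fallback behaviour of dict.get can be seen in isolation:

```python
pos_dict = {'breakfasted': 'v', 'left': 'a', 'taken': 'v'}

print(pos_dict.get('left', 'n'))  # 'a' -- the token is in the dictionary
print(pos_dict.get('door', 'n'))  # 'n' -- not found, so the default is used
```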

def lemmitize(document, pos_dict):
    return stemmer.stem(WordNetLemmatizer().lemmatize(document, pos_dict.get(document, 'n')))

def preprocess(document, pos_dict):
    output = []
    for token in gensim.utils.simple_preprocess(document):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > 3:
            print("lemmitize: ", lemmitize(token, pos_dict))
            output.append(lemmitize(token, pos_dict))
    return output
Coding4pho
  • Thanks for the comment and sorry for the late reply. The code you provided above should not work, as ``lemmitize(document)`` takes one positional argument but you were passing two in ``print("lemmitize: ", lemmitize(token, pos_dict.get(token, 'n')))``. Can you please check this? Thanks – Bilgin Sep 09 '19 at 22:09
  • My apologies! I might have been confused between your `lemmitize()` function and WordNetLemmatizer's `lemmatize()` while I was typing up my answer. It's fixed now. – Coding4pho Sep 11 '19 at 02:20
  • Thanks for fixing the code. Now it is working fine. – Bilgin Sep 28 '19 at 21:37