from stemming.porter2 import stem

documents = ['got',"get"]

documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]
print(documents)

The result is:

[['got'], ['get']]

Can someone help explain this? Thank you!

user2864740
Jiazhao Li
  • NIT: It's an artifact of the "stemming" library/method used - not Python, which is just the framework/runtime. The *specific library used* should thus be included in the question: is it https://pypi.org/project/stemming/1.0/? – user2864740 Aug 25 '18 at 02:39
  • It's one of those peculiar cases where a stemmer is not powerful enough to understand what you want. Consider lemmatizing the word first and then stemming the lemma: `nltk.WordNetLemmatizer().lemmatize("got","v")` -> `"get"`. – DYZ Aug 25 '18 at 23:44

1 Answer


What you want is a lemmatizer instead of a stemmer. The difference is subtle.

Generally, a stemmer strips suffixes by rule and, in some cases, consults an exception list for words that cannot be normalized by suffix stripping alone. "got" has no suffix to strip, so the stemmer returns it unchanged; it does not know that "got" is an inflected form of "get".

A lemmatizer tries to find the base (dictionary) form of a word, such as the infinitive of a verb, and it usually requires language-specific rules.
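
To make the distinction concrete, here is a toy sketch in plain Python. It is not how `stemming.porter2` or NLTK work internally; the suffix list, the lookup table, and the names `toy_stem` and `toy_lemmatize` are all hypothetical and exist only for illustration.

```python
# Toy illustration only: neither function reflects a real library's rules.
SUFFIXES = ("ing", "ed", "s")  # hypothetical, deliberately tiny suffix list

def toy_stem(word):
    """Strip the first matching suffix; irregular forms pass through unchanged."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table of irregular forms, keyed by (word, POS).
IRREGULAR = {("got", "v"): "get", ("went", "v"): "go"}

def toy_lemmatize(word, pos):
    """Check the irregular-form table first, then fall back to suffix stripping."""
    return IRREGULAR.get((word, pos), toy_stem(word))

print(toy_stem("walked"))         # walk
print(toy_stem("got"))            # got (no suffix to strip)
print(toy_lemmatize("got", "v"))  # get (found via the lookup table)
```

The point of the sketch is that suffix stripping alone can never turn "got" into "get"; some dictionary-like knowledge, keyed by part of speech, is needed.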

See


Lemmatization with the NLTK implementation of the morphy lemmatizer needs the correct part-of-speech (POS) tag to be reasonably accurate.

Avoid lemmatizing individual words in isolation (in fact, never do it). Instead, lemmatize a fully POS-tagged sentence, e.g.

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet as wn

def penn2morphy(penntag, returnNone=False, default_to_noun=False):
    # Map the first two characters of a Penn Treebank tag to a WordNet POS.
    morphy_tag = {'NN':wn.NOUN, 'JJ':wn.ADJ,
                  'VB':wn.VERB, 'RB':wn.ADV}
    try:
        return morphy_tag[penntag[:2]]
    except KeyError:
        # No WordNet equivalent (pronouns, prepositions, punctuation, ...).
        if returnNone:
            return None
        elif default_to_noun:
            return 'n'
        else:
            return ''
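
To see how the helper collapses the many Penn Treebank tags into four WordNet categories, here is a dependency-free sketch of the same mapping. It hard-codes the single-letter codes 'n', 'a', 'v' and 'r' that the WordNet constants stand for, and the name `penn_prefix_to_pos` is hypothetical rather than part of NLTK.

```python
# Dependency-free sketch of the tag mapping; penn_prefix_to_pos is a hypothetical name.
# The letters are the values behind wn.NOUN, wn.ADJ, wn.VERB and wn.ADV.
PENN_PREFIX_TO_POS = {"NN": "n", "JJ": "a", "VB": "v", "RB": "r"}

def penn_prefix_to_pos(penn_tag, default=""):
    """Map a Penn Treebank tag to a WordNet POS letter via its first two characters."""
    return PENN_PREFIX_TO_POS.get(penn_tag[:2], default)

# VBD and VBZ both collapse to 'v'; PRP and IN have no WordNet equivalent.
for tag in ["VBD", "VBZ", "NNS", "JJR", "RBS", "PRP", "IN"]:
    print(tag, repr(penn_prefix_to_pos(tag)))
```

Only the first two characters of the tag matter, which is why past-tense (VBD) and third-person (VBZ) verbs both end up tagged as 'v'.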

With the penn2morphy helper function, you can convert the POS tags from pos_tag() to morphy tags and then lemmatize each token with its tag:

>>> from nltk.stem import WordNetLemmatizer
>>> wnl = WordNetLemmatizer()
>>> sent = "He got up in bed at 8am."
>>> [(token, penn2morphy(tag)) for token, tag in pos_tag(word_tokenize(sent))]
[('He', ''), ('got', 'v'), ('up', ''), ('in', ''), ('bed', 'n'), ('at', ''), ('8am', ''), ('.', '')]
>>> [wnl.lemmatize(token, pos=penn2morphy(tag, default_to_noun=True)) for token, tag in pos_tag(word_tokenize(sent))]
['He', 'get', 'up', 'in', 'bed', 'at', '8am', '.']

For convenience, you can also try the pywsd lemmatizer, which handles the tagging for you.

>>> from pywsd.utils import lemmatize_sentence
Warming up PyWSD (takes ~10 secs)... took 7.196984529495239 secs.
>>> sent = "He got up in bed at 8am."
>>> lemmatize_sentence(sent)
['he', 'get', 'up', 'in', 'bed', 'at', '8am', '.']

See also https://stackoverflow.com/a/22343640/610569

alvas