Here's a complete example:
import nltk
from nltk.corpus import wordnet
from difflib import get_close_matches as gcm
from itertools import chain
from nltk.stem.porter import PorterStemmer

texts = ["apples are good. My teeth will fall out.",
         "roses are red. cars are great to have"]

lmtzr = nltk.WordNetLemmatizer()
stemmer = PorterStemmer()

for text in texts:
    tokens = nltk.word_tokenize(text)  # ideally sentence-tokenize first with nltk.sent_tokenize
    # take your pick here between the lemmatizer and WordNet synsets
    token_lemma = [lmtzr.lemmatize(token) for token in tokens]
    wn_lemma = [gcm(word, list(set(chain(*[s.lemma_names() for s in wordnet.synsets(word)]))))
                for word in tokens]
    # print(wn_lemma)  # handles exceptional forms like 'teeth' --> 'tooth'; worth a closer look
    tokens_final = [stemmer.stem(tokens[i]) if len(tokens[i]) > len(token_lemma[i]) else token_lemma[i]
                    for i in range(len(tokens))]
    print(tokens_final)
Output
['appl', 'are', 'good', '.', 'My', 'teeth', 'will', 'fall', 'out', '.']
['rose', 'are', 'red', '.', 'car', 'are', 'great', 'to', 'have']
Explanation
Notice the expression stemmer.stem(tokens[i]) if len(tokens[i]) > len(token_lemma[i]) else token_lemma[i]
This is where the magic happens: if the lemma is shorter than the original token (i.e. lemmatization actually changed the word), the original token gets stemmed; otherwise the lemma is kept as-is.
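For instance, here is a minimal illustration of the selection rule, reusing the stemmer defined above; the token/lemma pairs are hardcoded purely for demonstration:

token, lemma = 'apples', 'apple'  # lemmatizer shortened the word
print(stemmer.stem(token) if len(token) > len(lemma) else lemma)  # -> 'appl' (stemmed)

token, lemma = 'teeth', 'teeth'   # lemmatizer left the word unchanged
print(stemmer.stem(token) if len(token) > len(lemma) else lemma)  # -> 'teeth' (lemma kept)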
Note
The lemmatization you are attempting has some edge cases. WordNetLemmatizer is not smart enough to handle exceptional forms like 'teeth' --> 'tooth'. In those cases you may want to look at WordNet synsets (wordnet.synsets), which can come in handy. I have included a small case in the code comments above for your investigation.
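Concretely, here is a minimal sketch of that synset fallback in isolation (assuming the WordNet corpus has been downloaded via nltk.download('wordnet'); the exact matches returned may vary by WordNet version):

from itertools import chain
from difflib import get_close_matches as gcm
from nltk.corpus import wordnet

word = 'teeth'
# collect every lemma name across all synsets that WordNet finds for the word
candidates = set(chain(*[s.lemma_names() for s in wordnet.synsets(word)]))
print(gcm(word, list(candidates)))  # close matches should include 'tooth'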