from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemmatizer.lemmatize('Less'.lower())

'le'

What's going on here, and how can I avoid this?

The word 'le' is now appearing all over my LDA topic model, and it doesn't make sense.

Who knows what other words it is affecting in the model. Should I avoid using the Lemmatizer or is there a way to fix this?

SCool
  • Lemmatization relies on correct part-of-speech tagging: see https://stackoverflow.com/questions/32957895/wordnetlemmatizer-not-returning-the-right-lemma-unless-pos-is-explicit-python – slothrop Jun 02 '23 at 11:06
  • For example, if you tag "less" as an adjective, you do get the correct answer: `lemmatizer.lemmatize('Less'.lower(), 'a')` gives `'less'` – slothrop Jun 02 '23 at 11:07

1 Answer


To add some context to the observation in the comments: the key is to understand the lemmatization rules, which depend on the part of speech. By default your word is treated as a noun, so its supposed plural suffix 's' gets stripped twice ('less' becomes 'les', then 'le'). The same thing happens with 'mes', a misspelling of the noun 'mess':

from nltk.stem import WordNetLemmatizer
word = 'mes'
wnl = WordNetLemmatizer()
wnl.lemmatize(word) # me

In your case, the right option is to pass the adjective tag 'a', as suggested in the comments:

word = 'less'
wnl = WordNetLemmatizer()
wnl.lemmatize(word, 'a') # less
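
For a whole corpus you would not hard-code the tag for every word. A common approach, sketched below, is to run a part-of-speech tagger first and translate its tags into the letters the lemmatizer expects. The helper name `penn_to_wordnet` is my own, and I am assuming the tagger emits Penn Treebank tags (as `nltk.pos_tag` does by default).

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag (e.g. 'JJR') to a WordNet POS letter.

    Hypothetical helper for illustration; it is not part of NLTK.
    """
    if tag.startswith("J"):
        return "a"  # JJ, JJR, JJS: adjectives
    if tag.startswith("V"):
        return "v"  # VB, VBD, VBG, ...: verbs
    if tag.startswith("RB"):
        return "r"  # RB, RBR, RBS: adverbs
    return "n"      # everything else falls back to noun, the lemmatizer's default


print(penn_to_wordnet("JJR"))  # a
print(penn_to_wordnet("NNS"))  # n
```

With tags from `nltk.pos_tag(tokens)` (which needs the tagger data to be downloaded), each pair can then be lemmatized as `wnl.lemmatize(token, penn_to_wordnet(tag))`. How well this works depends on the tagger labelling each token correctly in context.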

In more detail, the suffix-substitution rules for each part of speech are:

from nltk.corpus.reader import WordNetCorpusReader
WordNetCorpusReader.MORPHOLOGICAL_SUBSTITUTIONS
{'n': [('s', ''),
  ('ses', 's'),
  ('ves', 'f'),
  ('xes', 'x'),
  ('zes', 'z'),
  ('ches', 'ch'),
  ('shes', 'sh'),
  ('men', 'man'),
  ('ies', 'y')],
 'v': [('s', ''),
  ('ies', 'y'),
  ('es', 'e'),
  ('es', ''),
  ('ed', 'e'),
  ('ed', ''),
  ('ing', 'e'),
  ('ing', '')],
 'a': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')],
 'r': [],
 's': [('er', ''), ('est', ''), ('er', 'e'), ('est', 'e')]}

For the whole algorithm, see the source code of WordNetLemmatizer.lemmatize.
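
To make the "stripped twice" behaviour concrete, here is a simplified sketch of how rules like these can be applied repeatedly until a known word is found. It is not NLTK's actual code: `TOY_NOUNS` is a tiny made-up stand-in for WordNet's noun index, `toy_lemmatize` is a name I invented, and the real implementation also consults exception lists.

```python
# Noun rules copied from MORPHOLOGICAL_SUBSTITUTIONS['n'] above.
NOUN_RULES = [
    ("s", ""), ("ses", "s"), ("ves", "f"), ("xes", "x"), ("zes", "z"),
    ("ches", "ch"), ("shes", "sh"), ("men", "man"), ("ies", "y"),
]

# Hypothetical stand-in for WordNet's noun index; the real one is far larger.
TOY_NOUNS = {"le", "me", "mess", "dog"}


def apply_rules(forms, rules):
    """Apply every matching suffix rule once to every candidate form."""
    return [
        form[: -len(old)] + new
        for form in forms
        for old, new in rules
        if form.endswith(old)
    ]


def toy_lemmatize(word, rules=NOUN_RULES, lexicon=TOY_NOUNS):
    """Return the shortest known form reachable by suffix stripping, else the word."""
    candidates = [word]
    while candidates:
        known = [c for c in candidates if c in lexicon]
        if known:
            return min(known, key=len)
        # Nothing matched yet, so strip another layer of suffixes.
        candidates = apply_rules(candidates, rules)
    return word


print(toy_lemmatize("less"))  # le   ('less' -> 'les' -> 'le')
print(toy_lemmatize("mes"))   # me
print(toy_lemmatize("mess"))  # mess (already in the toy lexicon)
```

Because 'less' is not in the toy noun list, the 's' rule fires twice before the search reaches a known noun, 'le'. Passing the correct part of speech avoids this because the word is then looked up, and found, under that part of speech before any stripping happens.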

Maciej Skorski