I am currently working with Python's NLTK to preprocess text from the Kaggle SMS Spam Classification dataset. I have completed the following preprocessing steps (a rough sketch of what I did is shown after this list):
- Removed any extra spaces
- Removed punctuation and special characters
- Converted the text to lower case
- Replaced abbreviations such as "lol" and "brb" with their full forms
- Removed stop words
- Tokenized the data
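For reference, here is a rough sketch of my preprocessing. The abbreviation dictionary is just a toy example (my real one is larger), and the exact order of the steps may differ slightly from my actual code:

```python
import re
import string

from nltk.corpus import stopwords           # requires: nltk.download("stopwords")
from nltk.tokenize import word_tokenize     # requires: nltk.download("punkt")

# Toy abbreviation map for illustration only.
ABBREVIATIONS = {"lol": "laughing out loud", "brb": "be right back"}
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text: str) -> list:
    text = re.sub(r"\s+", " ", text).strip()                          # remove extra spaces
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = text.lower()                                               # lower-case
    text = " ".join(ABBREVIATIONS.get(w, w) for w in text.split())    # expand abbreviations
    tokens = word_tokenize(text)                                      # tokenize
    return [t for t in tokens if t not in STOP_WORDS]                 # remove stop words

print(preprocess("LOL  that was great,   brb!!"))
# ['laughing', 'loud', 'great', 'right', 'back']
```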
Now I plan to perform lemmatization and stemming separately on the tokenized data, and then compute TF-IDF separately on the lemmatized and the stemmed data, roughly as sketched below.
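Roughly, the plan looks like this (assuming NLTK's WordNetLemmatizer and PorterStemmer, and scikit-learn's TfidfVectorizer for the TF-IDF step; the token lists are placeholders for my preprocessed SMS messages):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer  # requires: nltk.download("wordnet")
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# One token list per SMS, produced by the preprocessing above.
token_lists = [["winner", "prizes", "waiting", "claim"],
               ["meeting", "friends", "later", "tonight"]]

lemmatized = [" ".join(lemmatizer.lemmatize(t) for t in doc) for doc in token_lists]
stemmed    = [" ".join(stemmer.stem(t) for t in doc) for doc in token_lists]

# Two separate TF-IDF matrices, one per normalization strategy.
tfidf_lemma = TfidfVectorizer().fit_transform(lemmatized)
tfidf_stem  = TfidfVectorizer().fit_transform(stemmed)

print(tfidf_lemma.shape, tfidf_stem.shape)
```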
My questions are as follows:
- Is there a practical use case for lemmatizing the tokenized data and then stemming the lemmatized output, or vice versa?
- Does stemming lemmatized data (or vice versa) make any sense theoretically, or is it simply incorrect? (A concrete illustration of what I mean is sketched after this list.)
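To make the second question concrete, this is the kind of chaining I am asking about, i.e. lemmatize first and then stem the lemma. It is purely illustrative; I have not committed to the Porter stemmer or to this order:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def lemmatize_then_stem(tokens):
    # e.g. "studies" -> lemma "study" -> stem "studi"
    return [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]

print(lemmatize_then_stem(["studies", "prizes", "meeting"]))
# something like ['studi', 'prize', 'meet'] with NLTK's defaults
```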
Context: I am relatively new to NLP, so I am trying to understand as much as I can about these concepts. The main idea behind this question is to understand whether applying lemmatization and stemming together makes sense theoretically or practically, or whether they should only ever be applied separately.
Questions Referenced:
- Should I perform both lemmatization and stemming?: The answer to this question was inconclusive and not accepted; it never discussed why you should or should not combine them in the first place.
- What is the difference between lemmatization vs stemming?: Explains the ideas behind stemming and lemmatization, but I was unable to draw answers to my questions from it.
- Stemmers vs Lemmatizers: Explains the pros and cons, as well as the contexts in which stemming and lemmatization might help.
- NLP Stemming and Lemmatization using Regular expression tokenization: Discusses the different preprocessing steps and performs stemming and lemmatization separately.