I am currently working with Python's NLTK to preprocess text from the Kaggle SMS Spam Classification dataset. I have completed the following preprocessing steps (a rough sketch of what I did is shown after this list):
- Removed any extra spaces
- Removed punctuation and special characters
- Converted the text to lower case
- Replaced abbreviations such as "lol" and "brb" with their full forms
- Removed stop words
- Tokenized the data
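For reference, here is a rough sketch of my preprocessing. The abbreviation dictionary is just a toy example (my real one is larger), and the exact order of the steps may differ slightly from my actual code:

```python
import re
import string

from nltk.corpus import stopwords           # requires: nltk.download("stopwords")
from nltk.tokenize import word_tokenize     # requires: nltk.download("punkt")

# Toy abbreviation map for illustration only.
ABBREVIATIONS = {"lol": "laughing out loud", "brb": "be right back"}
STOP_WORDS = set(stopwords.words("english"))

def preprocess(text: str) -> list:
    text = re.sub(r"\s+", " ", text).strip()                          # remove extra spaces
    text = text.translate(str.maketrans("", "", string.punctuation))  # remove punctuation
    text = text.lower()                                               # lower-case
    text = " ".join(ABBREVIATIONS.get(w, w) for w in text.split())    # expand abbreviations
    tokens = word_tokenize(text)                                      # tokenize
    return [t for t in tokens if t not in STOP_WORDS]                 # remove stop words

print(preprocess("LOL  that was great,   brb!!"))
# ['laughing', 'loud', 'great', 'right', 'back']
```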
Now I plan to perform lemmatization and stemming separately on the tokenized data, and then compute TF-IDF separately on the lemmatized and the stemmed data, roughly as sketched below.
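Roughly, the plan looks like this (assuming NLTK's WordNetLemmatizer and PorterStemmer, and scikit-learn's TfidfVectorizer for the TF-IDF step; the token lists are placeholders for my preprocessed SMS messages):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer  # requires: nltk.download("wordnet")
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

# One token list per SMS, produced by the preprocessing above.
token_lists = [["winner", "prizes", "waiting", "claim"],
               ["meeting", "friends", "later", "tonight"]]

lemmatized = [" ".join(lemmatizer.lemmatize(t) for t in doc) for doc in token_lists]
stemmed    = [" ".join(stemmer.stem(t) for t in doc) for doc in token_lists]

# Two separate TF-IDF matrices, one per normalization strategy.
tfidf_lemma = TfidfVectorizer().fit_transform(lemmatized)
tfidf_stem  = TfidfVectorizer().fit_transform(stemmed)

print(tfidf_lemma.shape, tfidf_stem.shape)
```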
My questions are as follows:
- Is there a practical use case for lemmatizing the tokenized data and then stemming the lemmatized output, or vice versa?
- Does stemming lemmatized data (or vice versa) make any sense theoretically, or is it simply incorrect? (A concrete illustration of what I mean is sketched after this list.)
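To make the second question concrete, this is the kind of chaining I am asking about, i.e. lemmatize first and then stem the lemma. It is purely illustrative; I have not committed to the Porter stemmer or to this order:

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()

def lemmatize_then_stem(tokens):
    # e.g. "studies" -> lemma "study" -> stem "studi"
    return [stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens]

print(lemmatize_then_stem(["studies", "prizes", "meeting"]))
# something like ['studi', 'prize', 'meet'] with NLTK's defaults
```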
Context: I am relatively new to NLP, so I am trying to understand as much as I can about these concepts. The main idea behind this question is to understand whether applying lemmatization and stemming together makes sense theoretically or practically, or whether they should only ever be applied separately.
Questions Referenced:
- Should I perform both lemmatization and stemming?: The answer to this question was inconclusive and not accepted; it never discussed why you should or should not combine them in the first place.
- What is the difference between lemmatization vs stemming?: Explains the ideas behind stemming and lemmatization, but I was unable to draw answers to my questions from it.
- Stemmers vs Lemmatizers: Explains the pros and cons, as well as the contexts in which stemming and lemmatization might help.
- NLP Stemming and Lemmatization using Regular expression tokenization: Discusses the different preprocessing steps and performs stemming and lemmatization separately.