Stemming vs Lemmatization for financial text in Python [NLTK]

Question

To extract more information from annual reports (10-Ks), I am trying to compare companies based on cosine similarity. One of the steps in this research is the stemming or lemmatization of words. The reason for doing this is to get the root of each word, so that different variations of a word that mean the same thing at their core are not treated as different words. For the stemmer and the lemmatizer, I used SnowballStemmer and WordNetLemmatizer from the NLTK package.

E.g. of stemming:
walking -> walk
walked -> walk
owing -> owe
owed -> owe

E.g. of lemmatization:
walking -> walking
walked -> walked
owing -> owing
owed -> owed

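The unchanged lemmatizer outputs in the examples are consistent with calling WordNetLemmatizer's lemmatize without a part of speech, in which case words are treated as nouns (the default is pos="n"). A minimal sketch of one common workaround follows: map Penn Treebank tags to the WordNet POS letters and pass them along. The helper names penn_to_wordnet_pos and lemmatize_tagged are hypothetical, and the lemmatizer is passed in as a plain callable so the sketch does not depend on any corpus download.

```python
def penn_to_wordnet_pos(tag: str) -> str:
    """Map a Penn Treebank tag (e.g. 'VBG') to a WordNet POS letter."""
    if tag.startswith("J"):
        return "a"  # adjective
    if tag.startswith("V"):
        return "v"  # verb
    if tag.startswith("R"):
        return "r"  # adverb
    return "n"      # noun; also used as the fallback for all other tags


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Call lemmatize(word, pos) for each (word, penn_tag) pair.

    `lemmatize` is any callable taking (word, pos); it is an assumption
    of this sketch that WordNetLemmatizer().lemmatize is passed here.
    """
    return [lemmatize(word, penn_to_wordnet_pos(tag))
            for word, tag in tagged_tokens]
```

With NLTK installed and its WordNet and tagger data downloaded, the pairs would typically come from nltk.pos_tag(tokens) and the callable would be WordNetLemmatizer().lemmatize. With a verb tag, walking and owed should then reduce to their base forms instead of staying unchanged.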
The question is the following: should I use the stemmer or a lemmatizer for financial text?

The way I see it, a stemmer would be more appropriate for this kind of research.

Disclaimer: I know there is already a question discussing stemming vs lemmatization on Stack Overflow. However, I am looking for clarification regarding financial text in particular, not the general case.

I think this type of question is better suited for https://datascience.stackexchange.com/ — Dani Mesejo, Oct 26 '18 at 09:47
I think that lemmatization with the POS (part-of-speech) tag could be a good idea. Maybe the similarity could be computed using something like word2vec, representing the word, the lemma and the POS tag as a vector. This could fit what you want to do, because if you provide more information about the lemmas you can get better matching when you compute similarity — Jason Jiménez, Oct 26 '18 at 10:18
It's an empirical question; you just have to test them out on a specific task and see which does better: https://stackoverflow.com/questions/17317418/stemmers-vs-lemmatizers — alvas, Oct 28 '18 at 15:13
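One way to run the kind of test suggested in the comments is to compute the same cosine similarity under different normalizers and compare the results. The sketch below is only an illustration under stated assumptions: it uses raw term counts rather than TF-IDF, the two sentences are made-up toy data, and toy_stem is a hypothetical stand-in for a real stemmer.

```python
import math
from collections import Counter


def cosine_similarity(tokens_a, tokens_b, normalize=lambda w: w):
    """Cosine similarity of two token lists, using raw term counts."""
    vec_a = Counter(normalize(t.lower()) for t in tokens_a)
    vec_b = Counter(normalize(t.lower()) for t in tokens_b)
    # Terms missing from vec_b count as 0 (Counter returns 0 for them).
    dot = sum(count * vec_b[term] for term, count in vec_a.items())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def toy_stem(word):
    """Hypothetical stand-in normalizer: strips a trailing 'ing' or 'ed'."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Made-up example sentences, not taken from any real filing.
doc_a = "the company walked away from the deal".split()
doc_b = "the company is walking away from deals".split()

raw_score = cosine_similarity(doc_a, doc_b)
stemmed_score = cosine_similarity(doc_a, doc_b, normalize=toy_stem)
print(f"raw: {raw_score:.3f}  normalized: {stemmed_score:.3f}")
```

To compare real normalizers, normalize could be set to SnowballStemmer("english").stem from nltk.stem, or to a lemmatizer callable, and the scores checked against whatever ground truth the task provides (for example, companies known to be in the same industry).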
