1

I am stemming a list of words and making a dataframe from it. The original data is as follows:

import pandas as pd

original = 'The man who flies the airplane dies in an air crash. His wife died a couple of weeks ago.'
df = pd.DataFrame({'text': [original]})

The functions I've used for lemmatisation and stemming are:

# lemmatize & stem each token
# (`stemmer` is assumed to be an NLTK stemmer such as SnowballStemmer('english');
#  its definition isn't shown here)
import gensim
from nltk.stem import WordNetLemmatizer

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(lemmatize_stemming(token))
    return result

The output comes from running df['text'].map(preprocess)[0], which gives:

['man',
 'fli',
 'airplan',
 'die',
 'air',
 'crash',
 'wife',
 'die',
 'coupl',
 'week',
 'ago']

I wonder how I can map the output back to the original tokens? For instance I have die, which comes from both died and dies.

Wiliam
    This might help: https://stackoverflow.com/questions/9481081/the-reverse-process-of-stemming Different Language but the idea is the same. Stemming isn't reversible, unless you want to track before and after for every time you stem something. – fam-woodpecker Nov 09 '21 at 00:05

2 Answers

1

Stemming destroys information in the original corpus, by non-reversibly turning multiple tokens into some shared 'stem' form.

If you want the original text, you need to retain it yourself.
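
For illustration, here's a minimal sketch of that information loss, assuming NLTK's WordNetLemmatizer and SnowballStemmer (the question's output looks consistent with that combination):

from nltk.stem import SnowballStemmer, WordNetLemmatizer

stemmer = SnowballStemmer('english')
lemmatizer = WordNetLemmatizer()

# several distinct surface forms collapse onto one stem
for word in ('dies', 'died', 'dying'):
    print(word, '->', stemmer.stem(lemmatizer.lemmatize(word, pos='v')))
# dies -> die
# died -> die
# dying -> die

Once only die is left, nothing in it tells you whether dies, died, or dying produced it, so the original-to-stem mapping has to be recorded at preprocessing time (as the other answer does).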

But also, note: many algorithms working on large amounts of data, like word2vec under ideal conditions, don't necessarily need or even benefit from stemming. You want to have vectors for all the words in the original text – not just the stems – and with enough data, the related forms of a word will get similar vectors. (Indeed, they'll even differ in useful ways, with all 'past' or 'adverbial' or whatever variants sharing a similar directional skew.)

So only do it if you're sure it's helping, given your corpus limits & goals.

micstr
gojomo
  • Thank you for this clarification. My current project is about topic modelling and I'm using the LDA algorithm for that. Was wondering if you have any suggestions there re the stemming – is it useful for LDA? – Wiliam Nov 09 '21 at 01:34
  • There's no way to know for sure beforehand – it depends on your data quality & quantity & end use. Best approach is to try it both ways & see which does better - stemming or not - on some project-specific, repeatable, robust evaluation that approximates your real need. – gojomo Nov 09 '21 at 03:26
0

You could return the mapping relationship as the result and perform postprocessing later.

def preprocess(text):
    lemma_mapping = {}  # original token -> stem
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            lemma_mapping[token] = lemmatize_stemming(token)
    return lemma_mapping
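
As a rough sketch of what that returns (derived from the question's own output), df['text'].map(preprocess)[0] would now give an original-token-to-stem mapping along the lines of:

{'man': 'man', 'flies': 'fli', 'airplane': 'airplan', 'dies': 'die',
 'air': 'air', 'crash': 'crash', 'wife': 'wife', 'died': 'die',
 'couple': 'coupl', 'weeks': 'week', 'ago': 'ago'}

Note that a plain dict keyed on the original token no longer gives you the ordered list of stems directly (and repeated tokens collapse into one entry), which the by-product variant below avoids.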

Or store it as a by-product.

from collections import defaultdict

# original token -> stem, filled in as a side effect of preprocessing
lemma_mapping = defaultdict(str)

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            lemma = lemmatize_stemming(token)
            result.append(lemma)
            lemma_mapping[token] = lemma
    return result
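
To get back from a stem to the words that produced it, which is what the question asks for, you could then invert lemma_mapping after preprocessing. A small sketch:

from collections import defaultdict

# stem -> list of original tokens that produced it
stem_to_originals = defaultdict(list)
for token, stem in lemma_mapping.items():
    stem_to_originals[stem].append(token)

print(stem_to_originals['die'])  # ['dies', 'died'] for the example sentence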
jadore801120