I am stemming a list of words and making a DataFrame from it. The original data is as follows:

import pandas as pd

original = 'The man who flies the airplane dies in an air crash. His wife died a couple of weeks ago.'
df = pd.DataFrame({'text': [original]})
The functions I've used for lemmatisation and stemming are:

# lemmatise & stem each token
# (stemmer is defined elsewhere in my script; assume something like SnowballStemmer('english'))
import gensim
from nltk.stem import WordNetLemmatizer, SnowballStemmer

stemmer = SnowballStemmer('english')

def lemmatize_stemming(text):
    return stemmer.stem(WordNetLemmatizer().lemmatize(text, pos='v'))

def preprocess(text):
    result = []
    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS:
            result.append(lemmatize_stemming(token))
    return result
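For reference, gensim.utils.simple_preprocess on its own keeps the original surface forms; this is just to show what the tokens look like before lemmatize_stemming is applied (stopwords not yet removed here):

# original lower-cased tokens, before stopword removal and stemming
print(gensim.utils.simple_preprocess(original))
# roughly: ['the', 'man', 'who', 'flies', 'the', 'airplane', 'dies', 'in', 'an',
#           'air', 'crash', 'his', 'wife', 'died', 'couple', 'of', 'weeks', 'ago']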
The output comes from running df['text'].map(preprocess)[0], for which I get:
['man',
'fli',
'airplan',
'die',
'air',
'crash',
'wife',
'die',
'coupl',
'week',
'ago']
I wonder how I can map this output back to the original tokens. For instance, I have 'die', which comes from both 'died' and 'dies'.
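To make the collision concrete, here is a small check (assuming the SnowballStemmer setup above) showing that both surface forms reduce to the same stem, so the original word is no longer recoverable from the stem alone:

# both inflected forms collapse to the same stemmed token
print(lemmatize_stemming('dies'))   # 'die'
print(lemmatize_stemming('died'))   # 'die'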