How to un-stem a word in Python?

Question

I want to know if there is any way to un-stem words back to a normal form.

The problem is that I have thousands of words in different forms, e.g. eat, eaten, ate, eating and so on, and I need to count the frequency of each word. All of these (eat, eaten, ate, eating, etc.) should count towards eat, which is why I used stemming.

But the next part of the problem requires me to find similar words in the data, and I am using nltk's synsets to calculate Wu-Palmer similarity among the words. The problem is that nltk's synsets won't work on stemmed words, or at least they won't in the code from "check if two words are related to each other".

How should I do it? Is there a way to un-stem a word?

If you refactor your workflow, you could work with tuples within which the first element is the full word and the second is the stemmed representation. This isn't efficient from a storage perspective, but it will make it easier for you to keep track of your words. — duhaime, May 15 '15 at 19:51
How about you check if words are related before stemming? Would that be possible? Then you don't have to store both representations. — Rcynic, May 15 '15 at 20:14
@Rcynic Yeah I thought about that but it would make the work too hectic as in there are too many words to relate then. — silent_dev, May 16 '15 at 07:03
@duhaime Sadly, it's not an option because again there would be too many words — silent_dev, May 16 '15 at 07:04

Rafael Valero · Answer 1 · 2018-04-20T08:32:04.857

I think a reasonable approach is something like the one described in https://stackoverflow.com/a/30670993/7127519.

A possible implementation could be something like this:

import re
import string
import nltk
import pandas as pd
stemmer = nltk.stem.porter.PorterStemmer()

That sets up the stemmer to use. Here is some sample text:

complete_text = ''' cats catlike catty cat 
stemmer stemming stemmed stem 
fishing fished fisher fish 
argue argued argues arguing argus argu 
argument arguments argument '''

Create a list with the different words:

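my_list = []
# On Python 2, complete_text may be a byte string that needs decoding;
# on Python 3, str has no .decode(), so fall back to the text as-is.
try:
    aux = complete_text.decode().split()
except AttributeError:
    aux = complete_text.split()
for i in aux:
    # Lowercase before the membership test so duplicates are caught.
    word = i.lower()
    if word not in my_list:
        my_list.append(word)
my_list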

with output:

['cats',
 'catlike',
 'catty',
 'cat',
 'stemmer',
 'stemming',
 'stemmed',
 'stem',
 'fishing',
 'fished',
 'fisher',
 'fish',
 'argue',
 'argued',
 'argues',
 'arguing',
 'argus',
 'argu',
 'argument',
 'arguments']

And now create the dictionary:

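aux = pd.DataFrame(my_list, columns =['word'] )
# Stem every word.
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x))
# Replace each word by the comma-separated list of all words sharing its stem.
aux = aux.groupby('word_stemmed').transform(lambda x: ', '.join(x))
# Recompute the stem from the first word of each joined group.
aux['word_stemmed'] = aux['word'].apply(lambda x : stemmer.stem(x.split(',')[0]))
# Use the stem as index and turn the frame into a {stem: words} dictionary.
aux.index = aux['word_stemmed']
del aux['word_stemmed']
my_dict = aux.to_dict('dict')['word']
my_dict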

The output is:

{'argu': 'argue, argued, argues, arguing, argus, argu',
 'argument': 'argument, arguments',
 'cat': 'cats, cat',
 'catlik': 'catlike',
 'catti': 'catty',
 'fish': 'fishing, fished, fish',
 'fisher': 'fisher',
 'stem': 'stemming, stemmed, stem',
 'stemmer': 'stemmer'}

Companion notebook here.

score 5 · Answer 2 · answered May 15 '15 at 19:44

No, there isn't. With stemming, you lose information, not only about the word form (as in eat vs. eats or eaten), but also about the word itself (as in tradition vs. traditional). Unless you're going to use a prediction method to try and predict this information on the basis of the context of the word, there's no way to get it back.
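To illustrate the information loss, here is a minimal sketch. The function toy_stem is a hypothetical stand-in for a real stemmer (it only strips a few suffixes and is not how Porter or Snowball work); the point is only that several distinct words collapse to the same stem, so the stem alone cannot tell you which one you started from.

```python
from collections import defaultdict

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a few suffixes."""
    for suffix in ("al", "en", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

words = ["eat", "eats", "eaten", "tradition", "traditional"]

# Group the original words by the stem they collapse to.
preimages = defaultdict(list)
for w in words:
    preimages[toy_stem(w)].append(w)

# Any stem with more than one original word is ambiguous to "un-stem".
ambiguous = {stem: ws for stem, ws in preimages.items() if len(ws) > 1}
print(ambiguous)
```

With a real stemmer the groups will differ, but the many-to-one behaviour is the same.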

score 3 · Answer 3 · answered Sep 05 '18 at 18:54

tl;dr: you could use any stemmer you want (e.g.: Snowball) and keep track of what word was the most popular before stemming for each stemmed word by counting occurrences.

You may like this open-source project which uses Stemming and contains an algorithm to do Inverse Stemming:

https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA

The project page explains how to do the Inverse Stemming. To sum things up, it works as follows.

First, you stem some documents. Here, for example, are short (French-language) strings that have been stemmed and had their stop words removed: ['sup chat march trottoir', 'sup chat aiment ronron', 'chat ronron', 'sup chien aboi', 'deux sup chien', 'combien chien train aboi']

The trick is that, while stemming, you kept a count of how often each original word occurred for each stemmed word: {'aboi': {'aboie': 1, 'aboyer': 1}, 'aiment': {'aiment': 1}, 'chat': {'chat': 1, 'chats': 2}, 'chien': {'chien': 1, 'chiens': 2}, 'combien': {'Combien': 1}, 'deux': {'Deux': 1}, 'march': {'marche': 1}, 'ronron': {'ronronner': 1, 'ronrons': 1}, 'sup': {'super': 4}, 'train': {'train': 1}, 'trottoir': {'trottoir': 1}}

Finally, you can probably guess how to implement this yourself: for each stemmed word, take the original word with the highest count. You can refer to the following implementation, which is available under the MIT License as part of the Multilingual-Latent-Dirichlet-Allocation-LDA project:

https://github.com/ArtificiAI/Multilingual-Latent-Dirichlet-Allocation-LDA/blob/master/lda_service/logic/stemmer.py

Improvements could be made by ditching the non-top reverse words (by using a heap for example) which would yield just one dict in the end instead of a dict of dicts.
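The counting idea can be sketched in plain Python as follows. This is not the code from the linked project; build_unstem_table and toy_stem are hypothetical names, the token list is made up for illustration, and toy_stem is only a placeholder for whichever real stemmer you use (for example a Snowball stemmer's stem method).

```python
from collections import Counter, defaultdict

def build_unstem_table(tokens, stem):
    """Map each stem to the original word seen most often for it."""
    counts = defaultdict(Counter)
    for token in tokens:
        counts[stem(token)][token] += 1
    # Keep only the most frequent original word per stem, which leaves
    # a single flat dict instead of a dict of dicts.
    return {s: c.most_common(1)[0][0] for s, c in counts.items()}

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: drops a trailing 's'."""
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

# Illustrative tokens (assumption, not the project's data).
tokens = ["chats", "chat", "chats", "chiens", "chien", "chiens", "super"]
table = build_unstem_table(tokens, toy_stem)
print(table)
```

When two original words are tied, Counter.most_common keeps the one that was seen first, so the result depends on token order in that case.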

score 2 · Accepted Answer · answered May 15 '15 at 20:21

I suspect what you really mean by stem is "tense", i.e. you want the different tenses of each word to count towards the "base form" of the verb.

Check out the pattern package:

pip install pattern

Then use the en.lemma function to return a verb's base form.

import pattern.en as en
base_form = en.lemma('ate') # base_form == "eat"

answered May 15 '15 at 20:21

steve

This sounds reasonable. I'll check it as soon as I get a chance. Thanks @Steve :) – silent_dev May 16 '15 at 07:04
This is not working. Example: **richer** should be **rich** but it's giving **richer** only – silent_dev May 16 '15 at 12:29
If it's not doing the job, you can "unaccept" this answer and wait for more suggestions. – alexis Jun 05 '15 at 17:25

score 1 · Answer 5 · answered Jun 05 '15 at 16:04

Theoretically, the only way to unstem is to keep a dictionary of terms, or a mapping of some kind, before stemming and to carry that mapping through the rest of your computations. The mapping should capture the position of each unstemmed token, so that when you need to unstem a token and you know where the stemmed token came from, you can trace back and recover the original unstemmed representation.

For the Bag of Words representation this seems computationally intensive and somewhat defeats the purpose of the statistical nature of the BoW approach.

But again, theoretically I believe it could work. I haven't seen it in any implementation, though.
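A minimal sketch of such a position-preserving mapping, assuming the tokens are kept in a list alongside their stems. As before, toy_stem is a hypothetical placeholder for a real stemmer and the sentence is made up for illustration.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips 'ing' or 's'."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

tokens = "the cats were fishing while the cat slept".split()

# Stem every token but keep the original at the same index.
stems = [toy_stem(t) for t in tokens]

# Downstream code works on `stems`; the index of a stemmed token
# is enough to recover the original word it came from.
position = stems.index("fish")
original = tokens[position]
print(position, original)
```

The cost is that both lists have to be kept in step for the rest of the pipeline, which is the overhead this answer warns about.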
