How to stem words in a Python list?

Question

I have a Python list like the one below:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Now I need to stem each word in it and get another list. How do I do that?

You are going to need to define what you mean, exactly, by 'stem'. Can we assume it's always going to be English? — Gareth Latty, Feb 18 '12 at 18:32
Maybe the [stemming](http://pypi.python.org/pypi/stemming/1.0) package, if you're looking for stemming English words? — wkl, Feb 18 '12 at 18:32
Commenters: [Stemming on Wikipedia](http://en.wikipedia.org/wiki/Stemming). The question is still ambiguous, though -- there are any number of stemming strategies; do you have one in particular in mind? (Porter?) — Cameron, Feb 18 '12 at 18:34
@Cameron is correct. Stemming is process of taking word into its root form. — ChamingaD, Feb 18 '12 at 18:36
@birryree Ya, that does what i wanted. But how to iterate through whole list and stem all words ? from stemming.porter2 import stem stem(word) - returns stemmed word — ChamingaD, Feb 18 '12 at 18:38

Gareth Latty · Accepted Answer · 2012-02-18T19:07:17.817

from stemming.porter2 import stem

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]

This uses a nested list comprehension: the outer part loops through each string in the main list and splits it into a list of words; the inner part loops through those words, stemming each one, and builds a new list of stemmed words.

Please note that I haven't tried this with the stemming package installed; I took the import from the comments and have never used it myself. This is, however, the basic approach for splitting the list into words. Note that it produces a list of lists of words, keeping the original separation into sentences.

If you do not want this separation, you can do:

documents = [stem(word) for sentence in documents for word in sentence.split(" ")]

This leaves you with one continuous list of words.

If you wish to join the words back together at the end, you can do:

documents = [" ".join(sentence) for sentence in documents]

or to do it in one line:

documents = [" ".join([stem(word) for word in sentence.split(" ")]) for sentence in documents]

Both of these keep the sentence structure. To ignore it and join the flat list of words into a single string, use:

documents = " ".join(documents)
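The variants above can be tried without installing anything by swapping in a stand-in stemmer. The sketch below is only an illustration: `toy_stem` and `stem_documents` are hypothetical names, and `toy_stem` is a crude suffix stripper, not the Porter algorithm, so its output will differ from what `stemming.porter2.stem` returns.

```python
def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: lowercase, then strip one common suffix."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Only strip when at least three characters of stem remain.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def stem_documents(documents, stem=toy_stem):
    """Return one list of stemmed words per sentence."""
    return [[stem(word) for word in sentence.split()] for sentence in documents]


documents = ["The intersection graph of paths in trees",
             "Graph minors A survey"]

nested = stem_documents(documents)                    # list of lists, one per sentence
flat = [word for words in nested for word in words]   # one continuous list of words
joined = [" ".join(words) for words in nested]        # one stemmed string per sentence

print(nested)
print(flat)
print(joined)  # ['the intersection graph of path in tree', 'graph minor a survey']
```

To use a real stemmer, pass it as the second argument, for example `stem_documents(documents, stem)` after `from stemming.porter2 import stem`.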

That won't work; each "word" in your listcomp will be a list. — DSM, Feb 18 '12 at 18:46
Thanks. it stems but splits each word in list. `['comput', 'compil', 'translat', 'sourc', 'code', 'into', 'object', 'code,', 'while', 'interpret', 'execut', 'the', 'program'] ['A', 'compil', 'compil', 'your', 'code', 'into', 'a', '"runable"', 'applic', '(e.g:', 'a', '.ex', 'file)', 'where', 'as', 'an', 'intepret', 'run', 'the', 'sourc', 'code', 'as', 'it', 'goe']` — ChamingaD, Feb 18 '12 at 19:02
@Max, did you ever figure out whether its a package in python3? I seem to be having issues with it now — bernando_vialli, Nov 30 '17 at 16:51

score 7 · Answer 2 · edited May 23 '17 at 12:02

You might want to have a look at NLTK (the Natural Language Toolkit). It has a module, `nltk.stem`, which contains several different stemmers.

See also this question.

answered Feb 18 '12 at 18:35

Thomas

Thanks :) can i know how to iterate through whole list and stem all words ? – ChamingaD Feb 18 '12 at 18:40
@ChamingaD: `words = [w for line in documents for w in line.split()]`. Or even `words = ' '.join(documents).split()` – Niklas B. Feb 18 '12 at 18:46
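The two expressions in the comment above should give the same flat list of words, because `str.split()` with no argument splits on any run of whitespace. A small check, using made-up sample data with a deliberate double space, also shows how `split(" ")` behaves differently:

```python
documents = ["Graph minors  A survey",      # note the double space
             "The EPS user interface"]

# Both flattening approaches from the comment
words_a = [w for line in documents for w in line.split()]
words_b = " ".join(documents).split()
assert words_a == words_b

# split(" ") yields an empty string wherever two spaces are adjacent
words_c = [w for line in documents for w in line.split(" ")]

print(words_a)
print(words_c)
```

If the input may contain repeated spaces, tabs, or newlines, `split()` is the safer choice before stemming, since it never produces empty "words".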

score 4 · Answer 3 · answered Feb 18 '12 at 18:43

Alright. So, using the stemming package, you'd have something like this:

from stemming.porter2 import stem
from itertools import chain

def flatten(listOfLists):
    """Flatten one level of nesting."""
    return list(chain.from_iterable(listOfLists))

def stemall(documents):
    """Stem every word in every document and return one flat list."""
    return flatten([[stem(word) for word in line.split(" ")] for line in documents])

cha0site

How can i stop splitting each word in final list ? – ChamingaD Feb 18 '12 at 19:06
By joining them together using `" ".join(list_of_words)` – Tuan Anh Hoang-Vu Apr 11 '13 at 17:46

score 3 · Answer 4 · answered Nov 25 '17 at 09:48

You can use NLTK:

from nltk.stem import PorterStemmer


ps = PorterStemmer()
final = [[ps.stem(token) for token in sentence.split(" ")] for sentence in documents]

NLTK has many features for information retrieval (IR) systems; it is worth checking out.

Arash Hatami

score 2 · Answer 5 · edited Jun 05 '18 at 02:12

from nltk.stem import PorterStemmer

ps = PorterStemmer()
# words: a flat list of word strings (avoid naming it "list", which shadows the built-in)
list_stem = [ps.stem(word) for word in words]

edited by Stephen Rauch

answered Jun 05 '18 at 01:48

Gigi

score 1 · Answer 6 · answered May 23 '18 at 17:29

You could use Whoosh (http://whoosh.readthedocs.io/):

from whoosh.analysis import CharsetFilter, StemmingAnalyzer
from whoosh.support.charset import accent_map

# Tokenize, lowercase, remove stop words and stem, then fold accented characters
my_analyzer = StemmingAnalyzer() | CharsetFilter(accent_map)

tokens = my_analyzer("hello you, comment ça va ?")
words = [token.text for token in tokens]

print(' '.join(words))

score 0 · Answer 7 · answered Sep 25 '18 at 10:26

You can use either `PorterStemmer` or `LancasterStemmer` (both in `nltk.stem`) for stemming.
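As a rough sketch of comparing the two on the question's data, assuming NLTK is installed (the import is guarded so the snippet still runs without it) and using a hypothetical helper named `stem_with`:

```python
documents = ["The generation of random binary unordered trees",
             "Graph minors A survey"]


def stem_with(stemmer, documents):
    """Stem each whitespace-separated word using any object that has a .stem(word) method."""
    return [[stemmer.stem(word) for word in doc.split()] for doc in documents]


try:
    from nltk.stem import LancasterStemmer, PorterStemmer
    stemmers = {"porter": PorterStemmer(), "lancaster": LancasterStemmer()}
except ImportError:
    stemmers = {}  # NLTK is not installed, so there is nothing to compare

# One nested list of stems per stemmer, keeping the sentence structure
results = {name: stem_with(stemmer, documents) for name, stemmer in stemmers.items()}
for name, stemmed in results.items():
    print(name, stemmed)
```

Lancaster is generally the more aggressive of the two, so its stems tend to be shorter; printing both side by side is a quick way to decide which suits your data.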

9113303
