22

I have a Python list like the one below:

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

Now I need to stem each word in it and get another list. How do I do that?

Ismail Badawi
ChamingaD
  • 1
    What do you mean by "stem"? Can you provide sample output? – cha0site Feb 18 '12 at 18:32
  • You are going to need to define what you mean, exactly, by 'stem'. Can we assume it's always going to be English? – Gareth Latty Feb 18 '12 at 18:32
  • 1
    Maybe the [stemming](http://pypi.python.org/pypi/stemming/1.0) package, if you're looking for stemming English words? – wkl Feb 18 '12 at 18:32
  • Commenters: [Stemming on Wikipedia](http://en.wikipedia.org/wiki/Stemming). The question is still ambiguous, though -- there are any number of stemming strategies; do you have one in particular in mind? (Porter?) – Cameron Feb 18 '12 at 18:34
  • @Cameron is correct. Stemming is process of taking word into its root form. – ChamingaD Feb 18 '12 at 18:36
  • @birryree Ya, that does what i wanted. But how to iterate through whole list and stem all words ? from stemming.porter2 import stem stem(word) - returns stemmed word – ChamingaD Feb 18 '12 at 18:38

7 Answers

42
from stemming.porter2 import stem

documents = ["Human machine interface for lab abc computer applications",
             "A survey of user opinion of computer system response time",
             "The EPS user interface management system",
             "System and human system engineering testing of EPS",
             "Relation of user perceived response time to error measurement",
             "The generation of random binary unordered trees",
             "The intersection graph of paths in trees",
             "Graph minors IV Widths of trees and well quasi ordering",
             "Graph minors A survey"]

documents = [[stem(word) for word in sentence.split(" ")] for sentence in documents]

What we are doing here is using a list comprehension to loop through each string inside the main list, splitting that into a list of words. Then we loop through that list, stemming each word as we go, returning the new list of stemmed words.

Please note I haven't tried this with stemming installed - I have taken that from the comments, and have never used it myself. This is, however, the basic concept for splitting the list into words. Note that this will produce a list of lists of words, keeping the original separation.

If you do not want this separation, you can do:

documents = [stem(word) for sentence in documents for word in sentence.split(" ")]

This leaves you with one continuous, flat list of stemmed words instead.

If you wish to join the words back together at the end, you can do:

documents = [" ".join(sentence) for sentence in documents]

or to do it in one line:

documents = [" ".join([stem(word) for word in sentence.split(" ")]) for sentence in documents]

if you want to keep the sentence structure, or

documents = " ".join(documents)

if you want to ignore it.
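To make the difference between these result shapes concrete, here is a minimal sketch that runs without any stemming package installed. `toy_stem` is a hypothetical stand-in (it only strips one trailing "s") and is not a real Porter stemmer; swap in `stem` from `stemming.porter2` for actual use. It also uses `split()` rather than `split(" ")`, an assumption on my part that collapsing repeated whitespace is acceptable.

```python
# Hypothetical toy rule standing in for a real stemmer such as
# stemming.porter2.stem: strip a single trailing "s". Illustration only.
def toy_stem(word):
    return word[:-1] if word.endswith("s") and len(word) > 1 else word

docs = ["Graph minors A survey", "The intersection graph of paths in trees"]

# One list of stemmed words per sentence (sentence boundaries kept).
nested = [[toy_stem(w) for w in s.split()] for s in docs]

# One flat list of stemmed words (sentence boundaries dropped).
flat = [toy_stem(w) for s in docs for w in s.split()]

# One stemmed string per sentence, rebuilt from the nested result.
rejoined = [" ".join(words) for words in nested]

print(nested)
print(flat)
print(rejoined)
```

Which shape is right depends on what consumes the result: per-sentence lists suit most bag-of-words tooling, while the rejoined strings are convenient if a later step does its own tokenizing.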

Gareth Latty
  • That won't work; each "word" in your listcomp will be a list. – DSM Feb 18 '12 at 18:46
  • Thanks. it stems but splits each word in list. `['comput', 'compil', 'translat', 'sourc', 'code', 'into', 'object', 'code,', 'while', 'interpret', 'execut', 'the', 'program'] ['A', 'compil', 'compil', 'your', 'code', 'into', 'a', '"runable"', 'applic', '(e.g:', 'a', '.ex', 'file)', 'where', 'as', 'an', 'intepret', 'run', 'the', 'sourc', 'code', 'as', 'it', 'goe']` – ChamingaD Feb 18 '12 at 19:02
  • @ChamingaD Edited to include a way to rejoin the lists. – Gareth Latty Feb 18 '12 at 19:07
  • 2
    Is stemming no longer a package in Python 3? – Max Mar 15 '14 at 20:30
  • @Max, did you ever figure out whether its a package in python3? I seem to be having issues with it now – bernando_vialli Nov 30 '17 at 16:51
7

You might want to have a look at the NLTK (Natural Language ToolKit). It has a module nltk.stem which contains various different stemmers.

See also this question.

Community
Thomas
4

Alright. So, using the stemming package, you'd have something like this:

from stemming.porter2 import stem
from itertools import chain

def flatten(listOfLists):
    "Flatten one level of nesting"
    return list(chain.from_iterable(listOfLists))

def stemall(documents):
    return flatten([ [ stem(word) for word in line.split(" ")] for line in documents ])
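As a possible variation on the same idea, `chain.from_iterable` accepts any iterable of iterables, so the intermediate list of lists can be replaced by generator expressions. The sketch below assumes that is acceptable for your data; `stemall_lazy` and `toy_stem` are hypothetical names, and `toy_stem` merely lowercases so the example runs without the stemming package.

```python
from itertools import chain

def toy_stem(word):
    # Hypothetical stand-in for stemming.porter2.stem; it only lowercases.
    return word.lower()

def stemall_lazy(documents, stem=toy_stem):
    """Stem every word of every line and return one flat list."""
    # Each line yields a generator of stemmed words; nothing is
    # materialized until chain.from_iterable is consumed by list().
    per_line = ((stem(word) for word in line.split()) for line in documents)
    return list(chain.from_iterable(per_line))

print(stemall_lazy(["Graph Minors", "A Survey"]))
```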
cha0site
3

You can use NLTK:

from nltk.stem import PorterStemmer


ps = PorterStemmer()
final = [[ps.stem(token) for token in sentence.split(" ")] for sentence in documents]

NLTK has many features for IR systems and is worth a look.

Arash Hatami
2
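from nltk.stem import PorterStemmer

ps = PorterStemmer()
# `words` is a flat list of tokens; naming it `list` would shadow the built-in
words = ["computer", "applications", "trees"]
list_stem = [ps.stem(word) for word in words]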
Stephen Rauch
Gigi
1

You could use Whoosh (http://whoosh.readthedocs.io/):

from whoosh.analysis import CharsetFilter, StemmingAnalyzer
from whoosh import fields
from whoosh.support.charset import accent_map

my_analyzer = StemmingAnalyzer() | CharsetFilter(accent_map)

tokens = my_analyzer("hello you, comment ça va ?")
words = [token.text for token in tokens]

print(' '.join(words))
Thomas Decaux
0

You can use either PorterStemmer or LancasterStemmer (both available in nltk.stem) for stemming.
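Since the choice of stemmer is open, one option is to pass the stemming function in as an argument, so the same code works with `PorterStemmer().stem`, `LancasterStemmer().stem`, or `stem` from `stemming.porter2`. The sketch below is only an illustration: `stem_documents` and `identity_stem` are hypothetical names, and the regex tokenizer (lowercase ASCII letters only) is an assumption intended to avoid tokens such as `'code,'` that a plain `split(" ")` leaves behind.

```python
import re

# Assumed tokenization rule: runs of ASCII letters, after lowercasing.
TOKEN_RE = re.compile(r"[a-z]+")

def stem_documents(documents, stem):
    """Return one list of stemmed tokens per document.

    `stem` is any callable mapping a word to its stem.
    """
    return [[stem(tok) for tok in TOKEN_RE.findall(doc.lower())]
            for doc in documents]

def identity_stem(word):
    # Hypothetical stand-in so the sketch runs without NLTK installed.
    return word

result = stem_documents(["Graph minors: a survey!"], identity_stem)
print(result)
```

With NLTK installed, the call would look like `stem_documents(documents, PorterStemmer().stem)`. Note that this tokenizer drops digits and non-ASCII letters, which may not suit every corpus.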

9113303