Stemming words with NLTK (python)

Question

I am new to text processing in Python. I am trying to stem the words in a text document that has around 5000 rows.

I have written the script below:

import re

from bs4 import BeautifulSoup
from nltk.corpus import stopwords # Import the stop word list
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer('english')

def Description_to_words(raw_Description ):
    # 1. Remove HTML
    Description_text = BeautifulSoup(raw_Description).get_text() 
    # 2. Remove non-letters        
    letters_only = re.sub("[^a-zA-Z]", " ", Description_text) 
    # 3. Convert to lower case, split into individual words
    words = letters_only.lower().split()                       

    stops = set(stopwords.words("english"))                  
    # 4. Remove stop words
    meaningful_words = [w for w in words if not w in stops]   
    # 5. stem words
    words = ([stemmer.stem(w) for w in words])

    # 6. Join the words back into one string separated by space, 
    # and return the result.
    return( " ".join( meaningful_words ))   

clean_Description = Description_to_words(train["Description"][15])

But when I test it, the words in the result are not stemmed. Can anyone help me find the issue? I think I am doing something wrong in the "Description_to_words" function.

However, when I run the stemmer separately, as below, it works:

>>> from nltk.tokenize import sent_tokenize, word_tokenize
>>> words = word_tokenize("MOBILE APP - Unable to add reading")
>>> 
>>> for w in words:
...     print(stemmer.stem(w))
... 
mobil
app
-
unabl
to
add
read

cs95 · Accepted Answer · 2017-08-14T09:01:11.617

5

Here's each step of your function, fixed.

Remove HTML.

Description_text = BeautifulSoup(raw_Description, "html.parser").get_text()

Remove punctuation, but don't remove whitespace just yet. You can also simplify your regex a bit. Note that `\w` also matches digits and underscores, so this pattern keeps them, unlike `[^a-zA-Z]`.
```
letters_only = re.sub(r"[^\w\s]", " ", Description_text)
```
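To see how the two patterns differ, here is a small sketch. The sample string and the variable name `word_chars_kept` are made up for illustration and are not from your data.

```python
import re

# Hypothetical sample text, chosen to contain digits and punctuation.
sample = "App v2.0 - can't sync"

# Original pattern: every character that is not an ASCII letter becomes a space,
# so digits are dropped along with punctuation.
letters_only = re.sub(r"[^a-zA-Z]", " ", sample)

# Suggested pattern: only characters that are neither word characters
# (letters, digits, underscore) nor whitespace become a space.
word_chars_kept = re.sub(r"[^\w\s]", " ", sample)

print(letters_only.split())     # ['App', 'v', 'can', 't', 'sync']
print(word_chars_kept.split())  # ['App', 'v2', '0', 'can', 't', 'sync']
```

If you really do want letters only, keep your original `[^a-zA-Z]` pattern; both leave the whitespace in place for the tokenizing step.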

Convert to lower case, split into individual words: I recommend using word_tokenize again, here.

from nltk.tokenize import word_tokenize
words = word_tokenize(letters_only.lower())

Remove stop words.

stops = set(stopwords.words("english"))
meaningful_words = [w for w in words if w not in stops]

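Stem words. Here is the main issue: your function stems `words` but then returns the unstemmed `meaningful_words`, so the stemmed result is thrown away. Stem meaningful_words, not words, and return that.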
```
return ' '.join(stemmer.stem(w) for w in meaningful_words)
```
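Putting the steps together, a minimal sketch of the fixed function could look like the following. To keep it runnable without downloading any NLTK data, it makes a few assumptions that are not part of your script: it skips the HTML step, uses `str.split()` instead of `word_tokenize`, and uses a tiny hand-written stop-word set (`STOPS`, hypothetical) in place of `stopwords.words("english")`. The function name `description_to_words` is just a lower-case variant of yours.

```python
import re

from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")

# Hypothetical, deliberately tiny stop-word set so the sketch needs no corpus
# download; in practice use set(stopwords.words("english")) instead.
STOPS = {"to", "the", "a", "an", "is", "and", "of"}


def description_to_words(raw_description, stops=STOPS):
    # Replace punctuation with spaces, keeping word characters and whitespace.
    cleaned = re.sub(r"[^\w\s]", " ", raw_description)
    # Lower-case and split on whitespace (word_tokenize would also work here).
    words = cleaned.lower().split()
    # Drop stop words.
    meaningful_words = [w for w in words if w not in stops]
    # Stem the filtered words, and return the stemmed result.
    return " ".join(stemmer.stem(w) for w in meaningful_words)


print(description_to_words("MOBILE APP - Unable to add reading"))
# mobil app unabl add read
```

The key point is the last line of the function: the list that gets stemmed is the same one that gets joined and returned.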

edited Aug 14 '17 at 09:01

answered Aug 14 '17 at 08:46

this is simply great. Thanks a lot for your response. It works. I am very happy :) – user3734568 Aug 14 '17 at 08:56
Just one question: we can use the same logic for lemmatization with word.lemmatize(), correct? – user3734568 Aug 14 '17 at 08:57
@user3734568 Yes, you can, just by changing `stemmer.stem(w)` to `lemmatizer.lemmatize(w)`. – cs95 Aug 14 '17 at 08:59
Thanks a lot for your help. – user3734568 Aug 14 '17 at 09:01