How to create a function that tokenizes and stems words

Question

My code

def tokenize_and_stem(text):

    tokens = [sent for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(text)]

    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

    stems = stemmer.stem(filtered_tokens)

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)

and I'm getting this error:

AttributeError                            Traceback (most recent call last)
in
     13     return stems
     14
---> 15 words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
     16 print(words_stemmed)

in tokenize_and_stem(text)
      9
     10     # Stem the filtered_tokens
---> 11     stems = stemmer.stem(filtered_tokens)
     12
     13     return stems

/usr/local/lib/python3.6/dist-packages/nltk/stem/snowball.py in stem(self, word)
   1415
   1416         """
-> 1417         word = word.lower()
   1418
   1419         if word in self.stopwords or len(word) <= 2:

AttributeError: 'list' object has no attribute 'lower'

It looks like `stemmer.stem` expects a string, not a list of strings. You might try `stems = list(map(stemmer.stem, filtered_tokens))`. And add a `return stems` to your function. — brentertainer, Nov 20 '19 at 14:47
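A minimal sketch of that suggestion, so the failure and the fix can be seen side by side. `stem_stub` is a hypothetical stand-in for `stemmer.stem` (it only lowercases, like the first step shown in the traceback) so that the snippet runs without NLTK; the token list is also made up for illustration.

```python
def stem_stub(word):
    """Hypothetical stand-in for stemmer.stem: it expects ONE string."""
    return word.lower()

filtered_tokens = ["Today", "May", "wedding"]

# Passing the whole list reproduces the error from the question.
try:
    stem_stub(filtered_tokens)
except AttributeError as exc:
    error_message = str(exc)

# Applying the function to each token, one at a time, works.
stems = list(map(stem_stub, filtered_tokens))

print(error_message)  # 'list' object has no attribute 'lower'
print(stems)          # ['today', 'may', 'wedding']
```

With the real stemmer the same pattern should apply: call `stemmer.stem` once per token instead of once on the list.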

AbdulRahim Khan · Accepted Answer · 2019-11-26T02:27:12.273

1

YOUR CODE

def tokenize_and_stem(text):

    tokens = [sent for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(text)]

    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

    stems = stemmer.stem(filtered_tokens)

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)

The traceback shows `word = word.lower()`, followed by `if word in self.stopwords or len(word) <= 2:`, and ends with `'list' object has no attribute 'lower'`.

The error is not only because of `.lower()`: the length check on the next line also assumes that `word` is a single string, not a list. If you fix only the stemming step and keep the rest of your code as it is, you will get no error, but the output will look like this:

["today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding.", "today (may 19, 2016) is his only daughter's wedding."]
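To see why the whole sentence is repeated, here is a small sketch of the two comprehensions. It assumes simple `str.split`-based helpers (`split_sentences` and `split_words`, both hypothetical) in place of `nltk.sent_tokenize` and `nltk.word_tokenize`, so it runs without NLTK; only the shape of the comprehension matters here.

```python
def split_sentences(s):
    # Hypothetical stand-in for nltk.sent_tokenize.
    return [part.strip() + "." for part in s.split(".") if part.strip()]

def split_words(s):
    # Hypothetical stand-in for nltk.word_tokenize.
    return s.replace(".", " .").split()

text = "Today is his wedding. It is sunny."

# Buggy shape: yields `sent` once for every word in the WHOLE text.
buggy = [sent for sent in split_sentences(text) for word in split_words(text)]

# Fixed shape: yields each `word` of the CURRENT sentence.
fixed = [word for sent in split_sentences(text) for word in split_words(sent)]

print(len(buggy), set(buggy))
print(fixed)
```

The buggy version emits the outer variable (`sent`) and tokenizes `text` instead of `sent`, so every sentence appears once per word in the text.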

Here is your fixed code.

def tokenize_and_stem(text):

    tokens = [word for sent in nltk.sent_tokenize(text) for word in nltk.word_tokenize(sent)]

    filtered_tokens = [token for token in tokens if re.search('[a-zA-Z]', token)]

    stems = [stemmer.stem(t) for t in filtered_tokens if len(t) > 0]

    return stems

words_stemmed = tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")
print(words_stemmed)

So I have only changed line 3 (the `tokens` comprehension) and line 7 (the stemming step).

edited Nov 26 '19 at 02:27

answered Nov 26 '19 at 02:20

AbdulRahim Khan


Thank you, your code not only worked but also helped me in my project because you didn't change my code – AceDasXan Nov 26 '19 at 02:58
@alvas welp, the answer from harish worked but caused an error later in my code; this one worked because he didn't change any of my code – AceDasXan Nov 26 '19 at 04:01
@alvas so, what's wrong with my code?? if the creator of this post verifies my answer then the discussion is pointless. – AbdulRahim Khan Nov 26 '19 at 04:16
Yes, it's a bad answer because it's a means to get an output at multiple times the speed. – alvas Nov 26 '19 at 04:18
Optimize the answer and you'll get the encouragement based on the answer from https://stackoverflow.com/a/47788736/610569 Sorry for the abrupt harshness but having people use bad code from copy+paste from stackoverflow will cause quite a lot of issues later on =) – alvas Nov 26 '19 at 04:20
bruh... i just don't type the code. i fix it then share my answer – AbdulRahim Khan Nov 26 '19 at 04:22
And how would I copy+paste? I've never used Stack Overflow, I'm a newbie, and it took me 3 or 4 minutes to solve that problem – AbdulRahim Khan Nov 26 '19 at 04:25
Not you copy+pasting, I mean people who read the answers and copy+paste the non-optimized solution. No worries, make the fixes and we learn together =) – alvas Nov 26 '19 at 05:14
Well, it is optimized code: it belongs to a DataCamp project that I have completed. I saw that this person was doing the same project and had hit an error, so I used what I know to solve his problem. You can't just call it "non-optimized" without knowing what the code belongs to – AbdulRahim Khan Nov 26 '19 at 11:41

score 0 · Answer 2 · answered Nov 20 '19 at 14:52

import nltk
import string
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
def tokenize_and_stem(text):
    tokens = nltk.tokenize.word_tokenize(text)
    # strip out punctuation and make lowercase
    tokens = [token.lower().strip(string.punctuation)
              for token in tokens if token.isalnum()]

    # now stem the tokens
    tokens = [stemmer.stem(token) for token in tokens]

    return tokens

tokenize_and_stem("Today (May 19, 2016) is his only daughter's wedding.")

Output:

['today', 'may', '19', '2016', 'is', 'hi', 'onli', 'daughter', 'wed']
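Note that this answer filters tokens differently from the question, which is why `19` and `2016` appear in its output. A small sketch of the two filters follows; the token list is hand-written as an assumption of what the tokenizer returns for the example sentence, so the snippet runs without NLTK.

```python
import re

# Assumed tokenizer output for the example sentence (hand-written here).
tokens = ["Today", "(", "May", "19", ",", "2016", ")", "is", "his",
          "only", "daughter", "'s", "wedding", "."]

# Filter from the question: keep tokens that contain at least one letter.
has_letter = [t for t in tokens if re.search('[a-zA-Z]', t)]

# Filter from this answer: keep tokens made only of letters and digits.
alnum = [t for t in tokens if t.isalnum()]

print(has_letter)  # keeps "'s", drops "19" and "2016"
print(alnum)       # keeps "19" and "2016", drops "'s"
```

Which filter is right depends on whether numbers should count as words in the rest of your pipeline.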
