Tokenizing words into a new column in a pandas dataframe

Question

I am trying to go through a list of comments stored in a pandas dataframe, tokenize each comment, and put the resulting words in a new column of the dataframe, but I am getting an error when I run this.

The error is: AttributeError: 'unicode' object has no attribute 'apwords'

Is there any other way to do this? Thanks

def apwords(words):
    filtered_sentence = []
    words = word_tokenize(words)
    for w in words:
        filtered_sentence.append(w)
    return filtered_sentence
addwords = lambda x: x.apwords()
df['words'] = df['complaint'].apply(addwords)
print df

ysearka · Accepted Answer · 2016-06-30T11:19:01.723


The way you apply the lambda function is correct; it is the way you define addwords that doesn't work.

When you define apwords, you define a function, not an attribute of the string. So when you want to apply it, use:

addwords = lambda x: apwords(x)

And not:

addwords = lambda x: x.apwords()

If you want to call apwords as a method, you would need to define a class that inherits from string and define apwords as a method of that class.
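For illustration only, here is a minimal sketch of what that subclass approach might look like. The class name `Comment` is hypothetical, and `str.split()` stands in for `word_tokenize` so the snippet runs without extra libraries. It is written for Python 3, whereas the question appears to use Python 2.

```python
# Stand-in tokenizer: str.split() is used only to keep this sketch
# self-contained; the question uses word_tokenize instead.
class Comment(str):
    """Hypothetical str subclass that adds apwords as a method."""

    def apwords(self):
        return self.split()


plain = "the parcel arrived late"
wrapped = Comment(plain)

print(wrapped.apwords())          # works because Comment defines apwords
print(hasattr(plain, "apwords"))  # a plain str has no such method
```

Every value in the column would have to be wrapped in `Comment` before `x.apwords()` could work, which is why the plain function is the simpler route.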

It is far easier to stay with the function:

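# Assumption: word_tokenize comes from NLTK, as its use in the question suggests.
from nltk.tokenize import word_tokenize

def apwords(words):
    # Tokenize one comment and return its tokens as a list.
    filtered_sentence = []
    words = word_tokenize(words)
    for w in words:
        filtered_sentence.append(w)
    return filtered_sentence

# Call apwords as a plain function, not as a method of the string.
addwords = lambda x: apwords(x)
df['words'] = df['complaint'].apply(addwords)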

edited Jun 30 '16 at 11:19

answered Jun 30 '16 at 11:11

ysearka


I tried doing what you and João Almeida suggested, but now I am getting TypeError: expected string or buffer. Is that because, as you said, I have to define a class that inherits from string and use my original method? Thanks – user3655574 Jun 30 '16 at 13:46
No, it must mean that `df['complaint']` contains something other than strings. If you check `df.dtypes`, you should see the `object` type next to `complaint`, don't you? Most likely you have missing values (which aren't strings). In that case, before applying `addwords`, run `df['complaint'] = df['complaint'].fillna('')` to replace `nan` values with empty strings. – ysearka Jun 30 '16 at 13:56
@ysearka, would you be able to adapt this code to pull out sentences that contain a specific word? – Ian_De_Oliveira Jul 26 '18 at 07:28
What do you mean by that? Could you describe the input you have and output you desire? That would make it far easier to understand and answer. – ysearka Jul 26 '18 at 08:17
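The missing-value fix suggested in the comments above can be sketched as follows. This is a hedged Python 3 example, not the asker's actual data: the sample rows are made up, and a simple regex stands in for `word_tokenize` so the snippet needs only pandas.

```python
import re

import pandas as pd


def apwords(text):
    # Stand-in tokenizer (assumption): runs of word characters.
    return re.findall(r"\w+", text)


# Made-up sample data with one missing value.
df = pd.DataFrame(
    {"complaint": ["Late delivery", float("nan"), "Wrong item sent"]}
)

# NaN is a float, so passing it to the tokenizer would raise a TypeError.
# Replace missing values with empty strings first.
df["complaint"] = df["complaint"].fillna("")
df["words"] = df["complaint"].apply(apwords)

print(df)
```

Rows that were missing end up with an empty list in the new column.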

score 0 · Answer 2 · answered Jun 30 '16 at 10:18


Don't you just want to do this:

   df['words'] = df['complaint'].apply(apwords)

You don't need to define the function addwords at all. If you do define it, it should be:

addwords = lambda x: apwords(x)
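As a quick check that the two forms behave the same, here is a small Python 3 sketch. The sample data is made up, and `str.split()` stands in for `word_tokenize` to keep the snippet self-contained.

```python
import pandas as pd


def apwords(text):
    # Stand-in tokenizer (assumption): split on whitespace.
    return text.split()


# Made-up sample data.
df = pd.DataFrame({"complaint": ["late delivery", "wrong item sent"]})

# Passing the function directly...
direct = df["complaint"].apply(apwords)
# ...gives the same result as wrapping it in a lambda.
via_lambda = df["complaint"].apply(lambda x: apwords(x))

print(direct.tolist() == via_lambda.tolist())
```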

answered Jun 30 '16 at 10:18

João Almeida

