how to use word_tokenize in data frame

Question

I have recently started using the nltk module for text analysis. I am stuck at a point. I want to use word_tokenize on a dataframe, so as to obtain all the words used in a particular row of the dataframe.

data example:
       text
1.   This is a very good site. I will recommend it to others.
2.   Can you please give me a call at 9983938428. have issues with the listings.
3.   good work! keep it up
4.   not a very helpful site in finding home decor. 

expected output:

1.   'This','is','a','very','good','site','.','I','will','recommend','it','to','others','.'
2.   'Can','you','please','give','me','a','call','at','9983938428','.','have','issues','with','the','listings'
3.   'good','work','!','keep','it','up'
4.   'not','a','very','helpful','site','in','finding','home','decor'

Basically, I want to separate all the words and find the length of each text in the dataframe.

I know word_tokenize can do this for a string, but how do I apply it to the entire dataframe?

Please help!

Thanks in advance...

Your problem description lacks data inputs, your code, and your desired output. Can you flesh this out? Thanks – EdChum, Oct 13 '15 at 08:54
@EdChum: I have edited the question. I hope it now has the required information. – eclairs, Oct 13 '15 at 09:13

ilyakhov · Accepted Answer · 2015-10-13T09:42:54.857

38

You can use the apply method of the DataFrame API:

import pandas as pd
import nltk

df = pd.DataFrame({'sentences': ['This is a very good site. I will recommend it to others.', 'Can you please give me a call at 9983938428. have issues with the listings.', 'good work! keep it up']})
df['tokenized_sents'] = df.apply(lambda row: nltk.word_tokenize(row['sentences']), axis=1)

Output:

>>> df
                                           sentences  \
0  This is a very good site. I will recommend it ...   
1  Can you please give me a call at 9983938428. h...   
2                              good work! keep it up   

                                     tokenized_sents  
0  [This, is, a, very, good, site, ., I, will, re...  
1  [Can, you, please, give, me, a, call, at, 9983...  
2                      [good, work, !, keep, it, up]

To find the length of each text, use apply with a lambda function again:

df['sents_length'] = df.apply(lambda row: len(row['tokenized_sents']), axis=1)

>>> df
                                           sentences  \
0  This is a very good site. I will recommend it ...   
1  Can you please give me a call at 9983938428. h...   
2                              good work! keep it up   

                                     tokenized_sents  sents_length  
0  [This, is, a, very, good, site, ., I, will, re...            14  
1  [Can, you, please, give, me, a, call, at, 9983...            15  
2                      [good, work, !, keep, it, up]             6
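
Both columns can also be built from the column itself rather than row by row. The following is a minimal sketch under stated assumptions: `simple_tokenize` is a hypothetical regex stand-in so the snippet runs without NLTK's tokenizer data, and `nltk.word_tokenize` can be passed to `apply` in its place. It may split some inputs differently from NLTK.

```python
import re

import pandas as pd


def simple_tokenize(text):
    """Hypothetical stand-in for nltk.word_tokenize: words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


df = pd.DataFrame({"sentences": [
    "This is a very good site. I will recommend it to others.",
    "good work! keep it up",
]})

# Apply the tokenizer to the column (a Series) instead of to each row.
df["tokenized_sents"] = df["sentences"].apply(simple_tokenize)

# .str.len() also works on list elements, so no second apply is needed.
df["sents_length"] = df["tokenized_sents"].str.len()

print(df[["tokenized_sents", "sents_length"]])
```

For these two sentences the lengths come out as 14 and 6, matching the counts shown above.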

edited Oct 13 '15 at 09:42

answered Oct 13 '15 at 09:00


how can we do this when there are multiple rows in the dataframe? – eclairs Oct 13 '15 at 09:17
@eclairs, what do you mean? – ilyakhov Oct 13 '15 at 09:31
I am getting this error message when trying to tokenize: – eclairs Oct 13 '15 at 10:54

I am getting this error message when trying to tokenize: TypeError: ('expected string or buffer', u'occurred at index 1') – eclairs Oct 13 '15 at 11:01
I don't have enough information about your case. Please add to the question exactly which dataframe you are using; the data in your question is in the wrong format. Have you tried running my code step by step? Did it work on your machine? – ilyakhov Oct 13 '15 at 11:12
There is one basic difference in your steps and mine. I have made a duplicate of another dataframe. i.e. my actual data frame is like: comment=pd.DataFrame(feedbacks, columns=['date','id', 'rating','comment','photos', 'home_info', 'neighbourhood', 'other_comment', 'uid', 'sid']).... from this, have created a duplicate- comments=comment[['comment']]... and then used the tokenize on this as - df['tokenized_words'] = comments.apply(lambda row: nltk.word_tokenize(row['comment']), axis=1).... Getting the error message at the last step... – eclairs Oct 13 '15 at 12:27
You should either edit the current question ("how to use word_tokenize in data frame") or ask a new one, because the subject of your last comment is outside the scope of this question. – ilyakhov Oct 14 '15 at 08:47
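
A TypeError like the one reported in these comments is what a tokenizer raises when it is handed something that is not a string. In pandas that is commonly a missing value (NaN or None), although the thread never confirms the cause. Below is a minimal sketch of two ways to guard against it, assuming missing values are the problem. The data is made up, the column name `comment` follows the comment above, and `simple_tokenize` is a hypothetical stand-in for `nltk.word_tokenize`.

```python
import re

import pandas as pd


def simple_tokenize(text):
    """Hypothetical stand-in for nltk.word_tokenize (regex-based, no NLTK data needed)."""
    return re.findall(r"\w+|[^\w\s]", text)


# Made-up data: the second row is missing, as can happen after reading a file.
comments = pd.DataFrame({"comment": ["good work! keep it up", None, "not helpful"]})

# Option 1: replace missing values with an empty string before tokenizing.
comments["tokenized_words"] = comments["comment"].fillna("").apply(simple_tokenize)

# Option 2: drop rows without a comment, then tokenize what remains.
non_empty = comments.dropna(subset=["comment"])
tokens_only = non_empty["comment"].apply(simple_tokenize)

print(comments)
print(tokens_only)
```

Option 1 keeps every row and gives missing comments an empty token list; option 2 shortens the result, so its index no longer covers all original rows.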

Harsha Manjunath · Answer 2 · 2016-07-11T21:04:05.787

pandas.Series.apply is faster than pandas.DataFrame.apply:

import time

import nltk
import pandas as pd

df = pd.read_csv("/path/to/file.csv")

# Series.apply: the tokenizer is called on each string directly
start = time.time()
df["unigrams"] = df["verbatim"].apply(nltk.word_tokenize)
print("series.apply", time.time() - start)

# DataFrame.apply with axis=1: each row is wrapped in a Series first
start = time.time()
df["unigrams2"] = df.apply(lambda row: nltk.word_tokenize(row["verbatim"]), axis=1)
print("dataframe.apply", time.time() - start)

On a sample 125 MB CSV file (times in seconds):

series.apply 144.428858995

dataframe.apply 201.884778976

Edit: You might suspect that the dataframe df is larger after series.apply(nltk.word_tokenize), and that this slows down the subsequent dataframe.apply(nltk.word_tokenize).

Pandas appears to handle this case without extra cost: I got a similar runtime of about 200 s when running only dataframe.apply(nltk.word_tokenize) on its own.
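
To repeat this comparison without a large file, a small harness along the following lines may help. It is only a sketch: the data is synthetic, `simple_tokenize` is a hypothetical stand-in for `nltk.word_tokenize`, and the timings it prints will vary by machine and pandas version.

```python
import re
import timeit

import pandas as pd


def simple_tokenize(text):
    """Hypothetical stand-in for nltk.word_tokenize, used to keep the sketch self-contained."""
    return re.findall(r"\w+|[^\w\s]", text)


# Synthetic data; the column name "verbatim" follows the answer above.
df = pd.DataFrame({"verbatim": ["good work! keep it up"] * 2000})


def with_series_apply():
    return df["verbatim"].apply(simple_tokenize)


def with_dataframe_apply():
    return df.apply(lambda row: simple_tokenize(row["verbatim"]), axis=1)


# Both approaches should produce the same tokens; only the overhead differs.
same_result = with_series_apply().tolist() == with_dataframe_apply().tolist()

series_seconds = timeit.timeit(with_series_apply, number=3)
frame_seconds = timeit.timeit(with_dataframe_apply, number=3)

print("same result:", same_result)
print("series.apply   ", series_seconds)
print("dataframe.apply", frame_seconds)
```

Timing each function separately in this way also avoids the concern raised in the edit above, since neither run modifies df.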

score 3 · Answer 3 · answered Oct 13 '20 at 05:40

Here is an example. Suppose you have a data frame named twitter_df with sentiment and text columns. First, extract the text column as a Series:

 tweetText = twitter_df['text']

Then tokenize it:

 from nltk.tokenize import word_tokenize

 tweetText = tweetText.apply(word_tokenize)
 tweetText.head()

I think this will help you.

Bryce Chamberlain · Answer 4 · 2019-02-22T21:22:25.400

You may need to add str() to convert pandas' object type to a string.

Keep in mind a faster way to count words is often to count spaces.

Interestingly, the tokenizer counts periods as tokens. You may want to remove those first, and maybe numbers as well. The replace line below does that, which makes the two counts equal, at least in this case.

import nltk
import pandas as pd

sentences = pd.Series([ 
    'This is a very good site. I will recommend it to others.',
    'Can you please give me a call at 9983938428. have issues with the listings.',
    'good work! keep it up',
    'not a very helpful site in finding home decor. '
])

# remove anything but letters and spaces (regex=True is needed on pandas >= 2.0)
sentences = sentences.str.replace('[^A-Za-z ]', '', regex=True).str.replace(' +', ' ', regex=True).str.strip()

splitwords = [ nltk.word_tokenize( str(sentence) ) for sentence in sentences ]
print(splitwords)
    # output: [['This', 'is', 'a', 'very', 'good', 'site', 'I', 'will', 'recommend', 'it', 'to', 'others'], ['Can', 'you', 'please', 'give', 'me', 'a', 'call', 'at', 'have', 'issues', 'with', 'the', 'listings'], ['good', 'work', 'keep', 'it', 'up'], ['not', 'a', 'very', 'helpful', 'site', 'in', 'finding', 'home', 'decor']]

wordcounts = [ len(words) for words in splitwords ]
print(wordcounts)
    # output: [12, 13, 5, 9]

wordcounts2 = [ sentence.count(' ') + 1 for sentence in sentences ]
print(wordcounts2)
    # output: [12, 13, 5, 9]

If you aren't using pandas, you might not need str().
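
Counting spaces only gives the right answer once the text has been normalized, as the replace and strip line does. As a sketch of an alternative that needs no cleanup, `Series.str.split()` with no arguments splits on runs of whitespace. The sample sentences are taken from the question, and the variable names are illustrative.

```python
import pandas as pd

sentences = pd.Series([
    "good work! keep it up",
    "not a very helpful site in finding home decor. ",
])

# Counting spaces over-counts when a trailing (or doubled) space is present.
by_spaces = sentences.str.count(" ") + 1

# str.split() with no arguments splits on runs of whitespace and ignores
# leading/trailing whitespace, so no prior cleanup is needed.
by_split = sentences.str.split().str.len()

print(by_spaces.tolist())
print(by_split.tolist())
```

The second sentence ends with a space, so the space count reports 10 words where the split reports 9. Neither approach separates punctuation, so "work!" counts as one word, unlike with a tokenizer.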

score 1 · Answer 5 · answered Mar 01 '22 at 17:42

Make it faster using pandarallel

Using spaCy

import spacy
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)    
nlp = spacy.load("en_core_web_sm")

df['new_col'] = df['text'].parallel_apply(lambda x: nlp(x))

Using NLTK

import nltk
from pandarallel import pandarallel

pandarallel.initialize(progress_bar=True)

df['new_col'] = df['text'].parallel_apply(lambda x: nltk.word_tokenize(x))
