
My df looks like this:

team_name   text
---------   ----
red         this is text from red team
blue        this is text from blue team
green       this is text from green team
yellow      this is text from yellow team

I am trying to get this:

team_name   text                             text_token
---------   ----                             ----------
red         this is text from red team       'this', 'is', 'text', 'from', 'red', 'team'
blue        this is text from blue team      'this', 'is', 'text', 'from', 'blue', 'team'
green       this is text from green team     'this', 'is', 'text', 'from', 'green', 'team'
yellow      this is text from yellow team    'this', 'is', 'text', 'from', 'yellow', 'team'

What have I tried?

df['text_token'] = nltk.word_tokenize(df['text'])

That does not work. How do I achieve my desired result? Also, is it possible to compute a frequency distribution over the tokens?

floss
  • https://stackoverflow.com/questions/44173624/how-to-apply-nltk-word-tokenize-library-on-a-pandas-dataframe-for-twitter-data and https://stackoverflow.com/questions/33098040/how-to-use-word-tokenize-in-data-frame – Joe Ferndz Jan 03 '21 at 02:48
  • `df['text_token'] = df.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)` – Joe Ferndz Jan 03 '21 at 02:50
  • Does this answer your question? [how to use word_tokenize in data frame](https://stackoverflow.com/questions/33098040/how-to-use-word-tokenize-in-data-frame) – Lydia van Dyke Jan 03 '21 at 09:27

1 Answer


Stack Overflow has a few examples for you to look into.

This has already been solved here: how to use word_tokenize in data frame

df['text_token'] = df.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)
Joe Ferndz
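
For context, here is a minimal, self-contained sketch of that approach, with the DataFrame rebuilt from the example in the question. The frequency-distribution part at the end is only one possible way to handle that follow-up (using nltk.FreqDist) and is not taken from the answer above.

import nltk
import pandas as pd

# word_tokenize needs the punkt tokenizer models; download them once
# if they are not already installed (newer NLTK releases may ask for
# 'punkt_tab' instead).
nltk.download('punkt')

# DataFrame rebuilt from the question's example
df = pd.DataFrame({
    'team_name': ['red', 'blue', 'green', 'yellow'],
    'text': [
        'this is text from red team',
        'this is text from blue team',
        'this is text from green team',
        'this is text from yellow team',
    ],
})

# word_tokenize expects a single string, not a whole Series, which is
# why the original attempt failed; apply it row by row instead.
df['text_token'] = df.apply(lambda row: nltk.word_tokenize(row['text']), axis=1)

# One way to get a frequency distribution: flatten all the token lists
# and feed them to FreqDist, which accepts any iterable of tokens.
all_tokens = [token for tokens in df['text_token'] for token in tokens]
freq = nltk.FreqDist(all_tokens)
print(freq.most_common(5))

A slightly shorter equivalent for the tokenizing step is df['text_token'] = df['text'].apply(nltk.word_tokenize), since apply on a Series passes each cell (a single string) straight to word_tokenize.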