
I have a pandas dataframe df of the form:

df = pd.DataFrame.from_dict({'ID': [1, 2, 3],
                             'Strings': ['Hello, how are you?', 'Nice to meet you!', 'My name is John.']})

I want to tokenize the Strings column and create a new data frame new_df:

Sentence    Word
   0        Hello
   0        ,
   0        how
   0        are
   0        you
   0        ?
   1        Nice
   1        to
   1        meet
   1        you
   1        .
   2        My
   2        name
   2        is
   2        John
   2        .

I know that for tokenization I can use nltk.word_tokenize() on every string in df, but how do I get from that point to new_df efficiently?

Melsauce

2 Answers


You can do this with map and stack:

import nltk
pd.DataFrame(df.Strings.map(nltk.word_tokenize).tolist(), index=df.ID).stack()

To clean up the index, use reset_index.

(pd.DataFrame(df.Strings.map(nltk.word_tokenize).tolist(), index=df.ID)
   .stack()
   .reset_index(level=1, drop=True)
   .reset_index(name='Word'))

    ID   Word
0    1  Hello
1    1      ,
2    1    how
3    1    are
4    1    you
5    1      ?
6    2   Nice
7    2     to
8    2   meet
9    2    you
10   2      !
11   3     My
12   3   name
13   3     is
14   3   John
15   3      .
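As a self-contained variant of the same idea, newer pandas versions (0.25+) also offer `Series.explode`. The sketch below swaps in a simple regex tokenizer for `nltk.word_tokenize` so it runs without the NLTK punkt data; the regex is an assumption, not NLTK's actual tokenizer:

```python
import re
import pandas as pd

df = pd.DataFrame({'ID': [1, 2, 3],
                   'Strings': ['Hello, how are you?', 'Nice to meet you!', 'My name is John.']})

def simple_tokenize(s):
    # stand-in for nltk.word_tokenize: runs of word characters, or single punctuation marks
    return re.findall(r"\w+|[^\w\s]", s)

out = (df.assign(Word=df['Strings'].map(simple_tokenize))
         .explode('Word')            # one row per token, ID repeated per row
         .loc[:, ['ID', 'Word']]
         .reset_index(drop=True))
print(out)
```

With `nltk` installed, `simple_tokenize` can be replaced by `nltk.word_tokenize` directly.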
cs95

After tokenizing with nltk, the problem becomes unnesting the list column:

df.Strings = df.Strings.map(nltk.word_tokenize).tolist()

unnesting(df,['Strings'])
Out[22]: 
  Strings  ID
0   Hello   1
0       ,   1
0     how   1
0     are   1
0     you   1
0       ?   1
1    Nice   2
1      to   2
1    meet   2
1     you   2
1       !   2
2      My   3
2    name   3
2      is   3
2    John   3
2       .   3
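Note that `unnesting` is a user-defined helper, not a pandas built-in; it is not defined in this answer. A minimal sketch of one common implementation, assuming every column named in `explode` holds equal-length lists per row, is:

```python
import numpy as np
import pandas as pd

def unnesting(df, explode):
    # repeat each row's index once per element of its list in the first exploded column
    idx = df.index.repeat(df[explode[0]].str.len())
    # flatten every list-valued column into one long flat column
    df1 = pd.concat([pd.DataFrame({x: np.concatenate(df[x].values)})
                     for x in explode], axis=1)
    df1.index = idx
    # re-attach the remaining scalar columns by joining on the repeated index
    return df1.join(df.drop(columns=explode), how='left')
```

Calling `unnesting(df, ['Strings'])` after the tokenization step then yields one row per token, with the original `ID` repeated.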
BENY