I have a question about a pandas/NLTK issue.

My dataframe looks like the following:

Name    Age     Text
Anne    23     "foo you"
Joan    20     "woo you"
Marie   28     "boo you"
John    31     "moo you"
Mark    37     "loo you"

I need to compute a new column, using the NLTK Python library, that looks like the following:

Name    Age     Text        Tokens
Anne    23    "foo you"      ['foo','you']
Joan    20    "woo you"      ['woo','you']
Marie   28    "boo you"      ['boo','you']
John    31    "moo you"      ['moo','you']
Mark    37    "loo you"      ['loo','you']

I'm using the following code:

df['Tokens'] = nltk.word_tokenize(df['Text'])

But I get an error, because it stores one token per row instead of putting all of a row's tokens in that row.

Any help will be welcome.

Thank you very much in advance.

HRDSL

1 Answer

nltk.word_tokenize expects a single string, not a whole Series, so apply it to each row:

df['Tokens'] = df['Text'].str.replace('"', '').apply(nltk.word_tokenize)

    Name    Age Text        Tokens
0   Anne    23  "foo you"   ['foo', 'you']
1   Joan    20  "woo you"   ['woo', 'you']
2   Marie   28  "boo you"   ['boo', 'you']
3   John    31  "moo you"   ['moo', 'you']
4   Mark    37  "loo you"   ['loo', 'you']
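
If you want something you can run end to end, here is a minimal sketch that rebuilds the example frame from the question. Two things in it are my assumptions rather than part of the original code: the Text values are assumed to contain literal double quotes (hence the replace step), and I swapped in `wordpunct_tokenize`, a regex-based NLTK tokenizer that needs no downloaded data, as a stand-in. `nltk.word_tokenize` is passed to `apply` in exactly the same way, but it needs the Punkt resources to be downloaded first.

```python
import pandas as pd
from nltk.tokenize import wordpunct_tokenize  # stand-in for nltk.word_tokenize

# Example frame from the question; the embedded quotes are an assumption.
df = pd.DataFrame({
    "Name": ["Anne", "Joan", "Marie", "John", "Mark"],
    "Age": [23, 20, 28, 31, 37],
    "Text": ['"foo you"', '"woo you"', '"boo you"', '"moo you"', '"loo you"'],
})

# Remove the literal quotes, then tokenize one cell at a time.
# apply() calls the tokenizer once per row and stores the returned list in that row.
df["Tokens"] = (
    df["Text"]
    .str.replace('"', "", regex=False)
    .apply(wordpunct_tokenize)
)

print(df)
```

The point is that the tokenizer only ever sees one string at a time, so each row ends up holding its own list of tokens.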

help-ukraine-now