
I want to create a bag-of-words model and calculate relative frequencies with the nltk package. My data is in a pandas DataFrame.

Here is my data:

text    title   authors label
0   On Saturday, September 17 at 8:30 pm EST, an e...   Another Terrorist Attack in NYC…Why Are we STI...   ['View All Posts', 'Leonora Cravotta']  Real
1   Story highlights "This, though, is certain: to...   Hillary Clinton on police shootings: 'too many...   ['Mj Lee', 'Cnn National Politics Reporter']    Real
2   Critical Counties is a CNN series exploring 11...   Critical counties: Wake County, NC, could put ...   ['Joyce Tseng', 'Eli Watkins']  Real
3   McCain Criticized Trump for Arpaio’s Pardon… S...   NFL Superstar Unleashes 4 Word Bombshell on Re...   []  Real
4   Story highlights Obams reaffirms US commitment...   Obama in NYC: 'We all have a role to play' in ...   ['Kevin Liptak', 'Cnn White House Producer']    Real
5   Obama weighs in on the debate\n\nPresident Bar...   Obama weighs in on the debate   ['Brianna Ehley', 'Jack Shafer']    Real

I've tried converting it to strings:

import nltk 
import numpy as np
import random
import bs4 as bs
import re

data = df.astype(str)  # df is the DataFrame shown above; cast every column to str
data

However, when I try to tokenize the text, I get this error:

corpus = nltk.sent_tokenize(data['text'])

TypeError: expected string or bytes-like object

It doesn't seem to work :( Does anybody know how to tokenize the sentences in each row of the ['text'] column?

  • `data['text']` is a pandas Series, not a string. You should probably try something like `data['token_text'] = data['text'].apply(sent_tokenize)` to add the result of nltk tokenization as a new column. See https://stackoverflow.com/questions/44173624/how-to-apply-nltk-word-tokenize-library-on-a-pandas-dataframe-for-twitter-data for a probable duplicate. – Beinje Mar 26 '20 at 09:10
  • I tried, but I got this error: NameError: name 'sent_tokenize' is not defined, even though I had imported the nltk library @Beinje – Rosy Indah Permatasari Mar 26 '20 at 09:15
  • According to the nltk documentation, the `sent_tokenize` function is part of the `nltk.tokenize` module, so you need to replace `nltk.sent_tokenize()` with `nltk.tokenize.sent_tokenize()` (see the sketch after these comments). – Beinje Mar 26 '20 at 09:32
  • Do you know how to tokenize the words from a pandas DataFrame without creating a new column? I am quite confused. (Sorry, I am still new to Python.) @Beinje – Rosy Indah Permatasari Mar 26 '20 at 09:57
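
Putting the two suggestions above together: import `sent_tokenize` from `nltk.tokenize` (or call it via its full path) and apply it row by row. A minimal sketch, assuming the DataFrame is named `df` as in the question; the column name `token_text` is just an example:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')  # one-time download of the Punkt sentence tokenizer models

# apply() hands each row's string to the tokenizer one at a time,
# which avoids the TypeError raised when the whole Series is passed in
df['token_text'] = df['text'].astype(str).apply(sent_tokenize)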

1 Answer


The nltk tokenizers require the input to be a string; you are getting the error because you are passing a pandas.Series object directly.

Try this to tokenize into words:

data['Corpus'] = df.text.apply(lambda x: nltk.word_tokenize(x))

For sentence tokenization, use sent_tokenize:

data['Sent'] = df.text.apply(lambda x: nltk.sent_tokenize(x))

If you also want to get rid of the punctuation:

data['no_punc'] = df.text.apply(lambda x: nltk.RegexpTokenizer(r'\w+').tokenize(x))
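
Since the original goal was a bag-of-words model with relative frequencies, one way to continue from the tokenized column is with `nltk.FreqDist`, whose `freq()` method returns a word's count divided by the total number of tokens. A sketch assuming the `no_punc` column from above exists; the column names `freq` and `rel_freq` are made up for illustration:

from nltk import FreqDist

# one frequency distribution per row, built from the punctuation-free tokens
data['freq'] = data['no_punc'].apply(lambda tokens: FreqDist(t.lower() for t in tokens))

# relative frequency of each word = its count / total tokens in that row
data['rel_freq'] = data['freq'].apply(lambda fd: {word: fd.freq(word) for word in fd})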
ManojK
  • Do you know how to tokenize the words from a pandas DataFrame without creating a new column? I am quite confused. (Sorry, I am still new to Python.) – Rosy Indah Permatasari Mar 26 '20 at 09:57
  • Just apply it to the existing column like this - `data['text'] = df.text.apply(lambda x: nltk.word_tokenize(x))` – ManojK Mar 26 '20 at 10:00
  • As I said in my comments, `nltk.tokenize` is a package, not a function, and you should call `nltk.tokenize.word_tokenize()`/`nltk.tokenize.sent_tokenize()`, not `nltk.word_tokenize()`/`nltk.sent_tokenize()` – Beinje Mar 26 '20 at 10:08
  • @Beinje - Yes, that is one way to call it, but not the only one; you can simply apply it to the pandas DataFrame directly, as shown in my answer (and in the snippet after these comments). You can read further [here](https://stackoverflow.com/questions/33098040/how-to-use-word-tokenize-in-data-frame) – ManojK Mar 26 '20 at 10:15
  • @RosyIndahPermatasari - I hope the solution worked for applying it to the existing column without creating a new one? – ManojK Mar 26 '20 at 10:16
  • @manojk I didn't know that - useful yet confusing (I would have expected it to throw an error)! – Beinje Mar 26 '20 at 10:27
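
For reference, both spellings work because, as far as I know, nltk re-exports the tokenizers at the top level of the package, so a check like the following should pass:

import nltk
from nltk.tokenize import word_tokenize

sample = "First sentence. Second sentence."
# all three spellings should produce the same token list
assert nltk.word_tokenize(sample) == nltk.tokenize.word_tokenize(sample) == word_tokenize(sample)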