
I have a CSV file with three columns, and I want to loop through the contents of the column 'text' and tokenize every cell in it (splitting it into runs of letters and apostrophes).

This does not seem to work:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer("[a-zA-Z'`éèî]+")
for x in data['text']:
    x = tokenizer.tokenize(x)

The error I get is `TypeError: expected string or bytes-like object`.

Mbaps
  • What is the content of `data['text']`? If it's neither a string nor `bytes`, what is it? Try `type(x)` before the line that tokenizes. – alexis Oct 16 '17 at 19:06
  • Firstly, use Python3, then try `data['text'] = data['text'].astype(str)` – alvas Oct 17 '17 at 02:17
  • See also https://stackoverflow.com/questions/44173624/how-to-nltk-word-tokenize-to-a-pandas-dataframe-for-twitter-data/44174565#44174565 – alvas Oct 17 '17 at 03:31
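
A minimal sketch of the diagnostic and fix the comments suggest, assuming `data` is a pandas DataFrame loaded with `read_csv` (the file name `file.csv` is hypothetical; the question does not name it). The usual cause of this error is empty cells, which pandas loads as `NaN` (a float), and the `astype(str)` cast from the second comment is the corresponding fix:

import pandas as pd
from nltk.tokenize import RegexpTokenizer

data = pd.read_csv("file.csv")  # hypothetical file name

# Inspect the cell types, as the first comment suggests: NaN cells load
# as float, which raises "expected string or bytes-like object".
print(data['text'].map(type).value_counts())

# Cast every cell to str before tokenizing, per the second comment.
data['text'] = data['text'].astype(str)

tokenizer = RegexpTokenizer("[a-zA-Z'`éèî]+")
tokens = [tokenizer.tokenize(x) for x in data['text']]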

1 Answer


From the documentation:

Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).

So try:

tokenizer = RegexpTokenizer("[a-zA-Z'`éèî]+")
for x in data['text']:
    x = tokenizer.tokenize(x.decode("utf8"))
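
For completeness, a runnable sketch of this approach under one assumption not stated in the answer: on Python 3 only `bytes` objects have `.decode`, so plain `str` cells are passed through unchanged, and the results are collected rather than discarded each iteration.

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer("[a-zA-Z'`éèî]+")
tokens = []
for x in data['text']:
    # Decode only encoded bytes objects; str has no .decode in Python 3.
    if isinstance(x, bytes):
        x = x.decode("utf8")
    tokens.append(tokenizer.tokenize(x))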