
I have a CSV file with three columns, and I want to loop through the contents of the column 'text' and tokenize every cell in it (splitting it into runs of letters and apostrophes).

This does not seem to work:

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer("[a-zA-Z'`éèî]+")
for x in data['text']:
    x = tokenizer.tokenize(x)

The error I get is `TypeError: expected string or bytes-like object`.

Mbaps
  • What is the content of `data['text']`? If it's neither a string nor `bytes`, what is it? Try `type(x)` before the line that tokenizes. – alexis Oct 16 '17 at 19:06
  • Firstly, use Python3, then try `data['text'] = data['text'].astype(str)` – alvas Oct 17 '17 at 02:17
  • See also https://stackoverflow.com/questions/44173624/how-to-nltk-word-tokenize-to-a-pandas-dataframe-for-twitter-data/44174565#44174565 – alvas Oct 17 '17 at 03:31
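
A minimal sketch of the diagnostic and fix the comments suggest, assuming `data` is a pandas DataFrame loaded with `read_csv` (the file name `file.csv` is hypothetical; the question does not name it). The usual cause of this error is empty cells, which pandas loads as `NaN` (a float), and the `astype(str)` cast from the second comment is the corresponding fix:

import pandas as pd
from nltk.tokenize import RegexpTokenizer

data = pd.read_csv("file.csv")  # hypothetical file name

# Inspect the cell types, as the first comment suggests: NaN cells load
# as float, which raises "expected string or bytes-like object".
print(data['text'].map(type).value_counts())

# Cast every cell to str before tokenizing, per the second comment.
data['text'] = data['text'].astype(str)

tokenizer = RegexpTokenizer("[a-zA-Z'`éèî]+")
tokens = [tokenizer.tokenize(x) for x in data['text']]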

1 Answer


From the documentation:

Caution: when tokenizing a Unicode string, make sure you are not using an encoded version of the string (it may be necessary to decode it first, e.g. with s.decode("utf8")).

So try:

tokenizer = RegexpTokenizer("[a-zA-Z'`éèî]+")
for x in data['text']:
    x = tokenizer.tokenize(x.decode("utf8"))
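
For completeness, a runnable sketch of this approach under one assumption not stated in the answer: on Python 3 only `bytes` objects have `.decode`, so plain `str` cells are passed through unchanged, and the results are collected rather than discarded each iteration.

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer("[a-zA-Z'`éèî]+")
tokens = []
for x in data['text']:
    # Decode only encoded bytes objects; str has no .decode in Python 3.
    if isinstance(x, bytes):
        x = x.decode("utf8")
    tokens.append(tokenizer.tokenize(x))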