
I am a fairly new Python user and I work mainly with imported text files, especially CSVs, which give me headaches to process. I tried reading docs like this one: https://docs.python.org/2/howto/unicode.html but I can't make sense of what is being said. I just want a straight, down-to-earth explanation.

For instance, I want to tokenize a large number of verbatims exported from the internet as a CSV file. I want to use NLTK's tokenizer to do so.

Here's my code:

import csv
import nltk

# unicode_csv_reader is the helper recipe from the Python 2 csv module docs
with open('verbatim.csv', 'r') as csvfile:
    reader = unicode_csv_reader(csvfile, dialect=csv.excel)
    for data in reader:
        tokens = nltk.word_tokenize(data)

When I print() data I get clean text.

But when I call the tokenizer, it raises the following error:

'ascii' codec can't decode byte 0xe9 in position 31: ordinal not in range(128)

It looks like an encoding problem, and it's always the same problem with every little manipulation of text I do. Can you help me with this?
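For reference, the error can be reproduced without NLTK at all; it is Python 2's implicit ASCII decoding of byte strings at work. A minimal sketch (the sample bytes below are made up, and the decode is run explicitly rather than through NLTK):

```python
# Byte string containing 0xe9, which is 'é' in latin-1 but invalid ASCII
raw = b'caf\xe9 cr\xe8me'

# This explicit decode is what Python 2 does implicitly when mixing
# byte strings with unicode, e.g. inside nltk.word_tokenize
try:
    raw.decode('ascii')
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xe9 in position 3 ...

# Decoding with the right codec succeeds
assert raw.decode('latin-1') == u'caf\xe9 cr\xe8me'
```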

Stefanus
Nahid O.
  • where is the error? when reading the csv? or when tokenizing? I'm guessing you are using python 2? – Ale Apr 01 '16 at 15:39
  • 1
    Already answered in: http://stackoverflow.com/questions/904041/reading-a-utf8-csv-file-with-python – alexisdevarennes Apr 01 '16 at 15:40
  • You can use https://pypi.python.org/pypi/unicodecsv: replace csv with unicodecsv, and done :-) – Ale Apr 01 '16 at 15:42
  • Yes, I am using Python 2.7. The error shows up whenever I try to work with NLTK. For instance, the tokenizer will work for the first few lines of text, but I guess it crashes as soon as there is a special character (accents...) – Nahid O. Apr 01 '16 at 15:42
  • Switch to Python 3. Python 2 is notoriously bad for NLP tasks; NLTK has supported Python 3 well since version 3.0. In Python 3, you open the file with an explicit encoding and `csv.reader` yields text, [as shown in an answer to the duplicate target](http://stackoverflow.com/a/14786752/918959). – Antti Haapala -- Слава Україні Apr 01 '16 at 21:09
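The Python 3 route from the last comment can be sketched as follows (a minimal illustration; the hypothetical sample rows in an io.StringIO stand in for opening the real file with open('verbatim.csv', encoding='utf-8', newline='')):

```python
import csv
import io

# Stand-in for open('verbatim.csv', encoding='utf-8', newline='');
# the two sample rows are made up
sample = io.StringIO(u'premi\u00e8re phrase\ndeuxi\u00e8me phrase\n')

reader = csv.reader(sample, dialect=csv.excel)
rows = list(reader)

# In Python 3 every field is already str (unicode), so a tokenizer
# receives text rather than raw bytes
assert rows[0] == [u'premi\u00e8re phrase']
assert rows[1] == [u'deuxi\u00e8me phrase']
```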

2 Answers


This should do it:

with open('verbatim.csv') as csvfile:  # no need to pass 'r'; it is the default mode
    reader = csv.reader(csvfile, dialect=csv.excel)
    for row in reader:
        # each row is a list of byte strings; decode each field to unicode
        # before tokenizing, since word_tokenize expects a single string
        tokens = [nltk.word_tokenize(unicode(field, 'utf-8')) for field in row]

Otherwise you can also try:

import codecs

with codecs.open('verbatim.csv', encoding='utf-8') as csvfile:
    reader = unicode_csv_reader(csvfile, dialect=csv.excel)
    for row in reader:
        # each row is a list of unicode fields; tokenize them one by one
        tokens = [nltk.word_tokenize(field) for field in row]
alexisdevarennes

First you have to understand that, in Python 2, str and unicode are two different types.

There is a lot of documentation and great talks about the subject. I think this is one of the best: https://www.youtube.com/watch?v=sgHbC6udIqc

If you are going to work with text you should really understand the differences.

Overly simplified: str is a sequence of bytes; unicode is a sequence of "characters" (code points). To get a sequence of bytes, you encode the unicode object with an encoding.

Yes, it's complicated. My suggestion: watch the video.
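The distinction can be shown in a few lines (a sketch; the sample word is arbitrary, and the syntax below runs under both Python 2 and 3):

```python
s = u'pap\u00e1'        # unicode: a sequence of code points
b = s.encode('utf-8')   # encode -> a sequence of bytes (str in Python 2)

# The single character 'á' becomes two bytes in UTF-8
assert b == b'pap\xc3\xa1'

# Decoding with the same codec round-trips back to unicode
assert b.decode('utf-8') == s

# Decoding with the wrong codec raises the familiar error
try:
    b.decode('ascii')
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc3 in position 3 ...
```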

I'm not sure what your unicode_csv_reader does, but I'm guessing the problem is there, as nltk works with unicode. So I'm guessing that in unicode_csv_reader you are trying to encode/decode something with the wrong codec.

In [1]: import nltk

In [2]: nltk.word_tokenize(u'mi papá tiene 100 años')
Out[2]: [u'mi', u'pap\xe1', u'tiene', u'100', u'a\xf1os']

I would use the unicodecsv package from PyPI, which returns a list of unicode objects for each row that you can pass to nltk.

import csv
import nltk
import unicodecsv

with open('verbatim.csv', 'rb') as csvfile:  # unicodecsv wants a byte stream
    reader = unicodecsv.reader(csvfile, dialect=csv.excel, encoding='iso-8859-1')
    for row in reader:
        # row is a list of unicode fields; tokenize each field separately
        tokens = [nltk.word_tokenize(field) for field in row]

You can provide an encoding to the reader, and there is no need to use codecs to open the file.

Ale
  • Thanks for all your answers. I have not resolved my problem with any of your advice, but I will dig deeper into the subject. – Nahid O. Apr 04 '16 at 12:54
  • @NahidO. post your data somewhere, maybe if we can take a look we can help. – Ale Apr 04 '16 at 17:18
  • Thanks a lot ! Here is a sample data : https://www.dropbox.com/s/890mu8y9mq3cxw7/verbatim%20-%20stackoverflow.csv?dl=0 – Nahid O. Apr 05 '16 at 07:23
  • Your data is encoded in iso-8859-1, so check the update to my answer. – Ale Apr 05 '16 at 12:46