
I am a fairly new Python user and I work mainly with imported text files, especially CSVs, which give me headaches to process. I tried reading docs like this one: https://docs.python.org/2/howto/unicode.html but I can't make sense of what is being said. I just want a straight, down-to-earth explanation.

For instance, I want to tokenize a large number of verbatims exported from the internet as a CSV file. I want to use NLTK's tokenizer to do so.

Here's my code:

import csv
import nltk

# unicode_csv_reader is the helper recipe from the Python 2 csv module docs
with open('verbatim.csv', 'r') as csvfile:
    reader = unicode_csv_reader(csvfile, dialect=csv.excel)
    for data in reader:
        tokens = nltk.word_tokenize(data)

When I print() data I get clean text.

But when I call the tokenizer, it raises the following error:

'ascii' codec can't decode byte 0xe9 in position 31: ordinal not in range(128)

It looks like an encoding problem, and it's always the same problem with every little manipulation of text I do. Can you help me with this?
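For reference, the error can be reproduced without NLTK at all; it is Python 2's implicit ASCII decoding of byte strings at work. A minimal sketch (the sample bytes below are made up, and the decode is run explicitly rather than through NLTK):

```python
# Byte string containing 0xe9, which is 'é' in latin-1 but invalid ASCII
raw = b'caf\xe9 cr\xe8me'

# This explicit decode is what Python 2 does implicitly when mixing
# byte strings with unicode, e.g. inside nltk.word_tokenize
try:
    raw.decode('ascii')
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xe9 in position 3 ...

# Decoding with the right codec succeeds
assert raw.decode('latin-1') == u'caf\xe9 cr\xe8me'
```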

Stefanus
Nahid O.
  • where is the error? when reading the csv? or when tokenizing? I'm guessing you are using python 2? – Ale Apr 01 '16 at 15:39
  • 1
    Already answered in: http://stackoverflow.com/questions/904041/reading-a-utf8-csv-file-with-python – alexisdevarennes Apr 01 '16 at 15:40
  • You can use https://pypi.python.org/pypi/unicodecsv: replace csv with unicodecsv, and done :-) – Ale Apr 01 '16 at 15:42
  • Yes, I am using Python 2.7. The error shows up whenever I try to work with NLTK. For instance, the tokenizer will work for the first few lines of text, but I guess it crashes as soon as there is a special character (accents...) – Nahid O. Apr 01 '16 at 15:42
  • Switch to Python 3. Python 2 is notoriously bad for NLP tasks; NLTK has supported Python 3 well since version 3.0. In Python 3, you open the file with an explicit encoding and `csv.reader` yields text, [as shown in an answer to the duplicate target](http://stackoverflow.com/a/14786752/918959). – Antti Haapala -- Слава Україні Apr 01 '16 at 21:09
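The Python 3 route from the last comment can be sketched as follows (a minimal illustration; the hypothetical sample rows in an io.StringIO stand in for opening the real file with open('verbatim.csv', encoding='utf-8', newline='')):

```python
import csv
import io

# Stand-in for open('verbatim.csv', encoding='utf-8', newline='');
# the two sample rows are made up
sample = io.StringIO(u'premi\u00e8re phrase\ndeuxi\u00e8me phrase\n')

reader = csv.reader(sample, dialect=csv.excel)
rows = list(reader)

# In Python 3 every field is already str (unicode), so a tokenizer
# receives text rather than raw bytes
assert rows[0] == [u'premi\u00e8re phrase']
assert rows[1] == [u'deuxi\u00e8me phrase']
```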

2 Answers


This should do it:

with open('verbatim.csv') as csvfile:  # no need to pass 'r'; it is the default mode
    reader = csv.reader(csvfile, dialect=csv.excel)
    for row in reader:
        # each row is a list of byte strings; decode each field to unicode
        # before tokenizing, since word_tokenize expects a single string
        tokens = [nltk.word_tokenize(unicode(field, 'utf-8')) for field in row]

Otherwise you can also try:

import codecs

with codecs.open('verbatim.csv', encoding='utf-8') as csvfile:
    reader = unicode_csv_reader(csvfile, dialect=csv.excel)
    for row in reader:
        # each row is a list of unicode fields; tokenize them one by one
        tokens = [nltk.word_tokenize(field) for field in row]
alexisdevarennes

First you have to understand that, in Python 2, str and unicode are two different types.

There is a lot of documentation and great talks about the subject. I think this is one of the best: https://www.youtube.com/watch?v=sgHbC6udIqc

If you are going to work with text you should really understand the differences.

Overly simplified: str is a sequence of bytes; unicode is a sequence of "characters" (code points). To get a sequence of bytes, you encode the unicode object with an encoding.

Yes, it's complicated. My suggestion: watch the video.
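The distinction can be shown in a few lines (a sketch; the sample word is arbitrary, and the syntax below runs under both Python 2 and 3):

```python
s = u'pap\u00e1'        # unicode: a sequence of code points
b = s.encode('utf-8')   # encode -> a sequence of bytes (str in Python 2)

# The single character 'á' becomes two bytes in UTF-8
assert b == b'pap\xc3\xa1'

# Decoding with the same codec round-trips back to unicode
assert b.decode('utf-8') == s

# Decoding with the wrong codec raises the familiar error
try:
    b.decode('ascii')
except UnicodeDecodeError as e:
    print(e)  # 'ascii' codec can't decode byte 0xc3 in position 3 ...
```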

I'm not sure what your unicode_csv_reader does, but I'm guessing the problem is there, as nltk works with unicode. So I'm guessing that in unicode_csv_reader you are trying to encode/decode something with the wrong codec.

In [1]: import nltk

In [2]: nltk.word_tokenize(u'mi papá tiene 100 años')
Out[2]: [u'mi', u'pap\xe1', u'tiene', u'100', u'a\xf1os']

I would use the unicodecsv package from PyPI, which returns a list of unicode objects for each row that you can pass to nltk.

import csv
import nltk
import unicodecsv

with open('verbatim.csv', 'rb') as csvfile:  # unicodecsv wants a byte stream
    reader = unicodecsv.reader(csvfile, dialect=csv.excel, encoding='iso-8859-1')
    for row in reader:
        # row is a list of unicode fields; tokenize each field separately
        tokens = [nltk.word_tokenize(field) for field in row]

You can provide an encoding to the reader, and there is no need to use codecs to open the file.

Ale
  • Thanks for all your answers. I have not resolved my problem with any of your advice, but I will dig deeper into the subject. – Nahid O. Apr 04 '16 at 12:54
  • @NahidO. post your data somewhere, maybe if we can take a look we can help. – Ale Apr 04 '16 at 17:18
  • Thanks a lot ! Here is a sample data : https://www.dropbox.com/s/890mu8y9mq3cxw7/verbatim%20-%20stackoverflow.csv?dl=0 – Nahid O. Apr 05 '16 at 07:23
  • Your data is encoded in iso-8859-1, so check the update to my answer. – Ale Apr 05 '16 at 12:46