Trouble on Unicode encoded data in Python

Question

Hello StackOverflow community.

I am a fairly new user of Python, so sorry in advance for the silliness of this question! I have been trying to fix this for hours but still have not figured it out.

I am trying to import a large dataset of text to manipulate it in Python.

The dataset is a .csv file, and I have had trouble reading it because of encoding problems.

I have tried re-encoding it as UTF-8 with Notepad++, and I have tried csv.reader from Python's csv module.

Here is an example of my code:

import csv
with open('twitter_test_python.csv') as csvfile:
    #for file5 in csvfile:
    #    file5.readline()
    #csvfile = csvfile.encode('utf-8')
    spamreader = csv.reader(csvfile, delimiter=str(','), quotechar=str('|'))
    for row in spamreader:
        row = " ".join(row)
        row2= str.split(row)
    listsw = []
    for mots in row2:
        if mots not in sw:
            del mots
    print row2

But when I import my data into Python I still have encoding problems (accents, etc.), whichever method I use.

How can I encode my data so that Python reads it properly?

Thanks !

*I still have encoding problems* means exactly nothing! Say what happens exactly and what is expected. — Serge Ballesta, Mar 21 '16 at 13:12
Here is an example of a list from my data : [u"En vrai j'en ai marre j'ai une poste \xe0 3min de chez moi et le postier il d\xe9cide de mettre mon colis dans une poste que je connais pas"] . — Nahid O., Mar 21 '16 at 13:48
I want to have that : [En vrai j'en ai marre j'ai une poste à 3min de chez moi et le postier il décide de mettre mon colis dans une poste que je connais pas] — Nahid O., Mar 21 '16 at 13:54
Then, *pas de problème*. When I type `print u"En vrai j'en ai marre j'ai une poste \xe0 3min de chez moi et le postier il d\xe9cide de mettre mon colis dans une poste que je connais pas"` on IDLE I get correctly `En vrai j'en ai marre j'ai une poste à 3min de chez moi et le postier il décide de mettre mon colis dans une poste que je connais pas`. It means that your data is a correct unicode string containing the correct unicode accented characters. Said differently, you have no encoding problem when reading the data but you may have when displaying it. — Serge Ballesta, Mar 21 '16 at 14:30
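To illustrate the point made in the comment above, here is a minimal sketch showing that the escaped form `\xe0` and the literal character à denote the same text. The variable names are made up for the illustration, and the sentence is shortened from the one in the question.

```python
# -*- coding: utf-8 -*-
# The same text written twice: once with escapes, once with literal accents.
escaped = u"une poste \xe0 3min, il d\xe9cide"
literal = u"une poste à 3min, il décide"

# Both spellings produce the same unicode string.
same_text = (escaped == literal)

# \xe0 is just a notation for the single code point U+00E0 (à).
code_point = ord(u"\xe0")

# Encoding to UTF-8 turns that one character into two bytes.
utf8_bytes = u"\xe0".encode("utf-8")
```

So the escapes are only how Python 2 displays a unicode string inside a list; the stored data is unchanged.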

score 0 · Answer 1 · answered Mar 21 '16 at 12:42

The Python 2 csv module documentation provides an example of how to deal with Unicode:

import csv,codecs,cStringIO

class UTF8Recoder:
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)
    def __iter__(self):
        return self
    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)
    def next(self):
        '''next() -> unicode
        This function reads and returns the next line as a Unicode string.
        '''
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]
    def __iter__(self):
        return self

class UnicodeWriter:
    def __init__(self, f, dialect=csv.excel, encoding="utf-8-sig", **kwds):
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()
    def writerow(self, row):
        '''writerow(unicode) -> None
        This function takes a Unicode string and encodes it to the output.
        '''
        self.writer.writerow([s.encode("utf-8") for s in row])
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        data = self.encoder.encode(data)
        self.stream.write(data)
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

with open('twitter_test_python.csv','rb') as fin:
    reader = UnicodeReader(fin)
    for line in reader:
        #do stuff
        print line
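For comparison only: the question uses Python 2, but if Python 3 is an option (an assumption, not something the question states), the csv module reads text directly and the recoder classes above are not needed. The sketch below writes a small hypothetical sample file so that it is self-contained; `read_rows` and the file contents are made up for the illustration.

```python
import csv
import io
import os
import tempfile

# Hypothetical sample file standing in for twitter_test_python.csv.
path = os.path.join(tempfile.mkdtemp(), "sample.csv")
with io.open(path, "w", encoding="utf-8", newline="") as f:
    f.write(u"id,text\r\n1,il d\xe9cide \xe0 3min\r\n")


def read_rows(csv_path, encoding="utf-8-sig"):
    """Return every row as a list of str (Python 3 only).

    "utf-8-sig" also accepts files without a BOM, so it is a
    convenient default for files saved by Windows editors.
    """
    # newline="" lets the csv module handle line endings itself.
    with io.open(csv_path, encoding=encoding, newline="") as f:
        return list(csv.reader(f))


rows = read_rows(path)
```

On Python 2 this sketch does not apply, because `csv.reader` there expects byte strings; the classes above remain the documented route.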

score 0 · Answer 2 · edited May 23 '17 at 12:24

Alexey Smirnov's answer is elegant but maybe a bit complicated for a beginner. So let me give an example closer to the code in the question.

When you read files with Python 2 you get the content as str, not unicode. You probably want to convert it as soon as possible. However, the documentation of the csv module says "This version of the csv module doesn’t support Unicode input." So you should decode the output of csv.reader, not the input. Inserting that into your code results in:

import csv
with open('twitter_test_python.csv') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=str(','), quotechar=str('|'))
    for row in spamreader:
        row = " ".join(row)
        row = unicode(row, encoding="utf-8")
        row2 = row.split()

However, you might want to consider whether joining the cells just to split them again is really what you want. Without that step the code would look like the following. Note that the result is different if the cells contain spaces.

import csv
with open('twitter_test_python.csv') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=str(','), quotechar=str('|'))
    for row in spamreader:
        row2 = list(unicode(cell, encoding="utf-8") for cell in row)
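To make the difference between the two variants concrete, here is a small sketch with a hypothetical, already decoded row (the cell values are invented for the illustration):

```python
# -*- coding: utf-8 -*-
# Hypothetical row, as it might look after each cell has been decoded.
row = [u"bonjour \xe0 tous", u"d\xe9cide"]

# Variant 1: join the cells, then split on whitespace.
# The result has one item per word, and cell boundaries are lost.
words = u" ".join(row).split()

# Variant 2: keep one item per cell.
# Spaces inside a cell are preserved.
cells = list(row)
```

Here `words` has four items while `cells` has two, so which variant is right depends on whether you want to process words or cells.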

If you want to write something back to a file, you should first convert the unicode back to a str, for example with my_unicode.encode("utf-8"), where my_unicode is your unicode string.
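A minimal sketch of writing text back out, assuming UTF-8 is the desired file encoding. The output path and the sample words are hypothetical. Two options are shown: encoding by hand, and letting `io.open` do the encoding.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

# Hypothetical data and output location, for illustration only.
words = [u"d\xe9cide", u"\xe0"]
line = u" ".join(words) + u"\n"
out_dir = tempfile.mkdtemp()
path_bytes = os.path.join(out_dir, "out_bytes.txt")
path_text = os.path.join(out_dir, "out_text.txt")

# Option 1: encode explicitly and write the resulting bytes.
with open(path_bytes, "wb") as f:
    f.write(line.encode("utf-8"))

# Option 2: open the file in text mode with an encoding;
# io.open then encodes each unicode string on write.
with io.open(path_text, "w", encoding="utf-8") as f:
    f.write(line)

# Read both files back to check what was stored.
with open(path_bytes, "rb") as f:
    raw = f.read()
with io.open(path_text, encoding="utf-8") as f:
    text_back = f.read()
```

Both options store the same characters; the second keeps the encoding decision in one place, at the point where the file is opened.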

Thanks for the answer. So it means that I can work with the data even if it doesn't look good in Python ? — Nahid O., Mar 21 '16 at 13:47
Yes, you can process unicode with Python. Considering your new comments, I assume you are referring to the output of `print`. You might be interested in the differences between [`str()` and `repr()`](http://stackoverflow.com/questions/19331404/str-vs-repr-functions-in-python-2-7-5). `print` uses the `str`-representation. A `list` uses the `repr`-representation of its elements for its own `str`-representation. To get your desired output use `print "[" + ", ".join(row2) + "]"`. — jakun, Mar 21 '16 at 14:15
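The suggestion in the comment above can be wrapped in a small helper. This is only a sketch: `format_row` is a made-up name and the sample row is invented.

```python
# -*- coding: utf-8 -*-
# Hypothetical row of unicode strings, as produced by the code above.
row2 = [u"d\xe9cide", u"\xe0", u"3min"]


def format_row(cells):
    """Build a list-like display from the text of each cell.

    Printing the list itself would show the repr of each element,
    which on Python 2 means escapes such as u'd\\xe9cide'.
    Joining the cells uses their text instead.
    """
    return u"[" + u", ".join(cells) + u"]"


formatted = format_row(row2)
# On a terminal that can display accents, printing `formatted`
# shows: [décide, à, 3min]
```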
