43

The csv module in Python doesn't work properly when there's UTF-8/Unicode involved. I have found, in the Python documentation and on other webpages, snippets that work for specific cases, but you have to understand well which encoding you are handling and use the appropriate snippet.

How can I read and write both strings and Unicode strings from .csv files that "just works" in Python 2.6? Or is this a limitation of Python 2.6 that has no simple solution?

Air

10 Answers

52

The example code for reading Unicode given at http://docs.python.org/library/csv.html#examples appears to be obsolete, as it doesn't work with Python 2.6 or 2.7.

Here is a UnicodeDictReader that works with utf-8, and possibly with other encodings, though I have only tested it on utf-8 input.

The idea in short is to decode Unicode only after a csv row has been split into fields by csv.reader.

import csv

class UnicodeCsvReader(object):
    def __init__(self, f, encoding="utf-8", **kwargs):
        self.csv_reader = csv.reader(f, **kwargs)
        self.encoding = encoding

    def __iter__(self):
        return self

    def next(self):
        # read and split the csv row into fields
        row = self.csv_reader.next() 
        # now decode
        return [unicode(cell, self.encoding) for cell in row]

    @property
    def line_num(self):
        return self.csv_reader.line_num

class UnicodeDictReader(csv.DictReader):
    def __init__(self, f, encoding="utf-8", fieldnames=None, **kwds):
        csv.DictReader.__init__(self, f, fieldnames=fieldnames, **kwds)
        self.reader = UnicodeCsvReader(f, encoding=encoding, **kwds)

Usage (source file encoding is utf-8):

csv_lines = (
    "абв,123",
    "где,456",
)

for row in UnicodeCsvReader(csv_lines):
    for col in row:
        print type(col), col

Output:

$ python test.py
<type 'unicode'> абв
<type 'unicode'> 123
<type 'unicode'> где
<type 'unicode'> 456
Maxim Egorushkin
  • Worked very well for me. I did add BOM ('codecs.BOM_UTF8') to start of stream when saving for Excel. (The only residual issue I am wrestling with is why I am losing some line breaks when saving and then re-loading, but it could be my fault.) – Soferio Oct 04 '14 at 02:07
  • What would this code look like for Python 3? I tried your code with Python 3 and keep getting the error: iter() returned non-iterator of type 'UnicodeCsvReader' – Jose Cabrera Zuniga Nov 16 '22 at 16:04
  • @JoseCabreraZuniga `python-3` strings are unicode, `read`/`write` do decode/encode for you. What is your problem? – Maxim Egorushkin Dec 10 '22 at 02:36
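To expand on the comment above: this is only a sketch of a Python 3 adaptation, not part of the original answer. Python 3 iterators define `__next__` rather than `next`, and the built-in csv module already yields `str`, so the decoding step disappears entirely:

```python
import csv
import io

class UnicodeCsvReader:
    """Python 3 sketch: csv.reader already yields str, so no decoding is needed."""
    def __init__(self, f, **kwargs):
        self.csv_reader = csv.reader(f, **kwargs)

    def __iter__(self):
        return self

    def __next__(self):  # Python 3 iterators define __next__, not next
        return next(self.csv_reader)

    @property
    def line_num(self):
        return self.csv_reader.line_num

# same sample rows as in the answer's usage section
rows = list(UnicodeCsvReader(io.StringIO("абв,123\nгде,456\n")))
```

In practice the wrapper class is redundant in Python 3: `open(path, encoding="utf-8")` plus plain `csv.reader` gives the same result.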
32

A little late answer, but I have used unicodecsv with great success.

Serafeim
  • I haven't tested it too much, but this package has a couple of advantages over the `ucsv` mentioned in @itsadok's answer: (1) it's fully packaged up for production use, including tests; and (2) it really tries to **just** add Unicode-awareness to the `csv` module, rather than silently converting any values that it can into numbers (somewhat like Excel). Granted, some people may *like* automatic conversion, but that's not something that Python's `csv` module ever intended. – John Y Feb 28 '13 at 16:52
  • unicodecsv.DictReader has one huge gotcha: you must open the file as binary for it to work as expected with Unicode. Otherwise, you will still get a UnicodeEncodeError when you access a given row. fileHandle = io.open(file_path, "rb"); my_unicode_dictionary = unicodecsv.DictReader(fileHandle); row = my_unicode_dictionary.next() – Ed J Nov 17 '14 at 23:27
22

The module provided here looks like a cool, simple, drop-in replacement for the csv module that allows you to work with utf-8 CSV files.

import ucsv as csv
with open('some.csv', 'rb') as f:
    reader = csv.reader(f)
    for row in reader:
        print row
itsadok
  • Thanks for this, it also includes the super useful `DictReader` interface, which allows you to treat each row of the CSV file as a dictionary where the keys are taken from the fieldnames in the first row. – Apr 14 '12 at 01:17
  • For some uses, you have to be careful with `ucsv`, since it tries to convert "numeric-looking" data for you. (It's easy enough to modify `ucsv` not to do that, but you just have to be aware it's there.) – John Y Feb 28 '13 at 16:56
  • I still get unicode errors with this module of the form `UnicodeDecodeError: 'utf8' codec can't decode byte 0xc9 in position 179: invalid continuation byte` – Mittenchops Jul 16 '13 at 15:02
  • @Mittenchops Sounds like your file is not valid utf8. – itsadok Jul 17 '13 at 04:23
7

There is already a Unicode usage example in that doc, so why look for another one or reinvent the wheel?

import csv

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.reader(utf_8_encoder(unicode_csv_data),
                            dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, cell by cell:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')
YOU
  • Doesn't work for me on linux: r = unicode_csv_reader(file('/tmp/csv-unicode.csv').read().split('\n')); r.next(). Gives: UnicodeDecodeError: 'ascii' codec can't decode byte 0xf8 in position 14: ordinal not in range(128) – Parand Feb 16 '11 at 07:13
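The failure in the comment above comes from feeding raw byte strings to a wrapper that expects already-decoded Unicode strings. A sketch of the fix, written in Python 3 terms purely for illustration (where the bytes/text split is enforced): decode each raw line before the csv reader sees it.

```python
import csv

def utf_8_decoder(byte_lines):
    # decode each raw byte line to text before handing it to csv.reader
    for line in byte_lines:
        yield line.decode('utf-8')

# simulated raw lines as read from a utf-8 file in binary mode
raw_lines = ["абв,123".encode('utf-8')]
rows = list(csv.reader(utf_8_decoder(raw_lines)))
```

The equivalent fix in Python 2 is to decode the file's contents (or use codecs.open) before passing the lines to unicode_csv_reader.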
4

I can confirm that unicodecsv is a great replacement for the csv module. I've just replaced csv with unicodecsv in my source code, and it works like a charm.

Pierre GM
  • Try to add a comment to the answer you're agreeing with instead of creating a new answer (once you have more than 50 rep). – Pierre GM Sep 19 '12 at 12:21
  • I know, but I don't have enough reputation to comment under an answer :-( Nevertheless, you should vote for my answer so I gain reputation ;-) – Ludovic Gasc - GMLudo Sep 24 '12 at 13:19
3

The wrapper unicode_csv_reader mentioned in the Python documentation accepts Unicode strings. This is because csv does not accept Unicode strings: csv is probably not aware of encoding or locale and just treats the strings it gets as bytes. So the wrapper encodes the Unicode strings, producing strings of bytes. Then, when the wrapper hands back the results from csv, it decodes the bytes again, converting the UTF-8 byte sequences back to the correct Unicode characters.

If you give the wrapper plain byte strings, e.g. from f.readlines(), it will raise a UnicodeDecodeError on bytes with a value > 127. You would use the wrapper when you already have Unicode strings in CSV format in your program.

I can imagine that the wrapper still has one limitation: since csv does not accept Unicode, and it also does not accept multi-byte delimiters, you can't parse files that use a Unicode character as the delimiter.
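The encode-split-decode round trip this answer describes can be sketched in a few lines (Python 3 syntax for illustration only; the real wrapper does this per line via csv.reader rather than a bare split):

```python
text = "абв,где"                         # Unicode string inside the program
encoded = text.encode("utf-8")           # wrapper encodes to bytes for csv
fields = encoded.split(b",")             # csv splits on the single-byte delimiter
decoded = [field.decode("utf-8") for field in fields]  # wrapper decodes each field back
```

The split is safe because UTF-8 never uses the byte value of ASCII "," inside a multi-byte sequence, which is exactly why the encode-first trick works.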

gaston
2

Maybe this is blatantly obvious, but for sake of beginners I'll mention it.

In Python 3.x the csv module supports any encoding out of the box, so if you use this version you can stick with the standard module.

import csv

with open("foo.csv", encoding="utf-8") as f:
    r = csv.reader(f, delimiter=";")
    for row in r:
        print(row)

For additional discussion please see: Does python 3.1.3 support unicode in csv module?

jb.
1

You should consider tablib, which takes a completely different approach but should be considered under the "just works" requirement.

with open('some.csv', 'rb') as f:
    csv = f.read().decode("utf-8")

import tablib
ds = tablib.Dataset()
ds.csv = csv
for row in ds.dict:
    print row["First name"]

Warning: tablib will reject your csv if it doesn't have the same number of items on every row.

itsadok
1

Here is a slightly improved version of Maxim's answer, which can also skip the UTF-8 BOM:

import csv
import codecs

class UnicodeCsvReader(object):
    def __init__(self, csv_file, encoding='utf-8', **kwargs):
        if encoding == 'utf-8-sig':
            # convert from utf-8-sig (= UTF8 with BOM) to plain utf-8 (without BOM):
            self.csv_file = codecs.EncodedFile(csv_file, 'utf-8', 'utf-8-sig')
            encoding = 'utf-8'
        else:
            self.csv_file = csv_file
        self.csv_reader = csv.reader(self.csv_file, **kwargs)
        self.encoding = encoding

    def __iter__(self):
        return self

    def next(self):
        # read and split the csv row into fields
        row = self.csv_reader.next() 
        # now decode
        return [unicode(cell, self.encoding) for cell in row]

    @property
    def line_num(self):
        return self.csv_reader.line_num

class UnicodeDictReader(csv.DictReader):
    def __init__(self, csv_file, encoding='utf-8', fieldnames=None, **kwds):
        reader = UnicodeCsvReader(csv_file, encoding=encoding, **kwds)
        csv.DictReader.__init__(self, reader.csv_file, fieldnames=fieldnames, **kwds)
        self.reader = reader

Note that the presence of the BOM is not automatically detected. You must signal it is there by passing the encoding='utf-8-sig' argument to the constructor of UnicodeCsvReader or UnicodeDictReader. Encoding utf-8-sig is utf-8 with a BOM.
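The difference between the two encodings can be seen directly with the standard library alone (a standalone sketch, independent of the classes above; the sample row is made up):

```python
import csv
import io

# a UTF-8 BOM followed by one CSV row
raw = b"\xef\xbb\xbf" + "абв,123\n".encode("utf-8")

# plain utf-8 leaves the BOM as \ufeff glued to the first field
with_bom = next(csv.reader(io.StringIO(raw.decode("utf-8"))))

# utf-8-sig strips the BOM during decoding
without_bom = next(csv.reader(io.StringIO(raw.decode("utf-8-sig"))))
```

This is why the encoding must be signalled explicitly: from the reader's point of view, the stray \ufeff is just another character in the first cell.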

Matthias
0

I would add to itsadok's answer: by default, Excel saves csv files as latin-1 (which ucsv does not support). You can easily fix this by:

import codecs
import StringIO

import ucsv

with codecs.open(csv_path, 'rb', 'latin-1') as f:
    f = StringIO.StringIO(f.read().encode('utf-8'))

reader = ucsv.UnicodeReader(f)
# etc.
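The same latin-1-to-utf-8 transcoding step can be sketched with the Python 3 io module (for illustration; "café,1" is made-up sample data and the snippet above is Python 2):

```python
import csv
import io

latin1_bytes = "café,1\n".encode("latin-1")   # what Excel might have written
text = latin1_bytes.decode("latin-1")         # read it back as latin-1 text
rows = list(csv.reader(io.StringIO(text)))    # parse the decoded text directly
```

In Python 3 no re-encoding buffer is needed at all: `open(csv_path, encoding="latin-1", newline="")` feeds csv.reader decoded text directly.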
sam-6174