54
  1. I have a list of dictionaries containing unicode strings.
  2. csv.DictWriter can write a list of dictionaries into a CSV file.
  3. I want the CSV file to be encoded in UTF8.
  4. The csv module cannot handle converting unicode strings into UTF8.
  5. The csv module documentation has an example for converting everything to UTF8:

    def utf_8_encoder(unicode_csv_data):
        for line in unicode_csv_data:
            yield line.encode('utf-8')
    
  6. It also has a UnicodeWriter class.

But... how do I make DictWriter work with these? Wouldn't they have to inject themselves in the middle of it, to catch the disassembled dictionaries and encode them before it writes them to the file? I don't get it.

Michael
endolith

6 Answers

101

UPDATE: The 3rd party unicodecsv module implements this 7-year-old answer for you. An example follows the code below. There's also a Python 3 solution that doesn't require a 3rd party module.

Original Python 2 Answer

If using Python 2.7 or later, use a dict comprehension to remap the dictionary to UTF-8 before passing it to DictWriter:

# coding: utf-8
import csv
D = {'name':u'马克','pinyin':u'mǎkè'}
f = open('out.csv','wb')
f.write(u'\ufeff'.encode('utf8')) # BOM (optional...Excel needs it to open UTF-8 file properly)
w = csv.DictWriter(f,sorted(D.keys()))
w.writeheader()
w.writerow({k:v.encode('utf8') for k,v in D.items()})
f.close()

You can use this idea to update UnicodeWriter to DictUnicodeWriter:

# coding: utf-8
import csv
import cStringIO
import codecs

class DictUnicodeWriter(object):

    def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, D):
        self.writer.writerow({k:v.encode("utf-8") for k,v in D.items()})
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for D in rows:
            self.writerow(D)

    def writeheader(self):
        self.writer.writeheader()

D1 = {'name':u'马克','pinyin':u'Mǎkè'}
D2 = {'name':u'美国','pinyin':u'Měiguó'}
f = open('out.csv','wb')
f.write(u'\ufeff'.encode('utf8')) # BOM (optional...Excel needs it to open UTF-8 file properly)
w = DictUnicodeWriter(f,sorted(D1.keys()))
w.writeheader()
w.writerows([D1,D2])
f.close()

Python 2 unicodecsv Example:

# coding: utf-8
import unicodecsv as csv

D = {u'name':u'马克',u'pinyin':u'mǎkè'}

with open('out.csv','wb') as f:
    w = csv.DictWriter(f,fieldnames=sorted(D.keys()),encoding='utf-8-sig')
    w.writeheader()
    w.writerow(D)

Python 3:

Additionally, Python 3's built-in csv module supports Unicode natively:

# coding: utf-8
import csv

D = {u'name':u'马克',u'pinyin':u'mǎkè'}

# Use newline='' instead of 'wb' in Python 3.
with open('out.csv','w',encoding='utf-8-sig',newline='') as f:
    w = csv.DictWriter(f,fieldnames=sorted(D.keys()))
    w.writeheader()
    w.writerow(D)
Mark Tolonen
  • I thought downgrading to Python(x, y) 2.6.6.0 would make things easier. :) – endolith Apr 30 '11 at 01:50
  • @endolith: You could use `dict((k, v.encode('utf-8') if isinstance(v, unicode) else v) for k, v in D.iteritems())` instead of dict comprehension on Python 2.6. – jfs Apr 30 '11 at 05:37
  • If `v` is not Unicode, and not UTF8 encoded, you now have a mess, though. – Mark Tolonen Oct 08 '17 at 01:34
  • This class worked beautifully for me writing data from an API with JSON output. However, some of the incoming data was parsed as floats, so I'd get exceptions about float not having the method `encode()`. To fix this, I simply wrapped `v` with the `unicode()` (_not_ `str()`!) constructor in `writerow()`: `self.writer.writerow({k:unicode(v).encode("utf-8") for k,v in D.items()})`. This should fix what @Mark Tolonen mentions. – Demonslay335 Apr 10 '18 at 20:13
  • You deserve going to heaven after this! – Onilol May 22 '18 at 14:25
  • @leonwu This is a 7-year old answer, but yes `unicodecsv` (and Python 3) are better answers now. Updated... – Mark Tolonen Oct 19 '18 at 22:57
41

There is a simple workaround using the wonderful unicodecsv module. Once it's installed, just change the line

import csv

to

import unicodecsv as csv

And it automagically begins playing nice with UTF-8.
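
A minimal sketch of that drop-in swap (the sample data and filename here are my own, assuming unicodecsv is installed):

# coding: utf-8
import unicodecsv as csv

rows = [{u'name': u'Björk'}, {u'name': u'Dvořák'}]

with open('out.csv', 'wb') as f:  # binary mode, since this is Python 2
    w = csv.DictWriter(f, fieldnames=[u'name'], encoding='utf-8')
    w.writeheader()
    w.writerows(rows)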

Note: Switching to Python 3 will also rid you of this problem (thanks jamescampbell for the tip). And it's something one should do anyway.

rlafuente
15

You can convert the values to UTF-8 on the fly as you pass the dict to DictWriter.writerow(). For example:

import csv

rows = [
    {'name': u'Anton\xedn Dvo\u0159\xe1k','country': u'\u010cesko'},
    {'name': u'Bj\xf6rk Gu\xf0mundsd\xf3ttir', 'country': u'\xcdsland'},
    {'name': u'S\xf8ren Kierkeg\xe5rd', 'country': u'Danmark'}
    ]

# implement this wrapper on 2.6 or lower if you need to output a header
class DictWriterEx(csv.DictWriter):
    def writeheader(self):
        header = dict(zip(self.fieldnames, self.fieldnames))
        self.writerow(header)

out = open('foo.csv', 'wb')
writer = DictWriterEx(out, fieldnames=['name','country'])
# DictWriter.writeheader() was added in 2.7 (use class above for <= 2.6)
writer.writeheader()
for row in rows:
    writer.writerow(dict((k, v.encode('utf-8')) for k, v in row.iteritems()))
out.close()

Output foo.csv:

name,country
Antonín Dvořák,Česko
Björk Guðmundsdóttir,Ísland
Søren Kierkegård,Danmark
samplebias
  • Nice one. I liked the implementation of one liner writerow func. – shahjapan Feb 11 '13 at 15:04
  • `writer.writerow(dict((k, v.encode('utf-8') if type(v) is unicode else v) for k, v in row.iteritems()))` encodes only the unicode values, because int/list values don't have an `encode` method. – arulraj.net Nov 06 '14 at 02:12
6

You can use a proxy class to encode dict values as needed, like this:

# -*- coding: utf-8 -*- 
import csv
d = {'a':123,'b':456, 'c':u'Non-ASCII: проверка'}

class DictUnicodeProxy(object):
    def __init__(self, d):
        self.d = d
    def __iter__(self):
        return self.d.__iter__()
    def get(self, item, default=None):
        i = self.d.get(item, default)
        if isinstance(i, unicode):
            return i.encode('utf-8')
        return i

with open('some.csv', 'wb') as f:
    writer = csv.DictWriter(f, ['a', 'b', 'c'])
    writer.writerow(DictUnicodeProxy(d))
jfs
Daniel Kluev
2

When you call csv.writer with your content, the idea is to pass the content through utf_8_encoder first, so that the writer only ever receives UTF-8 encoded byte strings.
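
A rough sketch of that idea applied to DictWriter (the helper name and sample data below are mine, not from the docs): encode each unicode value to UTF-8 before the row reaches the writer.

# coding: utf-8
import csv

def utf_8_encode_values(row):
    # hypothetical counterpart to the docs' utf_8_encoder, applied to dict values
    return dict((k, v.encode('utf-8') if isinstance(v, unicode) else v)
                for k, v in row.items())

row = {'name': u'mǎkè'}
with open('out.csv', 'wb') as f:
    w = csv.DictWriter(f, fieldnames=['name'])
    w.writeheader()  # requires Python 2.7
    w.writerow(utf_8_encode_values(row))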

Senthil Kumaran
1

My solution is a bit different. While all the solutions above focus on making the dict Unicode compatible, my solution makes DictWriter itself compatible with unicode. This approach is even suggested in the python docs (1).

The classes UTF8Recoder, UnicodeReader, and UnicodeWriter are taken from the python docs. UnicodeWriter.writerow was changed a little bit too.

Use it as a regular DictWriter/DictReader.

Here is the code:

import csv, codecs, cStringIO

class UTF8Recoder:
    """
    Iterator that reads an encoded stream and reencodes the input to UTF-8
    """
    def __init__(self, f, encoding):
        self.reader = codecs.getreader(encoding)(f)

    def __iter__(self):
        return self

    def next(self):
        return self.reader.next().encode("utf-8")

class UnicodeReader:
    """
    A CSV reader which will iterate over lines in the CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        f = UTF8Recoder(f, encoding)
        self.reader = csv.reader(f, dialect=dialect, **kwds)

    def next(self):
        row = self.reader.next()
        return [unicode(s, "utf-8") for s in row]

    def __iter__(self):
        return self

class UnicodeWriter:
    """
    A CSV writer which will write rows to CSV file "f",
    which is encoded in the given encoding.
    """

    def __init__(self, f, dialect=csv.excel, encoding="utf-8", **kwds):
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.writer(self.queue, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow([unicode(s).encode("utf-8") for s in row])
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

class UnicodeDictWriter(csv.DictWriter, object):
    def __init__(self, f, fieldnames, restval="", extrasaction="raise", dialect="excel", *args, **kwds):
        # pass the caller's options through to DictWriter instead of hard-coded defaults
        super(UnicodeDictWriter, self).__init__(f, fieldnames, restval=restval, extrasaction=extrasaction, dialect=dialect, *args, **kwds)
        self.writer = UnicodeWriter(f, dialect, **kwds)
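
A quick usage sketch (the sample rows are mine, written with unicode escapes so no source-encoding declaration is needed); use it just like a regular DictWriter:

D1 = {'name': u'\u9a6c\u514b', 'pinyin': u'M\u01cek\xe8'}
D2 = {'name': u'\u7f8e\u56fd', 'pinyin': u'M\u011bigu\xf3'}

with open('out.csv', 'wb') as f:
    w = UnicodeDictWriter(f, fieldnames=sorted(D1.keys()))
    w.writeheader()
    w.writerows([D1, D2])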
b1r3k