UnicodeEncodeErrors while using DictWriter for utf-8

Question

I am trying to write a dictionary containing utf-8 strings to a CSV. I'm following the instructions from here. However, despite meticulously encoding and decoding these utf-8 strings, I am getting a UnicodeEncodeErrors involving 'ascii' sets.

I have a list of dictionaries which contain strings and ints as values related to changes to Wikipedia articles. The list below corresponds to this change, for example:

edgelist = [{'articleName': 'Barack Obama', 'editorName': 'Schonbrunn', 'revID': '121844749', 'bytesAdded': '183'}, 
{'articleName': 'Barack Obama', 'editorName': 'Eep\xc2\xb2', 'revID': '121862749', 'bytesAdded': '107'}]

The problem is list[1]['editorName']. It has type 'str' and el[1]['editorName'].decode('utf-8') is u'Eep\xb2'

The code I am attempting is:

_ENCODING = 'utf-8'
def dictToCSV(edgelist,output_file):
    with codecs.open(output_file,'wb',encoding=_ENCODING) as f:
        w = csv.DictWriter(f,sorted(edgelist[0].keys()))
        w.writeheader()
        for d in edgelist:
            for k,v in d.items():
                if type(v) == int:
                    d[k]=str(v).encode(_ENCODING)
            w.writerow({k:v.decode(_ENCODING) for k,v in d.items()})

This returns:

dictToCSV(edgelist,'test2.csv')
File "csv_to_charts.py", line 129, in dictToCSV
w.writerow({k:v.decode(_ENCODING,'ignore') for k,v in d.items()})
File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 148, in writerow
return self.writer.writerow(self._dict_to_list(rowdict))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xb2' in position 3: ordinal not in range(128)

Other permutations such as swapping decode for encode or nothing in the final problematic line also return errors:

w.writerow({k:v.encode(_ENCODING) for k,v in d.items()}) returns 'UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 56: ordinal not in range(128)
w.writerow({k:v for k,v in d.items()}) returns UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 56: ordinal not in range(128)
Following this, I changed with codecs.open(output_file,'wb',encoding=_ENCODING) as f: to with open(output_file,'wb') as f: and still receive the same error.

Excluding the list element(s) or the keys containing this problematic string, the script works fine otherwise.

Li Xiong · Accepted Answer · 2012-08-03T18:33:29.550

I just edited your code as follows and the csv was written successfully.

from django.utils.encoding import smart_str
import csv

def dictToCSV(edgelist, output_file):
    f = open(output_file, 'wb')
    w = csv.DictWriter(f, fieldnames=sorted(edgelist[0].keys()))
    w.writeheader()
    for d in edgelist:
        w.writerow(dict(k=smart_str(v)) for k, v in d.items())
    f.close()

Copy the Django code and customize it to your need.

score 0 · Answer 2 · edited Oct 07 '21 at 06:26

A strict interpretation of ASCII encoding only allows ordinals 0-127. Any value outside that range is not ASCII by definition. Since both \xc2 & \xb2 have ordinals higher than 127, they cannot be interpreted as ASCII.

I'm not a Python user, the RFC for CSV mentions ASCII as a common usage but defines an optional 'charset' parameter for the MIME type; I wonder if the writer you're using also might have an 'encoding' setting?

score 0 · Answer 3 · edited May 23 '17 at 11:51

Your strings are already in UTF-8, and DictWriter doesn't work with codecs.open. Following that example:

# coding: utf-8
import csv

edgelist = [
    {'articleName': 'Barack Obama', 'editorName': 'Schonbrunn', 'revID': '121844749', 'bytesAdded': '183'},
    {'articleName': 'Barack Obama', 'editorName': 'Eep\xc2\xb2', 'revID': '121862749', 'bytesAdded': '107'}]

with open('out.csv','wb') as f:
    f.write(u'\ufeff'.encode('utf8')) # BOM (optional...Excel needs it to open UTF-8 file properly)
    w = csv.DictWriter(f,sorted(edgelist[0].keys()))
    w.writeheader()
    for d in edgelist:
        w.writerow(d)

Output:

articleName,bytesAdded,editorName,revID
Barack Obama,183,Schonbrunn,121844749
Barack Obama,107,Eep²,121862749

Note, you can use 'editorName': 'Eep²' directly instead of 'editorName': 'Eep\xc2\xb2'. The byte string will be UTF-8-encoded per the # coding: utf-8 and if you save the source file in UTF-8.

UnicodeEncodeErrors while using DictWriter for utf-8

3 Answers3