
I'm trying to write to a file and I get the following error:

Traceback (most recent call last):
  File "/private/var/folders/jv/9_sy0bn10mbdft1bk9t14qz40000gn/T/Cleanup At Startup/merge-395780681.888.py", line 151, in <module>
    gc_all_d.writerow(row)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/csv.py", line 148, in writerow
    return self.writer.writerow(self._dict_to_list(rowdict))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0329' in position 5: ordinal not in range(128)

The error occurs when I try to write a row from a database of counselors to a file that aggregates their names:

# compile master spreadsheet
with open('gc_all.txt_3', 'w') as gc_all:
    gc_all_d = csv.DictWriter(gc_all, fieldnames=fieldnames, extrasaction='ignore', delimiter='\t')
    gc_all_d.writeheader()
    for row in aicep_l:
        print row['name']
        gc_all_d.writerow(row)
    for row in nbcc_l:
        gc_all_d.writerow(row)
        print row['name']

I'm in unfamiliar waters here. I don't see any parameter on the writerow() method that would widen the encoding range to cover a character like u'\u0329'.

I think the error may have something to do with the fact that I'm using the nameparser module to put all of the counselors' names into the same format. The HumanName class imported from nameparser seems to return the names as unicode objects (u'Sam the Man' rather than 'Sam the Man'), which the csv writer may not accept.
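
To check that hypothesis, a quick probe (just a sketch, assuming the nameparser usage shown further down):

from nameparser import HumanName

name = HumanName('Sam the Man')
print type(name.first)  # if the hypothesis is right, this prints <type 'unicode'>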

Thanks for the help!


ERROR following amendment based on answer:

  File "/private/var/folders/jv/9_sy0bn10mbdft1bk9t14qz40000gn/T/Cleanup At Startup/merge-395782963.700.py", line 153, in <module>
    row['name'] = row['name'].encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 11: ordinal not in range(128)

Code that makes all of the name entries uniform:

# nbcc
import csv
from string import whitespace, punctuation  # needed for the translate() call below

with open('/Users/samuelfinegold/Documents/noodle/gc/nbcc/nbcc_output.txt', 'rU') as nbcc:
    nbcc_d = csv.DictReader(nbcc, delimiter='\t')
    nbcc_l = []
    for row in nbcc_d:
#         name = HumanName(row['name'])
#         row['name'] = name.title + ' ' + name.first + ' ' + name.middle + ' ' + name.last + ' ' + name.suffix       
        row['phone'] = row['phone'].translate(None, whitespace + punctuation)
        nbcc_l.append(row)
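
To see what is actually coming out of this step, a throwaway type check (a sketch) helps:

for row in nbcc_l:
    print type(row['name']), repr(row['name'])  # str means already encoded bytes, unicode means not
    break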

Amended code:

# compile master spreadsheet
with open('gc_all.txt_3', 'w') as gc_all:
    gc_all_d = csv.DictWriter(gc_all, fieldnames=fieldnames, extrasaction='ignore', delimiter='\t')
    gc_all_d.writeheader()
    for row in nbcc_l:
        row['name'] = row['name'].encode('utf-8')
        gc_all_d.writerow(row)

Error:

Traceback (most recent call last):
  File "/private/var/folders/jv/9_sy0bn10mbdft1bk9t14qz40000gn/T/Cleanup At Startup/merge-395784700.086.py", line 153, in <module>
    row['name'] = row['name'].encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 11: ordinal not in range(128)
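
For reference, the same UnicodeDecodeError can be reproduced in isolation by calling encode on a byte string that already holds UTF-8 data (a sketch; the name is made up):

>>> s = u'Pere\u0329z'.encode('utf-8')  # already UTF-8 bytes: 'Pere\xcc\xa9z'
>>> s.encode('utf-8')                   # Python 2 first attempts an implicit ASCII decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 4: ordinal not in range(128)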
  • possible duplicate of [Parsing csv file with english and hindi characters in python](http://stackoverflow.com/questions/17661093/parsing-csv-file-with-english-and-hindi-characters-in-python) – abarnert Jul 17 '13 at 19:32
  • It's also probably a dup of half the Related questions both here and on the target. – abarnert Jul 17 '13 at 19:32

2 Answers


From the docs:

This version of the csv module doesn’t support Unicode input. Also, there are currently some issues regarding ASCII NUL characters. Accordingly, all input should be UTF-8 or printable ASCII to be safe; see the examples in section Examples.

You'll need to encode your data before writing it - something like:

for row in aicep_l:
    print row['name']
    for key, value in row.iteritems():
        row[key] = value.encode('utf-8')
    gc_all_d.writerow(row)

Or, since you're on 2.7, you can use a dictionary comprehension:

for row in aicep_l:
    print row['name']
    row = {key: value.encode('utf-8') for key, value in row.iteritems()}
    gc_all_d.writerow(row)

Or use some of the more sophisticated patterns on the examples page in the docs.
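
For example, one defensive pattern (a hypothetical helper, not something from the docs) is to encode only unicode values and pass byte strings through untouched, which also avoids the UnicodeDecodeError you get from encoding the same value twice:

def encode_row(row, encoding='utf-8'):
    # Encode unicode values; leave already-encoded byte strings alone.
    return dict((key, value.encode(encoding) if isinstance(value, unicode) else value)
                for key, value in row.iteritems())

for row in aicep_l:
    gc_all_d.writerow(encode_row(row))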

Peter DeGlopper
  • File "/private/var/folders/jv/9_sy0bn10mbdft1bk9t14qz40000gn/T/Cleanup At Startup/merge-395782963.700.py", line 153, in row['name'] = row['name'].encode('utf-8') UnicodeDecodeError: 'ascii' codec can't decode byte 0xcc in position 11: ordinal not in range(128) – goldisfine Jul 17 '13 at 19:46
  • What's the full context? That could happen if you have something already encoded and then try to re-encode it, as described here: http://stackoverflow.com/questions/9644099/python-ascii-codec-cant-decode-byte - so it depends on where you're getting your rows. – Peter DeGlopper Jul 17 '13 at 19:52
  • Um, the full context... so I run this: `for row in nbcc_l: row['name'] = row['name'].encode('utf-8'); gc_all_d.writerow(row)` and all of the rows are type 'unicode'. So I'm not sure the error would be the same. – goldisfine Jul 17 '13 at 20:15
  • Comments don't handle code well - the best way to show that would be to edit it into the post. Eyeballing it, though, that's not quite the right solution unless `row` has only the one element. And I see that I overlooked in my answer that you're using a `DictWriter` - I'll edit it to correct that. – Peter DeGlopper Jul 17 '13 at 20:18
  • Think that I'm already doing that. 'name' is the key for all of the altered values. – goldisfine Jul 17 '13 at 20:25
  • Are all of your rows created through the same `DictReader` logic you've added to the question? `DictReader` results are utf-8 encoded, or more accurately they're exactly what's in your file, so you don't need to re-encode them and in fact will get an exception if you try unless they're pure ASCII. But your original post shows that you have Unicode strings somewhere. – Peter DeGlopper Jul 17 '13 at 20:29
  • When I use the nameparser module to make all the names uniform, I think it changes the value of the 'name' key to type unicode rather than string. I use the same logic to create all of the rows, and I've edited the question to show the section where I'm getting the error. The weird thing is that the error only occurs when I writerow() from the NBCC data source. The others are okay, despite the fact that I write the dictionaries in the same way. – goldisfine Jul 17 '13 at 20:45
  • Well, I am not familiar with your name parser module so it's possible that `HumanName` returns a unicode object even when given string input. In which case, you need to review your code to make sure that you're encoding it exactly once - that exception is what you get when you call `encode` on a string already containing the utf-8 encoding of the Unicode code point `\u0329`. As for why different data sources work differently - `encode` tries to decode if given a string argument, using the default codec (ascii, for your environment). Possibly only one of your sources has non-ascii names. – Peter DeGlopper Jul 17 '13 at 20:52
  • Ah, I was not encoding it only once! Thanks so much Peter for being so charitable with your time! help on!! – goldisfine Jul 17 '13 at 20:57

What you have is an output stream (your gc_all.txt_3 file, opened in the with statement, with the stream instance bound to the variable gc_all) that Python believes must hold nothing but ASCII. You've asked it to write a Unicode string containing the Unicode character u'\u0329'. For instance:

>>> s = u"foo\u0329bar"
>>> with open('/tmp/unicode.txt', 'w') as stream: stream.write(s)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0329' in position 3: ordinal not in range(128)

You have a bunch of options, including doing an explicit .encode on each string. Or, you can open the file with codecs.open as described in http://docs.python.org/2/howto/unicode.html (I'm assuming Python 2.x; 3.x is a little different):

>>> import codecs
>>> with codecs.open('/tmp/unicode.txt', 'w', encoding='utf-8') as stream:
...     stream.write(s)
... 
>>> 
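
Reading the file back with codecs.open decodes transparently as well, assuming the same encoding:

>>> with codecs.open('/tmp/unicode.txt', 'r', encoding='utf-8') as stream:
...     print repr(stream.read())
... 
u'foo\u0329bar'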

Edit to add: based on @Peter DeGlopper's answer, an explicit encode may be safer with the csv module. UTF-8 produces no NUL bytes (except for U+0000 itself), so assuming you want UTF-8, and usually one does, this may be OK.

torek