Trouble with UTF-8 CSV input in Python

Question

This seems like it should be an easy fix, but so far a solution has eluded me. I have a single-column CSV file containing non-ASCII characters, saved as UTF-8, that I want to read in and store in a list. I'm attempting to follow the principle of the "Unicode Sandwich" and decode upon reading the file in:

import codecs
import csv

with codecs.open('utf8file.csv', 'rU', encoding='utf-8') as file:
    input_file = csv.reader(file, delimiter=",", quotechar='|')
    list = []
    for row in input_file:
        list.extend(row)

This produces the dreaded 'codec can't encode characters in position, ordinal not in range(128)' error.

I've also tried adapting a solution from this answer, which returns a similar error:

def unicode_csv_reader(utf8_data, dialect=csv.excel, **kwargs):
    csv_reader = csv.reader(utf8_data, dialect=dialect, **kwargs)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

filename = 'inputs\encode.csv'
reader = unicode_csv_reader(open(filename))
target_list = []
for field1 in reader:
    target_list.extend(field1)

A very similar solution adapted from the docs returns the same error.

def unicode_csv_reader(utf8_data, dialect=csv.excel):
    csv_reader = csv.reader(utf_8_encoder(utf8_data), dialect)
    for row in csv_reader:
        yield [unicode(cell, 'utf-8') for cell in row]

def utf_8_encoder(unicode_csv_data):
    for line in unicode_csv_data:
        yield line.encode('utf-8')

filename = 'inputs\encode.csv'
reader = unicode_csv_reader(open(filename))
target_list = []
for field1 in reader:
    target_list.extend(field1)

Clearly I'm missing something. Most of the questions that I've seen regarding this problem seem to predate Python 2.7, so an update here might be useful.

The error message suggests it could be trying to decode the data as ASCII... — Lev Levitsky, May 22 '12 at 21:38
Your second example works for me, maybe you have broken csv module, I suggest you upgrade your python? Or maybe the error is elsewhere — Antti Haapala -- Слава Україні, May 22 '12 at 21:40
It sounds like your original CSV file isn't encoded as UTF-8 - can you confirm it is? Maybe it's UTF-16LE or something, or some other language-specific coding. You can use the chardet module to detect encoding. — Ansari, May 22 '12 at 21:42
@Ansari it is my understanding that it's impossible to reliably detect encoding based on only the file. I re-saved a copy of the file in utf-8 just now, retested, and got the same result so for our purposes here we can rule that out. — acpigeon, May 22 '12 at 21:50
@AnttiHaapala My version of Python came from a clean Windows install a few weeks ago, version is 2.7.3. — acpigeon, May 22 '12 at 21:52
@LevLevitsky that's certainly possible, the full error message is "UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128)". Does that give you any more insight? — acpigeon, May 22 '12 at 21:54
Fair enough - why don't you pinpoint the character at which it is failing? Print out some information about which lines are processed and see which line and character it is failing at. When I had this error it was because of a few stray characters that were mis-encoded. — Ansari, May 22 '12 at 21:55
@acpigeon: actually detecting UTF-8 is very easy. If you read the whole file you should be able to decode it as UTF-8, as in open('encode.csv').read().decode('UTF-8'). If it does not fail then with almost 100.0 % certainty it can be said that it is either plain ASCII, UTF-8, or ASCII characters as UTF-16, or UTF-32 ;) — Antti Haapala -- Слава Україні, May 22 '12 at 22:00
Print statements fail at the first occurrence of a non-ascii character. In my sample file, this would be 'Ú'. — acpigeon, May 22 '12 at 22:01
But this kind of error means that you are converting an unicode object to a str — Antti Haapala -- Слава Україні, May 22 '12 at 22:01
Ah, so you're not really making a sandwich. You get Unicode, but at some point Python tries to silently convert it to ASCII. — Lev Levitsky, May 22 '12 at 22:01
print does that always :D and that surely is discussed all the time in stackoverflow — Antti Haapala -- Слава Україні, May 22 '12 at 22:02
@acpigeon you never said you were printing it. Encode it as utf-8, and it will be printed right. — Lev Levitsky, May 22 '12 at 22:03
Not printing in any of my implementations, just to test a point @Ansari made. Correction, the test print statements work up until seeing a non-ascii char, at which point the error is triggered. Sorry! — acpigeon, May 22 '12 at 22:05
That sounds right, unless you do something like `print unistr.encode('utf-8')`. Apart from that, could you show the full traceback for the code in the question? — Lev Levitsky, May 22 '12 at 22:11
File "encoding.py", line 53, in for row in input_file: UnicodeEncodeError: 'ascii' codec can't encode characters in position 1-2: ordinal not in range(128) — acpigeon, May 22 '12 at 22:18
actually, have you tried it without doing anything special at all? just `csv.reader(the_file)`? — katy lavallee, May 22 '12 at 22:28
That error message is from your FIRST code snippet, which is guaranteed not to work. Also it's at line 53 in your file -- you are NOT showing all your code (bad idea). Show the full traceback and error message from your SECOND snippet. — John Machin, May 22 '12 at 22:30
Have you tried to [specify the `b` flag in `open`](http://docs.python.org/library/csv.html#module-contents)? — Lev Levitsky, May 22 '12 at 22:32
Sorry John, trying to isolate the problem code snippet within a larger script so I've commented out the irrelevant chunks. Tried moving just the second snippet to a test file and the error seems to go away. I'm going to try rebuilding my original script piece by piece and will update with the results. — acpigeon, May 22 '12 at 22:51
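
The check suggested in the comments above (read the raw bytes and try to decode them as UTF-8) can be sketched roughly as follows. The helper names `is_valid_utf8` and `_write_temp` and the sample data are made up for illustration. A successful decode only shows that the bytes are valid UTF-8, not that UTF-8 was the intended encoding.

```python
import os
import tempfile


def is_valid_utf8(path):
    """Return True if the whole file decodes as UTF-8 (hypothetical helper)."""
    with open(path, 'rb') as f:
        raw = f.read()
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


def _write_temp(data):
    """Write bytes to a temporary .csv file and return its path (demo only)."""
    fd, path = tempfile.mkstemp(suffix='.csv')
    with os.fdopen(fd, 'wb') as f:
        f.write(data)
    return path


# The same text saved in two encodings; u'\xda' is the character 'Ú'.
utf8_path = _write_temp(u'\xdabeda,1\n'.encode('utf-8'))
latin1_path = _write_temp(u'\xdabeda,1\n'.encode('latin-1'))

utf8_ok = is_valid_utf8(utf8_path)
latin1_ok = is_valid_utf8(latin1_path)

os.remove(utf8_path)
os.remove(latin1_path)
```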

score 18 · Accepted Answer · answered May 22 '12 at 22:52

Your first snippet won't work. You are feeding unicode data to the csv reader, which (as documented) can't handle it.

Your 2nd and 3rd snippets are confused. Something like the following is all that you need:

f = open('your_utf8_encoded_file.csv', 'rb')
reader = csv.reader(f)
for utf8_row in reader:
    unicode_row = [x.decode('utf8') for x in utf8_row]
    print unicode_row

John Machin


This works. Not sure exactly what in my original script was causing the problem, but such is life. Thanks. – acpigeon May 22 '12 at 23:20
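
For readers on Python 3, where the csv module works on text rather than bytes, the idea from the accepted answer above might look like the sketch below. It assumes Python 3. `read_utf8_csv` and the in-memory sample data are hypothetical stand-ins for a real file.

```python
import csv
import io


def read_utf8_csv(stream):
    """Return every cell of a UTF-8 CSV as one flat list of str.

    `stream` is a binary file object, e.g. the result of open(path, 'rb').
    Hypothetical helper, for illustration only.
    """
    # Decode once at the input boundary; newline='' leaves line-ending
    # handling to the csv module.
    text = io.TextIOWrapper(stream, encoding='utf-8', newline='')
    cells = []
    for row in csv.reader(text):
        cells.extend(row)
    return cells


# Two single-column rows containing non-ASCII characters.
sample = io.BytesIO(u'\xdabeda\nJa\xe9n\n'.encode('utf-8'))
cells = read_utf8_csv(sample)
```

With a real file, passing `open('your_utf8_encoded_file.csv', encoding='utf-8', newline='')` straight to `csv.reader` should do the same job.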

score 12 · Answer 2 · answered May 22 '12 at 22:07

As it fails on the first character read, you may have a BOM. Use codecs.open('utf8file.csv', 'rU', encoding='utf-8-sig') if your file is UTF-8 and has a BOM at the beginning.

Zeugma
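
To illustrate the difference between the two codec names mentioned above, here is a small sketch. The sample bytes are invented. It only shows how the BOM is handled during decoding, not whether a BOM is the cause of the error in the question.

```python
import codecs

# UTF-8 bytes with a leading byte order mark, as some Windows editors write.
raw = codecs.BOM_UTF8 + u'\xdabeda,1\n'.encode('utf-8')

# Plain 'utf-8' keeps the BOM as the character U+FEFF at the start.
plain = raw.decode('utf-8')

# 'utf-8-sig' drops a leading BOM if one is present.
stripped = raw.decode('utf-8-sig')
```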


score -2 · Answer 3 · answered May 22 '12 at 21:40

I'd suggest trying just:

input_file = csv.reader(open('utf8file.csv', 'r'), delimiter=",", quotechar='|')

or

input_file = csv.reader(open('utf8file.csv', 'rb'), delimiter=",", quotechar='|')

csv should be unicode aware, and it should just work.

Clarus


it is specifically NOT unicode aware, however neither does your example use unicode. – Antti Haapala -- Слава Україні May 22 '12 at 21:41
there is no such thing as "utf-8 aware" – Antti Haapala -- Слава Україні May 22 '12 at 21:48
