I need to import data from a CSV into my project, and I need an object like DictReader but with full UTF-8 support. Does anyone know of a module or app for this?
-
Your file appears to be encoded in `cp1252`, not `UTF-8` (see my answer) ... please respond. – John Machin Mar 30 '11 at 19:45
-
You're right, the encoding wasn't UTF-8, but when I tried your code I got an error about the keys, because the keys are in the second data row, not the first. – diegueus9 Apr 01 '11 at 17:44
-
The csv DictReader expects the keys to be the FIRST row in the file. Your "sampleresults.csv" file on `dropbox` has the keys in the FIRST row. If you have another file with junk in the first row, and keys in the second row, of course the outcome would be sub-optimal. Please show (0) What version of Python you are using, on what platform (1) what code you used to call my code, and (2) EXACTLY what the "error about keys" was. See also my updated answer. – John Machin Apr 01 '11 at 20:01
-
Never mind, I changed my code; I think the error was iterating twice over the file. – diegueus9 Apr 12 '11 at 20:43
2 Answers
Your data is NOT encoded in UTF-8. It is (mostly) encoded in cp1252. The data appears to include Spanish names. The most prevalent non-ASCII character is '\xd1' (i.e. Latin capital letter N with tilde) -- this is the character that caused the exception.
One of the non-ASCII characters in the file is '\x8d'. It is NOT in cp1252. It appears where the letter A should appear in the name VASQUEZ. Of the others, '\x94' (curly double quote in cp1252) appears in the middle of a name. The remaining ones may also represent errors.
I suggest that you run this little code fragment to print lines with suspicious characters in them:
for lino, line in enumerate(open('sampleresults.csv')):
    if any(c in line for c in '\x8d\x94\xc1\xcf\xd3'):
        print "%d %r\n" % (lino + 1, line)
and fix up the data.
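Once you have identified the bad lines, one way to fix them up is to rewrite the file with the offending bytes replaced. A minimal sketch, assuming the substitutions below are what the data actually needs; the '\x8d' -> '\xc1' mapping is only a guess based on the VASQUEZ example, so verify each flagged line before relying on it:

# Hypothetical clean-up pass; the byte substitutions are assumptions, not facts.
fixes = {
    '\x8d': '\xc1',  # guess: cp1252 A-acute where the stray byte sits in VASQUEZ
    '\x94': '"',     # cp1252 curly double quote -> plain double quote
}
with open('sampleresults.csv', 'rb') as src, open('sampleresults_fixed.csv', 'wb') as dst:
    for line in src:
        for bad, good in fixes.items():
            line = line.replace(bad, good)
        dst.write(line)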
Then you need a csv DictReader with full and generalised decoding support. Full means decoding the fieldnames aka dict keys as well as the data. Generalised means no hardcoding of the encoding.
import csv

def UnicodeDictReader(str_data, encoding, **kwargs):
    csv_reader = csv.DictReader(str_data, **kwargs)
    # Decode the keys once
    keymap = dict((k, k.decode(encoding)) for k in csv_reader.fieldnames)
    for row in csv_reader:
        yield dict((keymap[k], v.decode(encoding)) for k, v in row.iteritems())

dozedata = ['\xd1,\xff', '\xd2,\xfe', '3,4']
print list(UnicodeDictReader(dozedata, 'cp1252'))
Output:
[{u'\xd1': u'\xd2', u'\xff': u'\xfe'}, {u'\xd1': u'3', u'\xff': u'4'}]
and here is what you get with your sample file (first data row only, Python 2.7.1, Windows 7):
>>> import csv
>>> from pprint import pprint as pp
>>> def UnicodeDictReader(str_data, encoding, **kwargs):
...     csv_reader = csv.DictReader(str_data, **kwargs)
...     # Decode the keys once
...     keymap = dict((k, k.decode(encoding)) for k in csv_reader.fieldnames)
...     for row in csv_reader:
...         yield dict((keymap[k], v.decode(encoding)) for k, v in row.iteritems())
...
>>> f = open('sampleresults.csv', 'rb')
>>> drdr = UnicodeDictReader(f, 'cp1252')
>>> pp(drdr.next())
{u'APELLIDO': u'=== family names redacted ===',
u'CATEGORIA': u'ABIERTA',
u'CEDULA': u'10000640',
u'DELAY': u' 0:20',
u'EDAD': u'25',
u'EMAIL': u'mimail640',
u'NO.': u'640',
u'NOMBRE': u'=== given names redacted ===',
u'POSICION CATEGORIA': u'1',
u'POSICION CATEGORIA EN KM.5': u'11',
u'POSICION GENERAL CHIP': u'1',
u'POSICION GENERAL EN KM.5': u'34',
u'POSICION GENERAL GUN': u'1',
u'POSICION GENERO': u'1',
u'PRIMEROS 5KM.': u'0:32:55',
u'PROMEDIO/KM.': u' 5:44',
u'SEGUNDOS KM.': u'0:24:05',
u'SEX': u'M',
u'TIEMPO CHIP': u'0:56:59',
u'TIEMPO GUN': u'0:57:19'}
>>>
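If the end goal really is UTF-8 data (as the question asks), another option is to transcode the cp1252 file once and then let any UTF-8-aware tool, including the reader above with 'utf-8' as the encoding argument, consume the copy. A minimal sketch, assuming the file is cp1252 throughout; the output name sampleresults_utf8.csv is arbitrary:

import codecs

# Transcode cp1252 -> UTF-8 once; downstream code can then treat the copy as UTF-8.
with codecs.open('sampleresults.csv', 'r', encoding='cp1252') as src, \
     codecs.open('sampleresults_utf8.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(line)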

-
I have a CSV saved from Excel. I tried the encodings cp1252 (Windows) and utf8, and I used your `UnicodeDictReader` example to read the CSV data as a `dict`. After the data is parsed, I put it into a `Jinja2` template which uses only `utf8`. I found that the Arabic text is decoded wrong, because I get only question marks in the rendered template instead of Arabic. I use [this](http://pastebin.com/Wtij70Vi) code. – boldnik May 30 '14 at 11:37
As the answer to this post said:
def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield dict([(key, unicode(value, 'utf-8')) for key, value in row.iteritems()])
You can see my example code below. I'm using your CSV file (see comments).
import csv

def UnicodeDictReader(utf8_data, **kwargs):
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield dict([(key, unicode(value, 'utf-8')) for key, value in row.iteritems()])

f = open('sampleresults.csv', 'r')
a = UnicodeDictReader(f)
for i in a:
    if i['NOMBRE'] == 'GUIDO ALEJANDRO':
        print i['APELLIDO']
Output:
MUÑOZ RENGIFO
You can see that the 'Ñ' is correctly encoded.


-
It doesn't work; I got `'utf8' codec can't decode byte 0xd1 in position 2: invalid continuation byte` on the yield line. – diegueus9 Mar 29 '11 at 21:02
-
HELLO HELLO this can't work -- his file is definitely NOT encoded in UTF-8 (see my answer); how do you explain the error message that he got? – John Machin Mar 30 '11 at 07:40
-
-1 **AND** like the answer that you copied, it doesn't decode the keys (fieldnames) in the first row of the file. – John Machin Mar 30 '11 at 09:48