Python CSV DictReader with UTF-8 data

Question

AFAIK, the Python (v2.6) csv module can't handle unicode data by default, correct? In the Python docs there's an example of how to read from a UTF-8 encoded file, but that example only returns the CSV rows as lists. I'd like to access the row columns by name, as csv.DictReader does, but with a UTF-8 encoded CSV input file.

Can anyone tell me how to do this efficiently? I will have to process CSV files that are hundreds of megabytes in size.

score 54 · Accepted Answer · edited Jun 14 '17 at 22:32

I came up with an answer myself:

import csv

def UnicodeDictReader(utf8_data, **kwargs):
    # Python 2: csv.DictReader yields byte strings; decode keys and values.
    csv_reader = csv.DictReader(utf8_data, **kwargs)
    for row in csv_reader:
        yield {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}

Note: This has been updated so that keys are decoded, per the suggestion in the comments.
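For readers on Python 3, where the csv module works on text (str) rather than byte strings, the same generator idea might look like the sketch below. It assumes the input is a binary file-like object; unicode_dict_reader and its encoding parameter are illustrative names, not part of the csv module. The stream is decoded incrementally, so a large file is not loaded into memory at once.

```python
import csv
import io

def unicode_dict_reader(byte_stream, encoding="utf-8", **kwargs):
    """Yield each CSV row as a dict of str, reading from a binary stream."""
    # newline="" leaves line endings untouched so the csv module can
    # handle newlines embedded in quoted fields itself.
    text_stream = io.TextIOWrapper(byte_stream, encoding=encoding, newline="")
    for row in csv.DictReader(text_stream, **kwargs):
        yield row

# Made-up sample data, encoded as UTF-8 bytes.
data = io.BytesIO("id,name\n1,Zoë\n2,Łukasz\n".encode("utf-8"))
rows = list(unicode_dict_reader(data))
print(rows[1]["name"])
```

Extra keyword arguments (for example delimiter or fieldnames) are passed straight through to csv.DictReader, as in the Python 2 version above.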

edited Jun 14 '17 at 22:32

Alan W. Smith

answered Feb 15 '11 at 15:24

LMatter

choose it as the answer then :) – Uku Loskit Feb 15 '11 at 15:29
ok, didn't know I can do that :) I will wait some time to see if someone knows a better way, then accept it. – LMatter Feb 15 '11 at 15:47
-1 This doesn't decode the dictionary keys in the first row of the file. – John Machin Mar 30 '11 at 09:06
You can actually save a few characters by removing the list comprehension brackets inside the dict constructor. That makes it a generator, and inside function calls w/ one argument the parentheses of generators are optional. :-P – Josh Tauberer May 26 '12 at 01:04
Be aware that it doesn't handle csv.DictReader restkey flag functionality – Ivan Klass Dec 17 '13 at 11:54
No need to apologize for answering your own question. That's one of the intended uses of stackoverflow. Now everyone else can share what you taught yourself! – dinosaur Jun 01 '16 at 22:19
As John Machin mentioned, this will not decode the keys; the yield line should be: yield {unicode(key, 'utf-8'):unicode(value, 'utf-8') for key, value in row.iteritems()} – Giacomo Jul 04 '16 at 19:35

score 38 · Answer 2 · answered Mar 10 '19 at 20:29

For me, the key was not in manipulating the csv DictReader args, but the file opener itself. This did the trick:

with open(filepath, mode="r", encoding="utf-8-sig") as csv_file:
    csv_reader = csv.DictReader(csv_file)

No special class required. Now I can open files either with or without BOM without crashing.
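To make the BOM point concrete, here is a small Python 3 sketch. It assumes a throwaway temporary file with made-up contents, and read_dicts is just a helper name chosen for the illustration. It reads the same file once with "utf-8" and once with "utf-8-sig" and compares the first column name:

```python
import csv
import os
import tempfile

def read_dicts(path, encoding):
    """Read a whole CSV file into a list of dicts using the given encoding."""
    with open(path, mode="r", encoding=encoding, newline="") as csv_file:
        return list(csv.DictReader(csv_file))

# Write sample bytes that start with a UTF-8 BOM (EF BB BF).
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(b"\xef\xbb\xbfname,city\r\nZo\xc3\xab,K\xc3\xb6ln\r\n")

try:
    plain = read_dicts(path, "utf-8")      # BOM stays glued to the first header
    sig = read_dicts(path, "utf-8-sig")    # BOM is stripped if present
finally:
    os.remove(path)

print("name" in plain[0])  # False: the key is '\ufeffname'
print("name" in sig[0])    # True
```

With plain "utf-8" nothing crashes, but the first key silently becomes '\ufeffname', so lookups by "name" fail. "utf-8-sig" also reads files that have no BOM, which is why it is a reasonable default when the origin of the file is unknown.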

answered Mar 10 '19 at 20:29

shacker

TypeError: 'encoding' is an invalid keyword argument for this function – ATX Sep 05 '20 at 18:26
@ATX Odd - I wonder if you were on python2 rather than 3? – shacker May 05 '21 at 21:29
Yes indeed it was p2 – ATX Jul 29 '21 at 09:38

score 1 · Answer 3 · answered Jan 22 '18 at 13:29

A class-based approach to @LMatter's answer. With this approach you still get all the benefits of DictReader, such as access to the fieldnames and the line number, plus it handles UTF-8.

import csv

class UnicodeDictReader(csv.DictReader, object):

    def next(self):
        row = super(UnicodeDictReader, self).next()
        return {unicode(key, 'utf-8'): unicode(value, 'utf-8') for key, value in row.iteritems()}
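One caveat, related to the restkey remark in the comments on the accepted answer: when a row has more fields than the header, DictReader stores the extras as a list under restkey (None by default), and when a row is short, missing values are filled with restval (also None by default). unicode(value, 'utf-8') accepts neither a list nor None. A possible helper that tolerates both is sketched below; decode_cell and decode_row are hypothetical names, and the code is written so it runs on Python 3 (where bytes plays the role of the Python 2 str).

```python
def decode_cell(value, encoding="utf-8"):
    """Decode byte strings, recurse into lists, and pass anything else through."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    if isinstance(value, list):
        return [decode_cell(item, encoding) for item in value]
    return value  # e.g. None from restkey/restval, or already-decoded text

def decode_row(row, encoding="utf-8"):
    """Decode both the keys and the values of one DictReader row."""
    return {decode_cell(key, encoding): decode_cell(value, encoding)
            for key, value in row.items()}

# A made-up row shaped like DictReader output for a ragged file:
# one missing value (None) and surplus fields stored under the None key.
row = {b"name": b"Zo\xc3\xab", b"city": None, None: [b"extra", b"K\xc3\xb6ln"]}
decoded = decode_row(row)
print(decoded)
```

Inside next(), the return line could then call decode_row(row) instead of building the dict inline.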

score 1 · Answer 4 · edited May 23 '17 at 12:25

First of all, use the 2.6 version of the documentation. It can change for each release. It says clearly that it doesn't support Unicode but it does support UTF-8. Technically, these are not the same thing. As the docs say:

The csv module doesn’t directly support reading and writing Unicode, but it is 8-bit-clean save for some problems with ASCII NUL characters. So you can write functions or classes that handle the encoding and decoding for you as long as you avoid encodings like UTF-16 that use NULs. UTF-8 is recommended.
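As a quick illustration of the NUL issue the quoted passage mentions, here is a small sketch (written for Python 3; the sample string is made up) comparing the bytes that UTF-8 and UTF-16 produce for the same ASCII text:

```python
text = "a,b"

utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")

# UTF-8 encodes ASCII characters as single non-zero bytes.
print(b"\x00" in utf8_bytes)   # False
# UTF-16 pads every ASCII character with a zero byte.
print(b"\x00" in utf16_bytes)  # True
print(utf16_bytes)             # b'a\x00,\x00b\x00'
```

Those zero bytes are what trips up a byte-oriented csv module, which is why the docs steer you toward UTF-8.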

The example below (adapted from the docs) shows how to create two functions that correctly read UTF-8 text as CSV. Note that csv.reader() returns a plain reader object that yields lists; to get rows keyed by column name, the example calls csv.DictReader instead.

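import csv

def utf_8_encoder(unicode_csv_data):
    # Encode each unicode line as a UTF-8 byte string for the csv module.
    for line in unicode_csv_data:
        yield line.encode('utf-8')

def unicode_csv_reader(unicode_csv_data, dialect=csv.excel, **kwargs):
    # csv.py doesn't do Unicode; encode temporarily as UTF-8:
    csv_reader = csv.DictReader(utf_8_encoder(unicode_csv_data),
                                dialect=dialect, **kwargs)
    for row in csv_reader:
        # decode UTF-8 back to Unicode, key by key and cell by cell:
        yield dict((unicode(key, 'utf-8'), unicode(cell, 'utf-8'))
                   for key, cell in row.iteritems())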

edited May 23 '17 at 12:25

Community

answered Feb 15 '11 at 14:28

kelloti

`csv.reader()` does not return a DictReader object when I test it. Are you sure about that? Also the yield statement in your example just returns a list with the values only and not a dict. – LMatter Feb 15 '11 at 15:05
I guess you're right about the DictReader. I changed the example to invoke `csv.DictReader` instead of `csv.reader`. Note that other than this difference, this is directly out of the documentation. – kelloti Feb 15 '11 at 15:13
I think your reader still does not return a dict but just a list of the row values (see the yield statement). But thanks for your answer, anyway, after re-reading the documentation that you mentioned, I came up with a solution myself (finally :)) – LMatter Feb 15 '11 at 15:26
@LMatter do you mind sharing that solution you found? – Matthias Dec 11 '15 at 16:20

score 1 · Answer 5 · answered May 28 '19 at 17:43

That's easy with the unicodecsv package.

# pip install unicodecsv
import unicodecsv as csv

with open('your_file.csv', 'rb') as csvfile:  # unicodecsv expects a byte stream
    reader = csv.DictReader(csvfile)
    for row in reader:
        print(row)

answered May 28 '19 at 17:43

mrvol

score 0 · Answer 6 · answered Jun 21 '18 at 15:39

The csvw package has other functionality as well (for metadata-enriched CSV for the Web), but it defines a UnicodeDictReader class wrapping around its UnicodeReader class, which at its core does exactly that:

class UnicodeReader(Iterator):
    """Read Unicode data from a csv file."""
    […]

    def _next_row(self):
        self.lineno += 1
        return [
            s if isinstance(s, text_type) else s.decode(self._reader_encoding)
            for s in next(self.reader)]

It caught me off guard a few times, but csvw.UnicodeDictReader really does need to be used in a with block; it breaks otherwise. Other than that, the module is nicely generic and compatible with both py2 and py3.

score 0 · Answer 7 · answered Aug 23 '18 at 07:26

The answer doesn't have the DictWriter methods, so here is the updated class:

import codecs
import csv
import cStringIO

class DictUnicodeWriter(object):

    def __init__(self, f, fieldnames, dialect=csv.excel, encoding="utf-8", **kwds):
        self.fieldnames = fieldnames    # list of keys for the dict
        # Redirect output to a queue
        self.queue = cStringIO.StringIO()
        self.writer = csv.DictWriter(self.queue, fieldnames, dialect=dialect, **kwds)
        self.stream = f
        self.encoder = codecs.getincrementalencoder(encoding)()

    def writerow(self, row):
        self.writer.writerow({k: v.encode("utf-8") for k, v in row.items()})
        # Fetch UTF-8 output from the queue ...
        data = self.queue.getvalue()
        data = data.decode("utf-8")
        # ... and reencode it into the target encoding
        data = self.encoder.encode(data)
        # write to the target stream
        self.stream.write(data)
        # empty queue
        self.queue.truncate(0)

    def writerows(self, rows):
        for row in rows:
            self.writerow(row)

    def writeheader(self):
        header = dict(zip(self.fieldnames, self.fieldnames))
        self.writerow(header)
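The class above targets Python 2. On Python 3 the queue-and-reencode step should not be needed, because open() can do the encoding itself. A minimal sketch, assuming a temporary file and made-up rows (write_dicts is a helper name invented for this example):

```python
import csv
import os
import tempfile

def write_dicts(path, fieldnames, rows, encoding="utf-8"):
    """Write dict rows to a CSV file in the given encoding, header first."""
    # newline="" stops the text layer from translating the "\r\n"
    # line endings that the csv module writes by default.
    with open(path, "w", encoding=encoding, newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)

fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
try:
    write_dicts(path, ["name", "city"], [{"name": "Zoë", "city": "Köln"}])
    with open(path, "rb") as f:
        written = f.read()
finally:
    os.remove(path)

print(written.decode("utf-8"))
```

Passing a different encoding (for example "utf-8-sig", if the consumer expects a BOM) changes only the bytes on disk; the DictWriter calls stay the same.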
