Issue with Python 3.4.2 and csv.DictReader parsing the first fieldname incorrectly

Question

I've got a simple CSV file I get via email and I tried parsing it.

test.csv containing only:

"Time Interval","SubId","Space Id","Space","Imps.","eCPM (€)","Profit"
"2015-11-15","bottomunit","59457","foo.com","9362","1.92","17.97"

The .py is simple enough (PY3.4.2):

import csv

with open('test.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)

and the output is "messed up":

{'Profit': '17.97', 'Imps.': '9362', '"Time Interval"': '2015-11-15', 'Space': 'foo.com', 'eCPM (€)': '1.92', 'Space Id': '59457', 'SubId': 'bottomunit'}

To be more specific, the Time Interval fieldname is for some reason not parsed correctly: it comes out as "Time Interval", with the double quotes still attached. Unfortunately this editor does not show it, but there are 3 additional characters in front of "Time Interval".

As character codes, the whole "Time Interval" key looks like this: 239, 187, 191, 34, 84, 105, 109, 101, 32, 73, 110, 116, 101, 114, 118, 97, 108, 34

I've already checked the CSV with Notepad++: nothing is visible in front of the first entry, and I have no idea why it is not parsed correctly.

I also tried:

replacing the first entry with "date"
removing all spaces
removing all special characters
adding delimiter=',' and quotechar='"' to the DictReader
tried Python 3.5.0

The problem persists: the parsed first entry is still preceded by the characters 239, 187, 191 and wrapped in double quotes.

I tried reproducing this on Python 3.4.3 (Gentoo Linux) and it worked correctly. Any chance you can try it with an updated Python interpreter? Perhaps it's a bug in that specific version. _[ninja edit]_ Or alternatively: are you _sure_ the CSV file doesn't contain those extra three bytes at the beginning? Check in a hex editor to be sure, in case the characters don't show up in Notepad++ for some reason. – David Z, Nov 18 '15 at 12:28
@DavidZ You're very right, viewed in a hex editor it does start with `EF,BB,BF` -.- The CSV is attached to the email compressed in a .ZIP and I use zipfile.ZipFile to extract it first. I suspected that this was causing the issue, but if I just unzip the file with 7zip it's also "malformed" with the 3 additional bytes. Guess I've got to contact the folks generating the CSV reports =) – rikaidekinai, Nov 18 '15 at 12:45

score 3 · Accepted Answer · answered Nov 18 '15 at 12:46

239, 187, 191 is 0xEF, 0xBB, 0xBF in hexadecimal, which is the UTF-8 encoding of the Byte Order Mark (BOM).
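
As a quick check, the byte values quoted in the question can be compared against the BOM constant in Python's standard `codecs` module. This is only a sketch: the variable names are made up for illustration, and the `cp1252` line assumes the file was opened on Windows with that default encoding, which the question does not state.

```python
import codecs

# The byte values listed in the question, copied as given.
header_bytes = bytes([239, 187, 191, 34, 84, 105, 109, 101, 32, 73, 110,
                      116, 101, 114, 118, 97, 108, 34])

# codecs.BOM_UTF8 is the three-byte sequence b'\xef\xbb\xbf'.
has_bom = header_bytes.startswith(codecs.BOM_UTF8)
print(has_bom)

# Plain 'utf-8' keeps the BOM as the character U+FEFF;
# 'utf-8-sig' removes it while decoding.
decoded_plain = header_bytes.decode('utf-8')
decoded_sig = header_bytes.decode('utf-8-sig')
print(repr(decoded_plain))
print(repr(decoded_sig))

# Assumption: with a single-byte default encoding such as cp1252, the BOM
# turns into three separate characters with code points 239, 187, 191.
decoded_cp1252 = header_bytes.decode('cp1252')
print([ord(c) for c in decoded_cp1252[:3]])
```

In both the `utf-8` and `cp1252` cases the field no longer begins with the quote character, which would explain why the csv module leaves the double quotes in the first fieldname.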

It can appear at the start of a text file, and almost all text editors know about it and assume that:

it should not be displayed in any way
the text that follows is UTF-8
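
Python's `open()` does not strip the BOM unless told to, but the standard `utf-8-sig` codec does. A minimal sketch of reading the file that way follows; `read_rows` and `read_rows_from_zip` are hypothetical helper names, and the ZIP variant assumes the archive member is the UTF-8 CSV described in the comments on the question.

```python
import csv
import io
import zipfile


def read_rows(path):
    """Read a CSV file into a list of dicts, skipping a leading UTF-8 BOM if present."""
    # 'utf-8-sig' decodes UTF-8 and drops the BOM when there is one;
    # newline='' is what the csv module documentation recommends.
    with open(path, encoding='utf-8-sig', newline='') as f:
        return list(csv.DictReader(f))


def read_rows_from_zip(zip_path, member):
    """Same idea for a CSV stored inside a ZIP archive (hypothetical helper)."""
    with zipfile.ZipFile(zip_path) as zf:
        with zf.open(member) as raw:
            # zf.open() yields bytes, so wrap it to decode as text.
            text = io.TextIOWrapper(raw, encoding='utf-8-sig', newline='')
            return list(csv.DictReader(text))
```

With either helper, the first key should come out as `Time Interval`, without the extra characters or the quotes. Files that have no BOM are read the same way, so the codec is safe to use even if the report generator is fixed later.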

Serge Ballesta

Thanks, that's what I've found out now. Looking at the bug tracker, it seems to be a long-standing issue in Python, and another Stack Overflow post seems to have the fix: http://stackoverflow.com/questions/20899939/removing-bom-from-gziped-csv-in-python – rikaidekinai Nov 18 '15 at 12:50
