
I've got a simple CSV file I get via email and I tried parsing it.

test.csv containing only:

"Time Interval","SubId","Space Id","Space","Imps.","eCPM (€)","Profit"
"2015-11-15","bottomunit","59457","foo.com","9362","1.92","17.97"

The script is simple enough (Python 3.4.2):

import csv

with open('test.csv') as f:
    reader = csv.DictReader(f)
    for row in reader:
        print(row)

and the output is "messed up":

{'Profit': '17.97', 'Imps.': '9362', '"Time Interval"': '2015-11-15', 'Space': 'foo.com', 'eCPM (€)': '1.92', 'Space Id': '59457', 'SubId': 'bottomunit'}

To be more specific, the Time Interval fieldname is for some reason not parsed as Time Interval but as "Time Interval", with the double quotes kept. Unfortunately this editor does not show it, but there are also 3 additional bytes in front of `"Time Interval"`.

As byte values, the whole `"Time Interval"` key looks like this: 239, 187, 191, 34, 84, 105, 109, 101, 32, 73, 110, 116, 101, 114, 118, 97, 108, 34

I've already checked the CSV with Notepad++: nothing is visible in front of the first entry, and I have no idea why it is not parsed correctly.

I also tried:

  • replacing the first entry with "date"
  • removing all spaces
  • removing all special characters
  • adding delimiter=',' and quotechar='"' to the DictReader
  • running it with Python 3.5.0

The problem persists: the parsed first entry is preceded by the bytes 239, 187, 191 and kept in double quotes.

rikaidekinai
  • I tried reproducing this on Python 3.4.3 (Gentoo Linux) and it worked correctly. Any chance you can try it with an updated Python interpreter? Perhaps it's a bug in that specific version. _[ninja edit]_ or alternatively: are you _sure_ the CSV file doesn't contain those extra three bytes at the beginning? Check in a hex editor to be sure, in case characters don't show up in Notepad++ for some reason. – David Z Nov 18 '15 at 12:28
  • @DavidZ You're very right: viewed in a hex editor, it very much starts with `EF,BB,BF` -.- The CSV is attached to the email compressed in a .ZIP, and I use zipfile.ZipFile to extract it first. I suspected that was causing the issue, but if I just unzip the file with 7zip it is "malformed" with the 3 additional bytes as well. Guess I've got to contact the folks generating the CSV reports =) – rikaidekinai Nov 18 '15 at 12:45

1 Answer


239, 187, 191 is 0xEF, 0xBB, 0xBF in hexadecimal, which is the UTF-8 encoding of the Byte Order Mark (BOM).
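That claim is easy to check. The following is only a small sketch using the standard `codecs` module; the variable name `bom` is chosen here for illustration.

```python
import codecs

# The three byte values reported in the question.
bom = bytes([239, 187, 191])

# The same bytes written in hexadecimal.
print([hex(b) for b in bom])

# They are exactly the UTF-8 BOM constant from the standard library...
print(bom == codecs.BOM_UTF8)

# ...and decode to the single code point U+FEFF.
print(bom.decode('utf-8') == '\ufeff')
```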

A BOM can appear at the start of a text file, and almost all text editors know about it and assume that:

  • it should not be displayed in any way
  • the following text should be UTF-8
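Python can be told to make the same assumption by opening the file with the `utf-8-sig` codec, which decodes UTF-8 and drops a leading BOM if there is one. The sketch below recreates the question's file in a temporary directory (the path and variable names are made up for the example) and reads it both ways. It also suggests why the key kept its double quotes: with the BOM in front, the quote is no longer the first character of the field, so the `csv` reader appears to treat it as ordinary data.

```python
import csv
import os
import tempfile

# Recreate the question's file, including the three leading BOM bytes.
header = '"Time Interval","SubId","Space Id","Space","Imps.","eCPM (€)","Profit"\n'
data = '"2015-11-15","bottomunit","59457","foo.com","9362","1.92","17.97"\n'

path = os.path.join(tempfile.mkdtemp(), 'test.csv')  # hypothetical location
with open(path, 'wb') as f:
    f.write(b'\xef\xbb\xbf' + (header + data).encode('utf-8'))

# Plain 'utf-8' keeps the BOM as U+FEFF in front of the first field name,
# and the quotes around that name stay part of the key.
with open(path, encoding='utf-8', newline='') as f:
    raw_fields = csv.DictReader(f).fieldnames
print(repr(raw_fields[0]))

# 'utf-8-sig' strips the BOM, so the header parses as expected.
with open(path, encoding='utf-8-sig', newline='') as f:
    rows = list(csv.DictReader(f))
print(rows[0]['Time Interval'])
```

Since `utf-8-sig` also reads files that have no BOM, it should be safe to use even if the report generator is fixed later.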
Serge Ballesta
  • Thanks, that's what I found out now. Looking in the bug tracker, it seems to be a long-standing issue in Python, and another Stack Overflow post seems to have the fix: http://stackoverflow.com/questions/20899939/removing-bom-from-gziped-csv-in-python – rikaidekinai Nov 18 '15 at 12:50