Reading UTF-8 with BOM using Python CSV module causes unwanted extra characters

Question

I am trying to read a CSV file with Python with the following code:

import csv

with open("example.txt") as f:
    c = csv.reader(f)
    for row in c:
        print row

My example.txt has only the following content:

Hello world!

For UTF-8 or ANSI encoded files, this gives me the expected output:

> ["Hello world!"]

But if I save the file as UTF-8 with BOM I get this output:

> ["\xef\xbb\xbfHello world!"]

Since I do not have any control over what files the user will use as input, I would like this to work with BOM as well. How can I fix this problem? Is there anything I need to do to ensure that this works for other encodings as well?

NB: whatever solution you use, the important thing is to use `utf-8-sig` for decoding. — ekhumoro, Nov 18 '15 at 16:44
`import csv,csvkit,codecs,unicodecsv with open("example.txt",'r') as f: c = csv.reader(f) for row in c: print [unicode(s, "utf-8") for s in row] with open("example.txt",'r') as f: c = unicodecsv.reader(f) for row in c: print row with open("example.txt",'r') as f: c = csvkit.reader(f) for row in c: print row` all print `[u'\ufeffHello world!']`, so I think it is not a **duplicate**. The first attempt uses http://stackoverflow.com/questions/17245415/read-and-write-csv-files-including-unicode-with-python-2-7 — Learner, Nov 18 '15 at 17:04
@ekhumoro: The duplicate is borderline... The other question is about UTF-8 data, while this one is specifically about the BOM in a UTF-8 file. The other page only mentions the BOM (in a single answer) for UTF-16 files. Your comment does answer this question, but IMHO it would deserve to be an answer on a non-duplicate question :-) — Serge Ballesta, Nov 18 '15 at 17:06
@SergeBallesta. Please read the question more carefully (esp. the last paragraph) - it's not only about the utf-8 signature. Also, the highest voted answer in the dup specifically uses `utf-8-sig`; but some of the other answers don't - which is why I added a comment here. — ekhumoro, Nov 18 '15 at 17:20

Martin Evans · Answer 1 · 2019-09-07T08:24:28.353

6

You could make use of the unicodecsv Python module as follows:

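import unicodecsv

# Open in binary mode; unicodecsv does the decoding itself.
with open('input.csv', 'rb') as f_input:
    # 'utf-8-sig' strips a leading BOM if one is present
    csv_reader = unicodecsv.reader(f_input, encoding='utf-8-sig')
    print list(csv_reader)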

So for an input file containing the following in UTF-8 with BOM:

c1,c2,c3,c4,c5,c6,c7,c8
1,2,3,4,5,6,7,8

It would display the following:

[[u'c1', u'c2', u'c3', u'c4', u'c5', u'c6', u'c7', u'c8'], [u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8']]
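On Python 3 the same idea should work without a third-party module, because the built-in `csv` module reads text there and `utf-8-sig` is a standard codec. What follows is only a minimal sketch, assuming Python 3; the helper name `read_csv_rows` and the temporary sample file are made up for illustration and are not part of the original answer:

```python
import csv
import os
import tempfile


def read_csv_rows(path):
    """Return all rows of a CSV file, dropping a leading UTF-8 BOM if present."""
    # 'utf-8-sig' also decodes plain UTF-8; it only skips the BOM when one exists.
    # newline='' is what the csv module documentation recommends for file objects.
    with open(path, encoding="utf-8-sig", newline="") as f_input:
        return list(csv.reader(f_input))


# Write a small sample file that starts with a UTF-8 BOM (bytes EF BB BF).
with tempfile.NamedTemporaryFile(delete=False, suffix=".csv") as tmp:
    tmp.write(b"\xef\xbb\xbfc1,c2,c3\r\n1,2,3\r\n")
    sample_path = tmp.name

rows = read_csv_rows(sample_path)
os.remove(sample_path)
print(rows)
```

Because `utf-8-sig` accepts input with or without the signature, the same call can be used for both kinds of UTF-8 files.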

The unicodecsv module can be installed using pip as follows:

pip install unicodecsv
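As for the question about other encodings: one possible approach is to look at the first bytes of the file and pick a codec from the BOM, if there is one. The sketch below assumes Python 3 and uses the BOM constants from the standard `codecs` module; the function name `guess_encoding` and the fallback to `utf-8` are assumptions made for this example. A file without a BOM (for instance an "ANSI" file) cannot be identified this way, so the default is only a guess.

```python
import codecs

# BOMs to look for. The UTF-32 entries come first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(raw, default="utf-8"):
    """Pick a codec name from a leading BOM in raw bytes (hypothetical helper)."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            # These codec names consume the BOM while decoding.
            return name
    # No BOM found: fall back to the caller-supplied default (an assumption).
    return default


raw = b"\xef\xbb\xbfHello world!"
text = raw.decode(guess_encoding(raw))
print(repr(text))
```

The returned name can then be passed as the `encoding` argument when the file is opened for the CSV reader.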

edited Sep 07 '19 at 08:24

answered Nov 18 '15 at 16:48

Martin Evans


But what about `\ufeff`? Isn't it useless? – Learner Nov 18 '15 at 17:12

Indeed, I'd put the wrong encoding in; as stated, `utf-8-sig` should be used. – Martin Evans Nov 18 '15 at 17:25
