
I am trying to read a CSV file with Python with the following code:

import csv

with open("example.txt") as f:
    c = csv.reader(f)
    for row in c:
        print row

My example.txt has only the following content:

Hello world!

For UTF-8 or ANSI encoded files, this gives me the expected output:

> ["Hello world!"]

But if I save the file as UTF-8 with BOM I get this output:

> ["\xef\xbb\xbfHello world!"]

Since I do not have any control over what files the user will use as input, I would like this to work with BOM as well. How can I fix this problem? Is there anything I need to do to ensure that this works for other encodings as well?

Anders
  • NB: whatever solution you use, the important thing is to use `utf-8-sig` for decoding. – ekhumoro Nov 18 '15 at 16:44
  • `import csv,csvkit,codecs,unicodecsv with open("example.txt",'r') as f: c = csv.reader(f) for row in c: print [unicode(s, "utf-8") for s in row] with open("example.txt",'r') as f: c = unicodecsv.reader(f) for row in c: print row with open("example.txt",'r') as f: c = csvkit.reader(f) for row in c: print row` all print `[u'\ufeffHello world!']`, so I think it is not a **duplicate**. The first attempt uses http://stackoverflow.com/questions/17245415/read-and-write-csv-files-including-unicode-with-python-2-7 – Learner Nov 18 '15 at 17:04
  • @ekhumoro: The duplicate is borderline... The other question is about UTF-8 data, while this one is specifically about a BOM in a UTF-8 file. The other page mentions a BOM in only one answer, and only for UTF-16 files. Your comment does answer this question, but IMHO it would deserve to be an answer on a non-duplicate question :-) – Serge Ballesta Nov 18 '15 at 17:06
  • @SergeBallesta. Please read the question more carefully (esp. the last paragraph) - it's not only about the utf-8 signature. Also, the highest voted answer in the dup specifically uses `utf-8-sig`; but some of the other answers don't - which is why I added a comment here. – ekhumoro Nov 18 '15 at 17:20
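
To illustrate the `utf-8-sig` point made in the comments, here is a minimal sketch of how the two codecs treat the same bytes. The variable names are made up for this example, and the byte strings are built in memory rather than read from the asker's file.

```python
import codecs

# The same text as raw bytes, once with the UTF-8 BOM and once without.
raw_with_bom = codecs.BOM_UTF8 + "Hello world!".encode("utf-8")
raw_plain = "Hello world!".encode("utf-8")

# "utf-8-sig" drops a leading BOM if one is present and otherwise
# behaves like plain "utf-8".
with_sig = raw_with_bom.decode("utf-8-sig")
plain_sig = raw_plain.decode("utf-8-sig")

# Plain "utf-8" keeps the BOM as the character U+FEFF.
with_utf8 = raw_with_bom.decode("utf-8")

print(repr(with_sig))
print(repr(plain_sig))
print(repr(with_utf8))
```

Because `utf-8-sig` accepts input both with and without the signature, it is a reasonable default when the input may come from either kind of UTF-8 file.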

1 Answer


You could make use of the unicodecsv Python module as follows:

import unicodecsv

with open('input.csv', 'rb') as f_input:
    csv_reader = unicodecsv.reader(f_input, encoding='utf-8-sig')
    print list(csv_reader)

So for an input file containing the following in UTF-8 with BOM:

c1,c2,c3,c4,c5,c6,c7,c8
1,2,3,4,5,6,7,8

It would display the following:

[[u'c1', u'c2', u'c3', u'c4', u'c5', u'c6', u'c7', u'c8'], [u'1', u'2', u'3', u'4', u'5', u'6', u'7', u'8']]

The unicodecsv module can be installed using pip as follows:

pip install unicodecsv
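
On Python 3 the standard library alone may be enough, since `open()` accepts an `encoding` argument and `csv.reader` works on text. The sketch below also tries to cover the "other encodings" part of the question by checking the first bytes for a known BOM. The helper names `guess_encoding` and `read_rows` are hypothetical, and falling back to plain UTF-8 when no BOM is found is an assumption: a file without a BOM (for example an ANSI file with non-ASCII characters) cannot be identified this way.

```python
import codecs
import csv

# Check the 4-byte UTF-32 BOMs before the 2-byte UTF-16 ones, because the
# UTF-32 LE BOM starts with the same bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding(path, default="utf-8"):
    """Return a codec name based on the file's BOM, or `default` if none."""
    with open(path, "rb") as f:
        head = f.read(4)
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default


def read_rows(path):
    """Read all CSV rows as lists of str, without a leading BOM."""
    # newline="" is what the csv module documentation recommends for files
    # passed to csv.reader.
    with open(path, encoding=guess_encoding(path), newline="") as f:
        return list(csv.reader(f))
```

With this sketch, `read_rows("example.txt")` would be the Python 3 counterpart of the loop in the question. The `utf-16` and `utf-32` codecs use the BOM to pick the byte order and then discard it, so only UTF-8 needs the special `-sig` variant.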
Martin Evans