
I have a CSV file saved with UTF-8 encoding.

It contains non-ASCII characters (umlauts).

I am reading the file using:

    csv.DictReader(<file>, delimiter=<delimiter>)

My questions are:

  1. In which encoding is the file being read?
  2. I noticed that in order to treat the strings as UTF-8, I need to call:

    str.decode('utf-8')

    Is there a better approach than reading the file in one encoding and then converting to another, i.e. UTF-8? (A sketch of my current workaround follows below.)
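
This is a minimal sketch of what I am doing now (the file name, delimiter, and name column are placeholders):

    import csv

    with open("myfile.csv", "rb") as f:
        for row in csv.DictReader(f, delimiter=";"):
            name = row["name"].decode("utf-8")  # decode every field by hand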

[Python version: 2.7]

Maoritzio
  • This answer solved my problem: https://stackoverflow.com/questions/5004687/python-csv-dictreader-with-utf-8-data – ThomasW Nov 29 '17 at 03:12

2 Answers


In Python 2.7, the csv module does not apply any decoding: it expects a file opened in binary mode and returns byte strings.
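
You can verify this by inspecting a row (a minimal sketch; myfile.csv is a placeholder):

    import csv

    with open("myfile.csv", "rb") as f:
        row = next(csv.DictReader(f, delimiter=","))
        print type(row.values()[0])  # <type 'str'>: plain bytes, not unicode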

Use unicodecsv (https://github.com/jdunck/python-unicodecsv) instead, which decodes on the fly.

Use it like:

    import unicodecsv

    with open("myfile.csv", 'rb') as my_file:
        r = unicodecsv.DictReader(my_file, encoding='utf-8')

r will then yield dicts of Unicode strings. It's important that the source file is opened in binary mode.
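
For example (a minimal sketch; the Name column is a placeholder):

    import unicodecsv

    with open("myfile.csv", "rb") as my_file:
        for row in unicodecsv.DictReader(my_file, encoding="utf-8"):
            print row["Name"]  # already a unicode object, e.g. u'M\xfcller'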

Alastair McCormack

How about making the file loading itself Unicode-aware?

You can make the csv module load Unicode text files, and even detect their encoding, with or without the use of a BOM marker.

A long time ago I wrote a simple library which overrides the default open() with one that is Unicode-aware.

If you do import tendo.unicode, you will change the way the csv library loads files too.

If your files do not have a BOM header, the library will assume UTF-8 instead of the old ASCII default. You can even specify another fallback encoding if you want.
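
A minimal sketch of that workflow, assuming tendo is installed and that importing tendo.unicode patches the built-in open() as described above (the file name is a placeholder):

    import tendo.unicode  # assumed to make open() decode UTF-8 transparently
    import csv

    f = open("myfile.csv")  # no manual decoding needed if the patch applies
    for row in csv.DictReader(f):
        print row
    f.close()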

sorin
  • 161,544
  • 178
  • 535
  • 806