
io.open is supposed to be stripping preambles when opening files in various encodings.

For instance, the following file encoded with UTF-8-SIG has the preamble stripped correctly before reading it into a string:

(Note: I'm not opening these files in binary mode. The first line of these logs is to demonstrate the contents of the files that are about to be read.)

# Raw binary, so you can see that it's a proper UTF-8-SIG encoded file
import io; io.open(csv_file_path, 'br').readline()
'\xef\xbb\xbf"EventId","Rate","Attribute1","Attribute2","(\xef\xbd\xa1\xef\xbd\xa5\xcf\x89\xef\xbd\xa5\xef\xbd\xa1)\xef\xbe\x89"\r\n'

# Open file with encoding specified
import io; io.open(csv_file_path, encoding='UTF-8-SIG').readline()
u'"EventId","Rate","Attribute1","Attribute2","(\uff61\uff65\u03c9\uff65\uff61)\uff89"\n'

But this UTF-16LE encoded file, although it opens successfully, still has its preamble at the start of the first line:

# Raw binary, so you can see that it's a proper UTF-16LE encoded file
import io; io.open(csv_file_path, 'br').readline()
'\xff\xfe"\x00E\x00v\x00e\x00n\x00t\x00I\x00d\x00"\x00,\x00"\x00R\x00a\x00t\x00e\x00"\x00,\x00"\x00A\x00t\x00t\x00r\x00i\x00b\x00u\x00t\x00e\x001\x00"\x00,\x00"\x00A\x00t\x00t\x00r\x00i\x00b\x00u\x00t\x00e\x002\x00"\x00,\x00"\x00(\x00a\xffe\xff\xc9\x03e\xffa\xff)\x00\x89\xff"\x00\r\x00\n'

# Open file with encoding specified
import io; io.open(csv_file_path, encoding='UTF-16LE').readline()
u'\ufeff"EventId","Rate","Attribute1","Attribute2","(\uff61\uff65\u03c9\uff65\uff61)\uff89"\n'

This goes on to break file validation that expects the file contents to start right off with "EventId"...

Am I opening this file incorrectly?

Note that I'm not satisfied having to manually strip out preambles after opening the file - I want to support arbitrary encodings and I expect io.open (with the correct encoding supplied, as determined by chardet) to abstract away the need for me to have a bunch of hard coded preambles to skip if encountered at the beginning of the first line.

Alain
  • "*io.open is supposed to be stripping preambles when opening files in various encodings.*" - What makes you believe this? I cannot find any statement in the documentation to support this. – Robᵩ Oct 17 '14 at 23:17

1 Answer


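According to this answer, you need to use UTF-16, not UTF-16LE. The UTF-16 codec reads the BOM to determine the byte order and then discards it, whereas UTF-16LE and UTF-16BE assume a fixed byte order and decode the BOM as an ordinary U+FEFF character.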

import io; io.open(csv_file_path, encoding='UTF-16').readline()
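To see the difference without a file on disk, here is a minimal sketch that decodes the same in-memory bytes with both codecs. The sample header text and the variable names are made up for illustration; only `codecs.BOM_UTF16_LE` and the standard codec names come from Python itself.

```python
import codecs

# Hypothetical sample data: a little-endian UTF-16 payload preceded by a BOM.
text = u'"EventId","Rate"\r\n'
raw = codecs.BOM_UTF16_LE + text.encode('utf-16-le')

# 'utf-16-le' assumes the byte order, so the BOM is decoded as U+FEFF.
kept = raw.decode('utf-16-le')

# 'utf-16' uses the BOM to pick the byte order and drops it from the result.
stripped = raw.decode('utf-16')

print(kept.startswith(u'\ufeff'))  # True
print(stripped == text)            # True
```

The same applies when the codec name is passed to `io.open` through `encoding=`, since the text layer decodes with the named codec.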
Robᵩ
  • Hmm, this is a problem for me because I don't even look at the encoding. I just take whatever [chardet](https://pypi.python.org/pypi/chardet) spits out and use that to decode the file. Chardet is spitting out `UTF-16LE` (and `UTF-16BE` for another one that's failing similarly) – Alain Oct 18 '14 at 00:41
  • But hey, this is fine. I just added `if 'UTF-16' in detected: detected = 'UTF-16'` and that solved both cases, so I don't mind throwing that dirty little hack in there. Thanks! – Alain Oct 18 '14 at 00:55
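A slightly more cautious variant of the hack in the comment above could switch to a BOM-aware codec only when the file really starts with the matching BOM. This is a sketch, not part of the original answer: `bom_aware_encoding` and `_BOM_AWARE` are hypothetical names, and the detected encoding is assumed to arrive as a plain string such as `'UTF-16LE'`. The check matters because `UTF-16` falls back to the platform's native byte order when no BOM is present, which may not match what was detected.

```python
import codecs

# Fixed-order or BOM-blind codec -> (expected BOM, codec that consumes the BOM).
_BOM_AWARE = {
    'utf-16-le': (codecs.BOM_UTF16_LE, 'utf-16'),
    'utf-16-be': (codecs.BOM_UTF16_BE, 'utf-16'),
    'utf-32-le': (codecs.BOM_UTF32_LE, 'utf-32'),
    'utf-32-be': (codecs.BOM_UTF32_BE, 'utf-32'),
    'utf-8': (codecs.BOM_UTF8, 'utf-8-sig'),
}


def bom_aware_encoding(detected, head):
    """Return a codec name that strips the BOM found in `head`, if any.

    `detected` is the encoding name reported by the detector (assumed input);
    `head` is the first few raw bytes of the file.
    """
    # codecs.lookup normalizes spellings, e.g. 'UTF-16LE' -> 'utf-16-le'.
    normalized = codecs.lookup(detected).name
    bom, aware = _BOM_AWARE.get(normalized, (None, None))
    if bom is not None and head.startswith(bom):
        return aware
    # No matching BOM: keep the detected encoding unchanged.
    return detected


# Usage with hypothetical in-memory data instead of a real file.
raw = codecs.BOM_UTF16_BE + u'"EventId"'.encode('utf-16-be')
encoding = bom_aware_encoding('UTF-16BE', raw[:4])
decoded = raw.decode(encoding)
```

With a real file, `head` could be read in binary mode first, and the returned name then passed to `io.open(csv_file_path, encoding=...)`.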