
I'm having some issues with a Python script that needs to open files with different encodings.

I usually use this:

with open(path_to_file, 'r') as f:
    first_line = f.readline()

That works great when the file is properly encoded.

But sometimes it doesn't work. For example, with this file I get the following:

In [22]: with codecs.open(filename, 'r') as f:
    ...:    a = f.readline()
    ...:    print(a)
    ...:    print(repr(a))
    ...:     
��Test for StackOverlow

'\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n'

I would like to search for some substrings in those lines. Sadly, with that method I can't:

In [24]: "Test" in a
Out[24]: False
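
Looking at the repr, the text is actually in there, just interleaved with NUL bytes, which is why the plain ASCII search fails; searching for the interleaved form does succeed:

>>> "T\x00e\x00s\x00t" in a  # "Test" encoded as UTF-16-LE
True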

I've found a lot of questions here referring to the same type of issue:

  1. Unicode (UTF-8) reading and writing to files in Python
  2. UnicodeDecodeError: 'utf8' codec can't decode byte 0xa5 in position 0: invalid start byte
  3. https://softwareengineering.stackexchange.com/questions/187169/how-to-detect-the-encoding-of-a-file
  4. how can i escape '\xff\xfe' to a readable string

But I can't manage to decode the file properly with any of them...

With codecs.open():

In [17]: with codecs.open(filename, 'r', "utf-8") as f:
    ...:    a = f.readline()
    ...:    print(a)
    ...:
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-17-0e72208eaac2> in <module>()
      1 with codecs.open(filename, 'r', "utf-8") as f:
----> 2     a = f.readline()
      3     print(a)
      4 

/usr/lib/python2.7/codecs.pyc in readline(self, size)
    688     def readline(self, size=None):
    689 
--> 690         return self.reader.readline(size)
    691 
    692     def readlines(self, sizehint=None):

/usr/lib/python2.7/codecs.pyc in readline(self, size, keepends)
    543         # If size is given, we call read() only once
    544         while True:
--> 545             data = self.read(readsize, firstline=True)
    546             if data:
    547                 # If we're at a "\r" read one extra character (which might

/usr/lib/python2.7/codecs.pyc in read(self, size, chars, firstline)
    490             data = self.bytebuffer + newdata
    491             try:
--> 492                 newchars, decodedbytes = self.decode(data, self.errors)
    493             except UnicodeDecodeError, exc:
    494                 if firstline:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte

With encode('utf-8'):

In [18]: with codecs.open(filename, 'r') as f:
    ...:    a = f.readline()
    ...:    print(a)
    ...:    a.encode('utf-8')
    ...:    print(a)
    ...:
��Test for StackOverlow

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-18-7facc05b9cb1> in <module>()
      2     a = f.readline()
      3     print(a)
----> 4     a.encode('utf-8')
      5     print(a)
      6 

UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)
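
As I understand it, in Python 2 calling .encode() on a byte string first decodes it implicitly with the ASCII codec, so a.encode('utf-8') behaves roughly like the line below, and it's the implicit decode step that fails on byte 0xff:

a.decode('ascii').encode('utf-8')  # implicit ASCII decode, then the requested encode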

I've found a way to change a file's encoding automatically with Vim:

import os

os.system("vim '+set fileencoding=utf-8' '+wq' %s" % path_to_file)

But I would like to do this without using Vim...

Any help will be appreciated.

  • there are some ways to try and autodetect encoding... see https://pypi.python.org/pypi/chardet ... but it's risky ... chances are if it's not "utf8" then it's "latin1" – Joran Beasley Sep 13 '16 at 17:10
  • No, this is definitely UTF-16 little-endian. A UTF-8 BOM would be \xEF\xBB\xBF. This file's BOM and conspicuous pattern of null bytes indicate UTF-16 little-endian. – user2357112 Sep 13 '16 at 17:13
  • @JoranBeasley, thanks for your help, but I'm getting `UnicodeDecodeError: 'utf8' codec can't decode byte 0xff in position 0: invalid start byte` with `a.decode("utf-8-sig")` – Xavier C. Sep 13 '16 at 17:17
  • yeah i was wrong ... see @user2357112's comment – Joran Beasley Sep 13 '16 at 17:18
  • @user2357112 with both `a.decode("utf-16-le")` and `a.decode("utf-16")` I'm getting: `UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 46: truncated data` – Xavier C. Sep 13 '16 at 17:21
  • @XavierC.: Yeah, it looks like this file got corrupted somehow. It's missing a trailing null byte. – user2357112 Sep 13 '16 at 17:25
  • @user2357112 but then how does every text editor I've tried always succeed in opening it? – Xavier C. Sep 13 '16 at 17:28
  • @XavierC.: Most likely, the text editor is doing its best to show you something reasonable instead of just spitting out an error message. – user2357112 Sep 13 '16 at 17:41
  • Oh, wait, you tried to `decode` the line instead of specifying a codec in `codecs.open`. That's why it's complaining about truncated data. You need to specify the codec when you open the file. – user2357112 Sep 13 '16 at 18:28
  • @user2357112 exactly... with `with codecs.open(filename2, 'r', 'utf-16-le') as f:` it's perfect! – Xavier C. Sep 13 '16 at 18:32
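
Update from the comments: specifying the codec when opening the file solves it. A minimal sketch, assuming the file really is UTF-16-LE as diagnosed above (filename is the same variable as in the question):

import codecs

# Let the stream reader decode as it reads, instead of decoding single lines:
with codecs.open(filename, 'r', 'utf-16-le') as f:
    first_line = f.readline()    # a unicode string; the BOM survives as u'\ufeff'

print("Test" in first_line)      # True

# The plain 'utf-16' codec works too, and consumes the BOM automatically:
with codecs.open(filename, 'r', 'utf-16') as f:
    print(f.readline())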

2 Answers


It looks like this is utf-16-le (UTF-16 little-endian), but you are missing a final \x00.

>>> s = '\xff\xfeT\x00e\x00s\x00t\x00 \x00f\x00o\x00r\x00 \x00S\x00t\x00a\x00c\x00k\x00O\x00v\x00e\x00r\x00l\x00o\x00w\x00\r\x00\n'
>>> s.decode('utf-16-le') # creates error
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python26\lib\encodings\utf_16_le.py", line 16, in decode
    return codecs.utf_16_le_decode(input, errors, True)
UnicodeDecodeError: 'utf16' codec can't decode byte 0x0a in position 46: truncated data
>>> (s + "\x00").decode("utf-16-le") # TADA!!!!
u'\ufeffTest for StackOverlow\r\n'
Joran Beasley

It looks like you need to detect the encoding in the input file. The chardet library mentioned in the answer to this question might help (though note the proviso that complete encoding detection is not possible).
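
For illustration, a minimal detection sketch (chardet.detect() returns a dict with an encoding guess and a confidence score; the 0.5 cut-off below is an arbitrary choice):

import chardet

# Read the raw bytes without decoding them.
with open(path_to_file, 'rb') as f:
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'UTF-16', 'confidence': 1.0}
if guess['encoding'] and guess['confidence'] > 0.5:
    text = raw.decode(guess['encoding'])
else:
    raise ValueError("could not reliably detect the encoding")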

Then you can write the file out in a known encoding, perhaps. When dealing with Unicode, remember that text MUST be encoded into a suitable bytestream before being communicated outside the process: decode on input, then encode on output.
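
A sketch of that round trip, re-encoding the file to UTF-8 without Vim (io.open is in the standard library from Python 2.6 on; the 'utf-16' source encoding and the output filename are assumptions for illustration):

import io

# Decode on input...
with io.open(path_to_file, 'r', encoding='utf-16') as src:
    text = src.read()  # unicode from here on

# ...then encode on output.
with io.open(path_to_file + '.utf8', 'w', encoding='utf-8') as dst:
    dst.write(text)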

holdenweb
  • actually addressing the OP's stated question +1 – Joran Beasley Sep 13 '16 at 17:26
  • @holdenweb, I agree with JoranBeasley when he says that you answered my real question, how to find the encoding, but since he solved my issue with the example input file, I will accept his answer. In any case, thanks for your help. – Xavier C. Sep 13 '16 at 18:05
  • Sure, but JoranBeasley gave you the UTF-16 you needed for your specific data set. – holdenweb Sep 14 '16 at 08:28