encode 'UCS-2 Little Endian' file to 'utf8' using python error

Question

I'm trying to convert a file from UCS-2 Little Endian to UTF-8 using Python, and I'm getting a weird error.

The code I'm using:

file=open("C:/AAS01.txt", 'r', encoding='utf8')
lines = file.readlines()
file.close()

And I'm getting the following error:

Traceback (most recent call last):
  File "C:/Users/PycharmProjects/test.py", line 18, in <module>
    main()
  File "C:/Users/PycharmProjects/test.py", line 7, in main
    lines = file.readlines()
  File "C:\Python34\lib\codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I tried using the codecs module, but that didn't work either. Any idea what I can do?

score 5 · Answer 1 · answered Jul 29 '17 at 20:44

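The encoding argument to open sets the input encoding, i.e. the encoding the file is already stored in, not the one you want to convert to. Use encoding='utf_16_le'. If the file starts with a byte order mark (the 0xff at position 0 suggests it does), encoding='utf_16' also works and strips the mark, whereas 'utf_16_le' leaves it as '\ufeff' at the start of the first line.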
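To make the full round trip concrete, here is a minimal sketch that reads a UTF-16 file and writes it back out as UTF-8. The helper name utf16_file_to_utf8 and the temporary sample file are made up for illustration, and the sketch assumes the input begins with a byte order mark, so 'utf-16' can pick the byte order itself.

```python
import codecs
import os
import tempfile


def utf16_file_to_utf8(src_path, dst_path):
    """Read src_path as UTF-16 and write the same text to dst_path as UTF-8."""
    # 'utf-16' uses the BOM to choose the byte order and drops it from the text.
    # newline="" keeps the original line endings untouched in both directions.
    with open(src_path, "r", encoding="utf-16", newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)
    return text


# Build a small sample file (hypothetical content) that starts with the LE BOM.
tmp_dir = tempfile.mkdtemp()
src_path = os.path.join(tmp_dir, "sample_utf16.txt")
dst_path = os.path.join(tmp_dir, "sample_utf8.txt")
with open(src_path, "wb") as f:
    f.write(codecs.BOM_UTF16_LE + "héllo\r\nwörld\r\n".encode("utf-16-le"))

text = utf16_file_to_utf8(src_path, dst_path)
with open(dst_path, "rb") as f:
    utf8_bytes = f.read()
```

For very large files you would probably want to copy in chunks rather than calling read() once, but the encoding arguments stay the same.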

Phil Krylov

score 4 · Accepted Answer · answered Jul 29 '17 at 20:42

If you're trying to read UCS-2, why are you telling Python it's UTF-8? The 0xff is most likely the first byte of a little endian byte order marker:

>>> import codecs
>>> codecs.BOM_UTF16_LE
b'\xff\xfe'
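If you are not sure what a file contains, you can look at its first few bytes and compare them against the BOM constants in codecs. The following is only a rough sketch: the function name guess_encoding_from_bom and the fallback default are assumptions for illustration, and a file without a BOM cannot be identified this way.

```python
import codecs

# Check the longer BOMs first: BOM_UTF32_LE starts with the same two bytes
# as BOM_UTF16_LE, so the order of this list matters.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def guess_encoding_from_bom(head, default="utf-8"):
    """Return a codec name based on the BOM at the start of head (bytes)."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return default  # no BOM found; this is only a guess


# Sample bytes shaped like the start of the file from the question.
sample = codecs.BOM_UTF16_LE + "AAS01".encode("utf-16-le")
guessed = guess_encoding_from_bom(sample)
decoded = sample.decode(guessed)
```

The returned names are the BOM-aware codecs, so decoding with them removes the mark instead of leaving '\ufeff' in the text.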

UCS-2 is also deprecated, for the simple reason that Unicode outgrew it. The typical replacement would be UTF-16.
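To illustrate what "outgrew" means here: UCS-2 uses exactly one 16-bit unit per character, so it cannot represent code points above U+FFFF, while UTF-16 encodes those as a pair of surrogate units. A short sketch, using U+1F600 as an arbitrary example character:

```python
import struct

bmp_char = "A"              # U+0041, fits in a single 16-bit unit
astral_char = "\U0001F600"  # U+1F600, beyond the range UCS-2 can hold

bmp_bytes = bmp_char.encode("utf-16-le")
astral_bytes = astral_char.encode("utf-16-le")

# Split the four bytes into two little-endian 16-bit code units.
high, low = struct.unpack("<2H", astral_bytes)

# Both units fall in the surrogate ranges reserved for this purpose.
is_surrogate_pair = 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF
```

For text that stays within the first 65,536 code points, the bytes of UCS-2 and UTF-16 are identical, which is why a UTF-16 codec reads such files without trouble.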

More information is in the related question "Python 3: reading UCS-2 (BE) file".
