I got this error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I found this solution:

>>> b"abcde".decode("utf-8")

from here: Convert bytes to a Python string

But how do you use it if a) you don’t know where the 0xff is and/or b) you need to decode a file object? What is the correct syntax / format?

I am parsing through a directory, so I tried going through the files one at a time. (NOTE: This won't work when the project gets larger!!!)

>>> i = "b'0xff'"
>>> with open('firstfile') as f:
...     g=f.readlines()
... 
>>> i in g
False
>>> 0xff in g
False
>>> '0xff' in g
False
>>> b'0xff' in g
False

>>> with open('secondfile') as f:
<snip - same process>

>>> with open('thirdfile') as f:
...     g = f.readlines()
... 
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/local/lib/python3.4/codecs.py", line 313, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

So if this is the right file, and I can't open it with Python (I opened it in Sublime Text and found nothing), how do I decode, or encode, it? Thanks.

Community
Malik A. Rumi

3 Answers

You have a number of problems:

  • i = "b'0xff'" creates a str of 7 characters, not a single 0xFF byte. i = b'\xff' or i = bytes([0xff]) is the correct way to create that byte.

  • open defaults to decoding files using the encoding returned by locale.getpreferredencoding(False). Open in binary mode to get raw, undecoded bytes: open('firstfile','rb').

  • g=f.readlines() returns a list of lines, so i in g is True only when i equals an entire line in that list, not when i occurs somewhere inside a line.

  • Use meaningful variable names!

Instead:

byte = b'\xff'
with open('firstfile','rb') as f:
    file_content = f.read()
if byte in file_content:
    ...
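
To answer the "where is the 0xff" part of the question, the same binary read can report an offset. This is a minimal sketch; first_offset and first_offset_in_file are hypothetical helper names, not part of the answer above.

```python
def first_offset(data, needle=b'\xff'):
    """Return the index of the first occurrence of needle in data, or -1."""
    return data.find(needle)


def first_offset_in_file(path, needle=b'\xff'):
    """Read path in binary mode (no decoding) and locate needle."""
    with open(path, 'rb') as f:
        return first_offset(f.read(), needle)


# The string from the question is seven characters; the byte literal is one byte.
print(len("b'0xff'"), len(b'\xff'))
```

An offset of 0 matches the "position 0" reported in the traceback.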

To decode a file, you need to know its correct encoding and provide it when you open the file:

with open('firstfile',encoding='utf8') as f:
    file_content = f.read()

If you don't know the encoding, the 3rd party chardet module can help you guess.
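
Before reaching for chardet, a cheaper first check is to look for a byte order mark (BOM) at the start of the file. This is a different and much narrower technique than chardet's guessing: it only recognizes files that begin with a BOM. The sketch below uses only the standard library; sniff_bom is a hypothetical helper name.

```python
import codecs

# Longer BOMs are listed first because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def sniff_bom(head):
    """Return a codec name suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if head.startswith(bom):
            return name
    return None


print(sniff_bom(b'\xff\xfet\x00h\x00'))
```

A file whose first byte is 0xff, as in the traceback, often starts with b'\xff\xfe', the little-endian UTF-16 BOM. A None result says nothing either way, since many encoded files carry no BOM.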

Mark Tolonen
  • I'd give 1+ for chardet. This post http://stackoverflow.com/questions/1461907/html-encoding-issues-%C3%82-character-showing-up-instead-of-nbsp was also very helpful. – Malik A. Rumi Feb 25 '17 at 02:26
# How to decode byte 0xff in Python

The byte 0xff never appears in valid UTF-8, so the 'utf-8' codec cannot decode it. A file that starts with 0xff is often UTF-16 text, because the little-endian UTF-16 byte order mark is b'\xff\xfe'. In that case, use the 'UTF-16' (or 'utf-16') encoding to decode the bytes into a string.

Let me help you understand this:

st = "this world is very beautiful"
print(st.encode('utf-16'))
b'\xff\xfet\x00h\x00i\x00s\x00 \x00w\x00o\x00r\x00l\x00d\x00 \x00i\x00s\x00 \x00v\x00e\x00r\x00y\x00 \x00b\x00e\x00a\x00u\x00t\x00i\x00f\x00u\x00l\x00'

Now we want to convert these bytes back into an ordinary string. There are two ways to decode bytes that begin with 0xff like this.

st = b'\xff\xfet\x00h\x00i\x00s\x00 \x00w\x00o\x00r\x00l\x00d\x00 \x00i\x00s\x00 \x00v\x00e\x00r\x00y\x00 \x00b\x00e\x00a\x00u\x00t\x00i\x00f\x00u\x00l\x00'  

First is:

print(str(st, "utf-16"))

Second is:

print(st.decode('UTF-16'))   

We will get the string as output:

this world is very beautiful
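
The same decoding can be applied while reading a file by passing the encoding to open. This is a sketch that assumes the file really is UTF-16 with a BOM; read_text and the sample file it writes are illustrative names only.

```python
import os
import tempfile


def read_text(path, encoding='utf-16'):
    """Open path in text mode with an explicit encoding and return its contents."""
    with open(path, encoding=encoding) as f:
        return f.read()


# Demo: write a small UTF-16 file (little-endian BOM plus text), then read it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'sample.txt')
    with open(path, 'wb') as f:
        f.write(b'\xff\xfe' + 'this world'.encode('utf-16-le'))
    text = read_text(path)

print(text)
```

If the file is not actually UTF-16, this call can raise UnicodeDecodeError or quietly return wrong characters, so check the result.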
bsplosion
Priyansh gupta
The easiest way is to use try/except to catch the UnicodeDecodeError; the file that raises it is the one with the problem.

Most likely that file was not encoded in UTF-8. In this case you might want to read the file in as binary:

with open('thirdfile', 'rb') as f:
    g = f.readlines()

How to figure out the file's encoding is a different problem.
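
For a whole directory, the try/except idea might be sketched as below. The helper names is_utf8 and non_utf8_files are hypothetical, and the sketch assumes the files are small enough to read into memory in one go.

```python
import os


def is_utf8(data):
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


def non_utf8_files(directory):
    """List the regular files directly inside directory that are not valid UTF-8."""
    bad = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-files
        with open(path, 'rb') as f:  # binary mode, so open itself never decodes
            if not is_utf8(f.read()):
                bad.append(path)
    return bad
```

Calling non_utf8_files('.') returns the paths that would raise the error from the question when opened in text mode with UTF-8 as the default encoding.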

David Metcalfe
dragonx