I am using Python 3.7.

I have data being read from two files. Both contain UTF-8 data (well, technically...). One is properly encoded as UTF-8, while the other was written as a decoded bytestream, i.e. the decoded code point values were written out directly as bytes.

# 'ba' is the on-disk form of the first (correctly-encoded) file
>>> ba = b'\xc3\xb6'
>>> ba
b'\xc3\xb6'
>>> ba.decode()
'ö'

# 'bb' is the on-disk form of the incorrectly-encoded second file
>>> bb = b'\xf6'
>>> bb
b'\xf6'

# 'bs' is the str whose code point has the same value as the byte in bb
>>> bs = '\xf6'
>>> bs
'ö'

# If I try to decode ba, I get the correct value.
>>> ba.decode()
'ö'
>>> ba.decode() == bs
True

# But if I try to decode bb, I get a decoding error.
>>> bb.decode() == bs
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte

How do I turn bb from an incorrectly-encoded bytes object into a correctly-decoded str object? I am okay with assuming that bb is a blob of already-decoded UTF-8 data rather than garbage.

I have already done a little searching to try to solve this myself. It's honestly rather difficult to filter the Python 2.x cruft out of Google results, though.

I did find this answer helpful since it mentions the "unicode-escape" encoding which appears to do what I want it to do:

>>> bb.decode('unicode-escape')
'ö'
>>> bb.decode('unicode-escape') == bs
True

However, it is not clear to me what side effects 'unicode-escape' might have outside of this specific scenario. Its name suggests that it processes escape sequences, but I don't believe b'\xf6' is escaped: I believe it is a single byte, written in hex notation for the Python interpreter. I would assume that escaped data would be something like b'\\xf6', which should be four bytes: ASCII backslash, ASCII 'x', ASCII 'f', and ASCII '6'.
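
A quick way to check that assumption is to look at the raw byte values. This is only a sanity check; the names simply mirror the ones used elsewhere in this question:

```python
# bb is the single raw byte; bc is the escaped, four-byte spelling.
bb = b'\xf6'
bc = b'\\xf6'

# Iterating over a bytes object yields the integer value of each byte.
bb_values = list(bb)   # one byte: 0xf6
bc_values = list(bc)   # four bytes: backslash, 'x', 'f', '6'

print(len(bb), [hex(v) for v in bb_values])
print(len(bc), [hex(v) for v in bc_values])
```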

>>> bc = b'\\xf6'
>>> bc
b'\\xf6'
>>> bc.decode()
'\\xf6'
>>> bc == bb
False
>>> bc.decode('unicode-escape')
'ö'

To be clear, I do not want such escaping to be processed! I want the decoded (str) version of bc to be equal to '\\xf6', whose len is 4.
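
To pin down the behavior I'm after, here is a rough sketch. The helper name is hypothetical, and it assumes that every byte in the second file is simply the value of one code point, which I have only confirmed for the examples above. I don't know whether this is the right way to do it, or whether a built-in codec already does the same thing:

```python
def code_points_from_bytes(data: bytes) -> str:
    """Map each byte to the character with the same code point value.

    Hypothetical helper: assumes one byte per character and performs
    no escape processing at all.
    """
    return ''.join(chr(byte) for byte in data)


bb = b'\xf6'    # single raw byte
bc = b'\\xf6'   # four ASCII bytes that merely look like an escape

converted_bb = code_points_from_bytes(bb)
converted_bc = code_points_from_bytes(bc)
```

With this sketch, bb becomes the one-character str I expect, while bc stays as four characters instead of being collapsed the way 'unicode-escape' collapses it.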

Edit1: A user seems to think this is a duplicate question. I have read the other question and do not see how this is a duplicate. The other question talks about latin-1. I do not believe I am dealing with latin-1 data.
