I am using Python 3.7.
I have data being read from two files. Both contain UTF-8 data (well, technically...). One is properly "encoded" into UTF-8 while the other was written as a decoded bytestream.
# 'ba' is the on-disk form of the first (correctly-encoded) file
>>> ba = b'\xc3\xb6'
>>> ba
b'\xc3\xb6'
>>> ba.decode()
'ö'
# 'bb' is the on-disk form of the incorrectly-encoded second file
>>> bb = b'\xf6'
>>> bb
b'\xf6'
# 'bs' is the unicode version of the same byte value as bb
>>> bs = '\xf6'
>>> bs
'ö'
# If I try to decode ba, I get the correct value.
>>> ba.decode()
'ö'
>>> ba.decode() == bs
True
# But if I try to decode bb, I get an encoding error.
>>> bb.decode() == bs
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 0: invalid start byte
How do I turn bb from an incorrectly-encoded bytes object into a correctly-decoded str object? I am okay with assuming that bb is a UTF-8 decoded data blob rather than garbage.
I have already done a little searching to try to solve this myself. It's honestly rather difficult to filter the Python 2.x cruft out of Google results, though.
I did find this answer helpful since it mentions the "unicode-escape" encoding which appears to do what I want it to do:
>>> bb.decode('unicode-escape')
'ö'
>>> bb.decode('unicode-escape') == bs
True
However, it is not clear to me what side effects 'unicode-escape' might have outside of this specific scenario. Its name suggests that it reads escaped encodings, but I don't believe b'\xf6' is escaped: I believe it is a single byte, written as hex for the Python interpreter. I would assume that an escaped encoding would be something like b'\\xf6', which should be four bytes: ASCII backslash, ASCII 'x', ASCII 'f', and ASCII '6'.
>>> bc = b'\\xf6'
>>> bc
b'\\xf6'
>>> bc.decode()
'\\xf6'
>>> bc == bb
False
>>> bc.decode('unicode-escape')
'ö'
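The byte-level difference between the two values can be checked directly. This is only a minimal sketch that reuses the bb and bc values from the sessions above; nothing in it is new data.

```python
bb = b'\xf6'    # one raw byte with value 0xF6
bc = b'\\xf6'   # four ASCII bytes: backslash, 'x', 'f', '6'

# Length and raw byte values of each object
print(len(bb), list(bb))   # 1 [246]
print(len(bc), list(bc))   # 4 [92, 120, 102, 54]

# 'unicode-escape' collapses both inputs to the same one-character string
same = bb.decode('unicode-escape') == bc.decode('unicode-escape')
print(same)                # True
```

So the two inputs are clearly different on disk, yet 'unicode-escape' makes them indistinguishable after decoding.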
To be clear, I do not want such escaping to be processed! I would want the decoded (str) version of bc to be equal to '\\xf6', whose len is 4.
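To make the requirements concrete, here is a rough sketch of a check that any candidate decoder would need to pass. The helper name check_decoder and the case labels are hypothetical, made up for this question only; the three inputs are the ba, bb, and bc values from above.

```python
# Hypothetical helper (not from any library): states the three requirements
# from this question and reports which ones a given decoder fails.
def check_decoder(decode):
    cases = {
        "utf8_bytes": (b'\xc3\xb6', '\xf6'),         # correctly encoded file
        "raw_byte": (b'\xf6', '\xf6'),               # incorrectly written file
        "literal_backslash": (b'\\xf6', '\\xf6'),    # 4 ASCII chars must stay 4 chars
    }
    failures = []
    for name, (raw, expected) in cases.items():
        try:
            ok = decode(raw) == expected
        except UnicodeDecodeError:
            ok = False
        if not ok:
            failures.append(name)
    return failures

utf8_failures = check_decoder(lambda b: b.decode('utf-8'))
escape_failures = check_decoder(lambda b: b.decode('unicode-escape'))

print(utf8_failures)     # ['raw_byte']
print(escape_failures)   # ['utf8_bytes', 'literal_backslash']
```

Plain UTF-8 decoding fails only on the raw byte, while 'unicode-escape' handles the raw byte but mangles the correctly encoded file and collapses the literal backslash sequence. What I am looking for is something that returns an empty list here.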
Edit1: A user seems to think this is a duplicate question. I have read the other question and do not see how this is a duplicate. The other question talks about latin-1. I do not believe I am dealing with latin-1 data.