Using unicode_escape
TL;DR: You can decode bytes using the unicode_escape encoding to convert \xXX and \uXXXX escape sequences to the corresponding characters:
>>> r'\xc3\x85lesund'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85lesund'
First, encode the string to bytes so it can be decoded:
>>> r'\xc3\x85あ'.encode('utf-8')
b'\\xc3\\x85\xe3\x81\x82'
(I changed the string to show that this process works even for characters outside of Latin-1.)
Here's how each character is encoded (note that あ is encoded into multiple bytes):
- \ (U+005C) -> 0x5c
- x (U+0078) -> 0x78
- c (U+0063) -> 0x63
- 3 (U+0033) -> 0x33
- \ (U+005C) -> 0x5c
- x (U+0078) -> 0x78
- 8 (U+0038) -> 0x38
- 5 (U+0035) -> 0x35
- あ (U+3042) -> 0xe3, 0x81, 0x82
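If you want to check this mapping yourself, here is a minimal sketch that prints each character with its code point and UTF-8 bytes. The helper name utf8_breakdown is made up for illustration; it is not part of any library.

```python
# utf8_breakdown is a hypothetical helper name, used only for illustration.
def utf8_breakdown(text):
    """Return (character, code point, UTF-8 bytes) for each character."""
    return [(ch, f'U+{ord(ch):04X}', ch.encode('utf-8')) for ch in text]

# Raw string: the backslashes are literal characters, not escapes.
s = r'\xc3\x85あ'

for ch, code_point, encoded in utf8_breakdown(s):
    # Show each byte in the same 0x.. form used in the list above.
    print(ch, code_point, '->', ', '.join(f'0x{b:02x}' for b in encoded))
```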
Next, decode the bytes as unicode_escape to replace each escape sequence with its corresponding character:
>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape')
'Ã\x85ã\x81\x82'
Each escape sequence is converted to a separate character; each byte that is not part of an escape sequence is converted to the character with the corresponding ordinal value:
- \\xc3 -> U+00C3
- \\x85 -> U+0085
- \xe3 -> U+00E3
- \x81 -> U+0081
- \x82 -> U+0082
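As a quick sanity check, you can list the code points of the decoded string. This is only a sketch; the variable names are my own.

```python
# Decode the UTF-8 bytes of the raw string as unicode_escape.
decoded = r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape')

# Each character in the result has an ordinal below 256.
code_points = [f'U+{ord(ch):04X}' for ch in decoded]
print(code_points)
# ['U+00C3', 'U+0085', 'U+00E3', 'U+0081', 'U+0082']
```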
Finally, encode the string to bytes again:
>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85\xe3\x81\x82'
Encoding as Latin-1 simply converts each character to its ordinal value:
- U+00C3 -> 0xc3
- U+0085 -> 0x85
- U+00E3 -> 0xe3
- U+0081 -> 0x81
- U+0082 -> 0x82
And voilà, we have the byte sequence you're looking for.
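If you need this in more than one place, the three steps can be wrapped in a small function. This is a sketch, and the name unescape_to_bytes is my own invention. It assumes every escape sequence in the input decodes to a character at or below U+00FF; an escape such as \u3042 would make the final Latin-1 encode raise UnicodeEncodeError.

```python
# unescape_to_bytes is a hypothetical name, not a standard library function.
def unescape_to_bytes(text: str) -> bytes:
    r"""Interpret \xXX escape sequences in text and return the resulting bytes."""
    # Step 1: encode to bytes so the unicode_escape codec can decode it.
    escaped = text.encode('utf-8')
    # Step 2: replace each escape sequence with its corresponding character.
    decoded = escaped.decode('unicode_escape')
    # Step 3: map each character back to the byte with the same ordinal.
    return decoded.encode('latin-1')

print(unescape_to_bytes(r'\xc3\x85lesund'))                  # b'\xc3\x85lesund'
print(unescape_to_bytes(r'\xc3\x85lesund').decode('utf-8'))  # Ålesund
```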
Using codecs.escape_decode
As an alternative, you can use the codecs.escape_decode function to interpret escape sequences in a bytes-to-bytes conversion, as user19087 posted in an answer to a similar question:
>>> import codecs
>>> codecs.escape_decode(r'\xc3\x85lesund'.encode('utf-8'))[0]
b'\xc3\x85lesund'
However, codecs.escape_decode is undocumented, so I wouldn't recommend using it.