How can I convert literal escape sequences in a string to the corresponding bytes?

Question

I have a UTF-8 encoded string that comes from somewhere else that contains the characters \xc3\x85lesund (literal backslash, literal "x", literal "c", etc).

Printing it outputs the following:

\xc3\x85lesund

I want to convert it to a bytes variable:

b'\xc3\x85lesund'

So that I can then decode it to:

'Ålesund'

How can I do this? I'm using Python 3.4.

Are you trying to do something like this? [Process escape sequences in a string in Python](http://stackoverflow.com/q/4020539/176646) — ThisSuitIsBlackNot, Jan 09 '17 at 16:55
`s=u'\xc3\x85lesund'` then `bytearray(s, 'Latin-1')` or `bytearray(s, 'ISO-8859-1')` — Bill Bell, Jan 09 '17 at 17:03
@ThisSuitIsBlackNot Exactly, sadly using the accepted answer some info is lost as the 2nd post states. It returns `Ãlesund`. Going to try the 2nd post approach with codecs. — Rafael Almeida, Jan 09 '17 at 17:06
@BillBell Doesn't work, it creates the byte array `bytearray(b'\\xc3\\x85lesund')` — Rafael Almeida, Jan 09 '17 at 17:07
@ThisSuitIsBlackNot The 2nd poster's answer also didn't work; it yielded the same `Ãlesund` — Rafael Almeida, Jan 09 '17 at 17:12
The [third solution](http://stackoverflow.com/a/37059682/176646) works for me (although the method it uses is undocumented). — ThisSuitIsBlackNot, Jan 09 '17 at 17:24
@ThisSuitIsBlackNot Eureka! It does work, awesome. Post it as an answer and I'll accept it — Rafael Almeida, Jan 09 '17 at 17:30

score 7 · Accepted Answer · edited May 23 '17 at 12:01

Using `unicode_escape`

TL;DR You can decode bytes using the unicode_escape encoding to convert \xXX and \uXXXX escape sequences to the corresponding characters:

>>> r'\xc3\x85lesund'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85lesund'

First, encode the string to bytes so it can be decoded:

>>> r'\xc3\x85あ'.encode('utf-8')
b'\\xc3\\x85\xe3\x81\x82'

(I changed the string to show that this process works even for characters outside of Latin-1.)

Here's how each character is encoded (note that あ is encoded into multiple bytes):

\ (U+005C) -> 0x5c
x (U+0078) -> 0x78
c (U+0063) -> 0x63
3 (U+0033) -> 0x33
\ (U+005C) -> 0x5c
x (U+0078) -> 0x78
8 (U+0038) -> 0x38
5 (U+0035) -> 0x35
あ (U+3042) -> 0xe3, 0x81, 0x82

Next, decode the bytes as unicode_escape to replace each escape sequence with its corresponding character:

>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape')
'Ã\x85ã\x81\x82'

Each escape sequence is converted to a separate character; each byte that is not part of an escape sequence is converted to the character with the corresponding ordinal value:

\\xc3 -> U+00C3
\\x85 -> U+0085
\xe3 -> U+00E3
\x81 -> U+0081
\x82 -> U+0082
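
To check the mapping above yourself, you can list the code point of each character in the decoded string. This is only a small sketch; the variable names are mine and not part of any API:

```python
# Decode the escaped text, then list the code point of every resulting character.
decoded = r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape')

# Format each character's ordinal as U+XXXX.
code_points = ['U+{:04X}'.format(ord(ch)) for ch in decoded]
print(code_points)
```

The two escape sequences each become one character, and the three UTF-8 bytes of あ become three separate characters, so the list has five entries.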

Finally, encode the string to bytes again:

>>> r'\xc3\x85あ'.encode('utf-8').decode('unicode_escape').encode('latin-1')
b'\xc3\x85\xe3\x81\x82'

Encoding as Latin-1 simply converts each character to its ordinal value:

U+00C3 -> 0xc3
U+0085 -> 0x85
U+00E3 -> 0xe3
U+0081 -> 0x81
U+0082 -> 0x82
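
As a quick sanity check of that claim, the sketch below compares `.encode('latin-1')` with building the bytes directly from each character's ordinal. It also shows the limitation: Latin-1 only covers U+0000 through U+00FF, so a character outside that range cannot be encoded this way. The variable names are illustrative only:

```python
# The intermediate string produced by the unicode_escape step.
text = 'Ã\x85ã\x81\x82'

# Latin-1 maps each code point 0..255 to the byte with the same value,
# so these two constructions should give identical bytes.
as_latin1 = text.encode('latin-1')
by_ordinal = bytes(ord(ch) for ch in text)
print(as_latin1 == by_ordinal)

# A character above U+00FF has no Latin-1 byte, so encoding it fails.
try:
    'あ'.encode('latin-1')
    fits_in_latin1 = True
except UnicodeEncodeError:
    fits_in_latin1 = False
print(fits_in_latin1)
```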

And voilà, we have the byte sequence you're looking for.
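
Putting the three steps together, here is a minimal sketch of a helper, followed by the final UTF-8 decode the question asks for. The name `unescape_to_bytes` is hypothetical, and the sketch assumes the input only contains `\xXX`-style escapes; a `\uXXXX` escape above U+00FF would make the final Latin-1 step raise `UnicodeEncodeError`.

```python
def unescape_to_bytes(text):
    """Turn literal \\xXX escape sequences in text into the bytes they denote."""
    # Step 1: str -> bytes, so that there is something to decode.
    raw = text.encode('utf-8')
    # Step 2: interpret the escape sequences; every other byte becomes the
    # character with the same ordinal value.
    unescaped = raw.decode('unicode_escape')
    # Step 3: map each character (ordinal 0..255) back to a single byte.
    return unescaped.encode('latin-1')


escaped = r'\xc3\x85lesund'        # literal backslash, "x", "c", "3", ...
data = unescape_to_bytes(escaped)  # the bytes the escapes describe
city = data.decode('utf-8')        # interpret those bytes as UTF-8
print(data, city)
```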

Using `codecs.escape_decode`

As an alternative, you can use the `codecs.escape_decode` function to interpret escape sequences in a bytes-to-bytes conversion, as user19087 posted in an answer to a similar question:

>>> import codecs
>>> codecs.escape_decode(r'\xc3\x85lesund'.encode('utf-8'))[0]
b'\xc3\x85lesund'

However, codecs.escape_decode is undocumented, so I wouldn't recommend using it.
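
If you decide to use it anyway, a small wrapper keeps the undocumented call in one place so it is easy to swap out later. This is a sketch only; `unescape_with_codecs` is a name I made up, and since the function is undocumented, its behavior could differ between Python versions:

```python
import codecs


def unescape_with_codecs(text):
    """Bytes-to-bytes unescaping via the undocumented codecs.escape_decode."""
    # escape_decode returns a tuple of (decoded bytes, input length consumed);
    # only the bytes are needed here.
    data, _consumed = codecs.escape_decode(text.encode('utf-8'))
    return data


result = unescape_with_codecs(r'\xc3\x85lesund')
print(result.decode('utf-8'))
```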

Both solutions worked wonderfully, if it's not asking too much can you give me the reasoning behind `.encode('latin-1')`? — Rafael Almeida, Jan 11 '17 at 11:54
@RafaelAlmeida I added a more detailed explanation. Sorry that took so long! — ThisSuitIsBlackNot, Jan 12 '17 at 22:58
