Convert "\x" escaped string into readable string in python

Question

Is there a way to convert a \x escaped string like "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80" into readable form: "語言"?

>>> a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
>>> print(a)
\xe8\xaa\x9e\xe8\xa8\x80

I am aware that there is a similar question here, but it seems the solution is only for latin characters. How can I convert this form of string into readable CJK characters?

score 14 · Accepted Answer · answered Aug 02 '20 at 17:25

First decode the escapes using 'unicode_escape', then decode the resulting bytes as 'utf8':

a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
decoded = a.encode('latin1').decode('unicode_escape').encode('latin1').decode('utf8')
print(decoded)

# 語言

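Note that only bytes objects can be decoded, so the string has to be encoded again before each decoding step. 'latin1' is used for this because it maps every character from U+0000 to U+00FF to the single byte with the same value, so the data passes through unchanged.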
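To make the chain easier to follow, here is a minimal sketch that splits the one-liner into its four steps. The names step1 to step4 are only illustrative and are not part of the original answer.

```python
a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"

# Step 1: str -> bytes. The text is pure ASCII, so latin1 leaves it unchanged.
step1 = a.encode("latin1")

# Step 2: interpret the \xNN escapes. Each one becomes a single character
# in the range U+0000..U+00FF, so 24 characters shrink to 6.
step2 = step1.decode("unicode_escape")

# Step 3: map each of those characters to the byte with the same value.
step3 = step2.encode("latin1")

# Step 4: the six bytes are UTF-8 data, so decode them as such.
step4 = step3.decode("utf8")

print(len(step1), len(step2), step3, step4)
```

Only the last step assumes anything about the text itself, namely that the escaped bytes are UTF-8.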

Thierry Lathuille


I am curious why we need to encode it using latin1; it doesn't seem to work if I encode it with utf-8: `a.encode('utf8').decode('unicode-escape').encode('utf8').decode('utf8')` – oeter Aug 02 '20 at 17:40

Because you have a 1 to 1 correspondence between bytes and characters in latin1, while some characters are represented by up to 4 bytes when encoded in utf8, and some byte sequences don't represent a character or are invalid. – Thierry Lathuille Aug 02 '20 at 17:44
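The difference can be checked directly. The sketch below (variable names are illustrative) first confirms that every byte value survives a latin1 round trip, then compares the two choices for the in-between encoding step.

```python
# Every byte value 0..255 survives a latin1 decode/encode round trip.
all_bytes = bytes(range(256))
assert all_bytes.decode("latin1").encode("latin1") == all_bytes

a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
chars = a.encode("utf8").decode("unicode_escape")  # 6 characters, all >= U+0080

via_latin1 = chars.encode("latin1")  # 6 bytes: one byte per character
via_utf8 = chars.encode("utf8")      # 12 bytes: two bytes per character

print(via_latin1.decode("utf8"))              # the intended text
print(via_utf8.decode("utf8") == chars)       # True: utf8 just undoes itself
```

With utf8 in between, the final decode only reverses the encode that came right before it, so the result is the same six Latin-1 range characters rather than the CJK text.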

score 3 · Answer 2 · answered Aug 02 '20 at 17:54

Starting with string a, which appears to follow Python's hex escaping rules, you can decode it to a tuple holding a bytes object and the length of the input that was consumed.

>>> a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
>>> import codecs
>>> codecs.escape_decode(a)
(b'\xe8\xaa\x9e\xe8\xa8\x80', 24)

You don't need the length here, so just take item 0. Now it's time for some guessing. Assuming that this string actually represents UTF-8 encoded text, you now have a bytes object that you can decode:

>>> codecs.escape_decode(a)[0].decode('utf-8')
'語言'

If the underlying encoding was different (say, a Windows CJK code page), you'd have to decode with its decoder.
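As a sketch of that case, the example below builds an escaped string from Big5 bytes and decodes it again. The helper name unescape is hypothetical, Big5 is just one example of a CJK encoding, and codecs.escape_decode is the same undocumented function used above, so its behavior is assumed rather than guaranteed.

```python
import codecs

def unescape(escaped: str, encoding: str = "utf-8") -> str:
    """Hypothetical helper: turn \\xNN escapes into bytes, then decode them."""
    raw, _consumed = codecs.escape_decode(escaped)
    return raw.decode(encoding)

# Build a sample the way another program might have produced it:
# encode the text as Big5, then write every byte as a \xNN escape.
big5_escaped = "".join(f"\\x{byte:02x}" for byte in "語言".encode("big5"))

print(big5_escaped)
print(unescape(big5_escaped, "big5"))
```

The escaped string alone does not say which encoding produced it, so the encoding argument has to come from knowledge of the data source.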

tdelaney


Thank you, it works:) Btw, could you explain a little bit how `codecs.escape_decode()` works under the hood? I've been searching for doc about this function but couldn't find any, `help(escape_decode)` doesn't give me any information either. – oeter Aug 03 '20 at 12:12
I _think_ that `escape_decode` turns around and calls the parser's literal string decode logic. If the string were `a = "\xe8\xaa\x9e\xe8\xa8\x80"` (single backslashes) the literal string parser would make the bytes you want. By calling `escape_decode` you are essentially calling the literal string parser a second time. I didn't realize it has no docs. – tdelaney Aug 03 '20 at 13:51

score 0 · Answer 3 · answered Aug 05 '22 at 01:49

Text like this could make a valid Python bytes literal. Assuming we don't have to worry about invalid input, we can simply construct a string that looks like the corresponding source code, and use ast.literal_eval to interpret it that way (this is safe, unlike using eval). Finally we decode the resulting bytes as UTF-8. Thus:

>>> import ast
>>> a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
>>> ast.literal_eval(f"b'{a}'")
b'\xe8\xaa\x9e\xe8\xa8\x80'
>>> ast.literal_eval(f"b'{a}'").decode('utf-8')
'語言'
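If invalid input is a possibility after all, one option is to wrap the call and treat failures as "not decodable". This is only a sketch: the function name decode_escaped and the choice to return None on failure are assumptions, not part of the answer.

```python
import ast

def decode_escaped(text, encoding="utf-8"):
    """Hypothetical wrapper: return the decoded text, or None if text is not
    a valid bytes-literal body or its bytes do not decode."""
    try:
        value = ast.literal_eval(f"b'{text}'")
        if not isinstance(value, bytes):
            # e.g. a stray quote and comma turned the literal into a tuple
            return None
        return value.decode(encoding)
    except (SyntaxError, ValueError):
        # SyntaxError: malformed literal (bad escape, unbalanced quote, ...)
        # ValueError: rejected expression, or UnicodeDecodeError on decode
        return None

print(decode_escaped("\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"))
print(decode_escaped("\\xZZ"))
```

Input that contains a single quote, such as "it's", ends the literal early and is reported as invalid by this wrapper instead of being decoded.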

Karl Knechtel


Modeled on @Barmar's answer at https://stackoverflow.com/a/68751612/523612 . I decided that this version of the question should be the canonical, but it was missing this approach. – Karl Knechtel Aug 05 '22 at 01:51

score 0 · Answer 4 · answered Aug 05 '22 at 01:57

Such a codec is missing in stdlib. My package all-escapes registers a codec which can be used:

>>> a = "\\xe8\\xaa\\x9e\\xe8\\xa8\\x80"
>>> a.encode('all-escapes').decode()
'語言'

wim

