Fix a unicode string broken by (some) escaped backslashes

Question

I was looking at this question: python3 replacing double backslash with single backslash [duplicate]

and sifting through the responses to similar questions (Python Replace \\ with \, Why can't Python's raw string literals end with a single backslash?, How do I unescape a unicode escaped string in python?) when I realised that none of the answers really solves this problem. Say I have a broken unicode string that contains both escaped backslashes and real escape characters:

my_str = '\\xa5\\xc0\\xe6aK\xf9\\x80\\xb1\\xc8*\x01\x12$\\xfbp\x1e(4\\xd6{;Z'

When I print it, only some of the escape sequences are evaluated:

print(my_str)
\xa5\xc0\xe6aKù\x80\xb1\xc8*☺↕$\xfbp▲(4\xd6{;Z
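The difference between the two kinds of sequence can be checked by comparing lengths. This is a minimal sketch; the names `escaped` and `evaluated` are only for illustration and are not part of my real code:

```python
# In a normal (non-raw) literal, '\\' is one literal backslash character,
# so '\\xa5' is the four characters: backslash, 'x', 'a', '5'.
escaped = '\\xa5'

# '\xf9' is a real escape sequence: one character, U+00F9.
evaluated = '\xf9'

print(len(escaped), len(evaluated))  # prints: 4 1
```

So the parts that print as `\xa5` are stored as four separate characters, while the parts that print as symbols are already single characters.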

I can manually fix it like this:

my_str = repr(my_str)
my_str
"'\\\\xa5\\\\xc0\\\\xe6aKù\\\\x80\\\\xb1\\\\xc8*\\x01\\x12$\\\\xfbp\\x1e(4\\\\xd6{;Z'"
my_str = my_str.replace('\\\\','\\')
print(my_str)
'\xa5\xc0\xe6aKù\x80\xb1\xc8*\x01\x12$\xfbp\x1e(4\xd6{;Z'

But at this point I have to manually copy and paste the result of print into a variable to finish the fix:

my_str = '\xa5\xc0\xe6aKù\x80\xb1\xc8*\x01\x12$\xfbp\x1e(4\xd6{;Z'
print(my_str)
¥ÀæaKù±È*☺↕$ûp▲(4Ö{;Z

How do I do this without copying and pasting?

Jean-François Fabre · Accepted Answer · 2018-05-06T14:21:52.973

2

Strip off the single quotes, encode to get bytes, then decode using "unicode-escape":

# original code
my_str = '\\xa5\\xc0\\xe6aK\xf9\\x80\\xb1\\xc8*\x01\x12$\\xfbp\x1e(4\\xd6{;Z'
my_str = repr(my_str)
my_str = my_str.replace('\\\\','\\')
print(my_str)
# encode/decode stuff
print(my_str.strip("'").encode().decode("unicode-escape"))

prints:

'\xa5\xc0\xe6aKù\x80\xb1\xc8*\x01\x12$\xfbp\x1e(4\xd6{;Z'
¥ÀæaKÃ¹±È*$ûp(4Ö{;Z
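The `Ã¹` in that output appears because `.encode()` defaults to UTF-8, which turns the already-evaluated `ù` into two bytes, and "unicode-escape" then decodes each byte as Latin-1. A possible variation, assuming every character in the string has a code point below 256 (true for this sample, not guaranteed in general), is to encode as Latin-1 instead. That also makes the `repr`/`replace`/`strip` steps unnecessary:

```python
my_str = '\\xa5\\xc0\\xe6aK\xf9\\x80\\xb1\\xc8*\x01\x12$\\xfbp\x1e(4\\xd6{;Z'

# Latin-1 maps code points 0-255 one-to-one onto single bytes, so the
# already-evaluated characters survive the round trip unchanged.
# This raises UnicodeEncodeError if the string holds any code point above 255.
fixed = my_str.encode('latin-1').decode('unicode_escape')

# With UTF-8, the single character U+00F9 becomes the two bytes C3 B9,
# which unicode_escape then reads back as two separate characters.
mangled = my_str.encode('utf-8').decode('unicode_escape')

print(ascii(fixed))
print('\xc3\xb9' in mangled)  # prints: True
```

`ascii()` is used for printing only so that the result does not depend on the terminal encoding.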

edited May 06 '18 at 14:21

answered May 06 '18 at 14:03

Ah, that makes sense.. I was wondering why it was different. I also found another answer that works (using `ast.literal_eval`) here: https://stackoverflow.com/questions/24886123/reverse-repr-function-in-python – Zhenhir May 06 '18 at 14:10
`my_str.strip("'").encode().decode("unicode-escape")` throws me a `UnicodeDecodeError` though. – Sean Francis N. Ballais May 06 '18 at 14:14
Ah, that would be because a ù snuck in there somehow.. happens in python 2.7. literal_eval works in 2.7 though. – Zhenhir May 06 '18 at 14:18
it depends on your terminal too. I pasted OP output string. I shouldn't have done that :) I have edited to start from OP original input – Jean-François Fabre May 06 '18 at 14:19
ast.literal_eval gives a slightly different result – Jean-François Fabre May 06 '18 at 14:22
literal_eval gives a weird result (Ñ└µaK∙Ç▒╚*☺↕$√p▲(4╓{;Z) in 2.7, where encode/decode doesn't work. It still doesn't work when I set `-*- coding: utf-8 -*-` and literal_eval is still weird. The differences in 3 seem to be at the ù character that's causing trouble. – Zhenhir May 06 '18 at 14:37

score 0 · Answer 2 · answered May 06 '18 at 14:43

I mentioned ast.literal_eval in the comments on the accepted answer, but I feel I should include a code snippet here. The approach is from the question "Reverse repr function in Python":

from ast import literal_eval

my_str = '\\xa5\\xc0\\xe6aK\xf9\\x80\\xb1\\xc8*\x01\x12$\\xfbp\x1e(4\\xd6{;Z'
my_str = repr(my_str)
my_str = my_str.replace('\\\\','\\')
print(literal_eval(my_str))

Result (Python 3):

¥ÀæaKù±È*☺↕$ûp▲(4Ö{;Z
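For reuse, the same steps can be wrapped in a small function. This is only a sketch: the name `unescape_backslashes` is made up for illustration, and it assumes the string does not end in a backslash (collapsing the doubled backslash would then leave an unterminated literal, which `literal_eval` rejects with a `SyntaxError`).

```python
from ast import literal_eval


def unescape_backslashes(text):
    """Re-evaluate escape sequences whose backslash was escaped (hypothetical helper)."""
    # repr() yields a quoted Python literal in which each literal
    # backslash of `text` appears doubled.
    literal = repr(text)
    # Collapse the doubled backslashes so that sequences such as \\xa5
    # become real escapes again.
    literal = literal.replace('\\\\', '\\')
    # Safely evaluate the literal back into a str.
    return literal_eval(literal)


my_str = '\\xa5\\xc0\\xe6aK\xf9\\x80\\xb1\\xc8*\x01\x12$\\xfbp\x1e(4\\xd6{;Z'
result = unescape_backslashes(my_str)
print(ascii(result))
```

Because `literal_eval` parses whichever quote style `repr` chose, there is no need to strip the surrounding quotes by hand.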
