Convert utf-8 unicode sequence to utf-8 chars in Python 3

Question

I'm reading data from an AWS S3 bucket in which Unicode characters happen to be escaped with double backslashes.

Because of the double backslashes, each escape sequence is read as a series of literal characters (a backslash, a "u", and four hex digits) instead of the single character it represents.

The following example illustrates the situation.

>>> s1="1+1\\u003d2"
>>> print(s1)
1+1\u003d2

In this case the escape sequence should represent an equals sign.

>>> s2="1+1\u003d2"
>>> print(s2)
1+1=2

Is there a way to convert the literal characters in the first example so that the string s1 is parsed with its escape sequence replaced by the actual character it represents?

juanpa.arrivillaga · Accepted Answer · 2019-02-26T23:13:06.330

5

I believe that the codecs module provides this utility:

>>> import codecs
>>> codecs.decode("1+1\\u003d2", encoding='unicode_escape')
'1+1=2'

This probably points to a larger problem, though. How do these strings come to be in the first place?

Note, if this is being extracted from a valid JSON string (in this case it would be missing the quotes), you could simply use:

>>> import json
>>> json.loads('"1+1\\u003d2"')
'1+1=2'
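As a rough sketch of the JSON approach for a bare fragment (one that is missing its surrounding quotes), the quotes can be added before calling json.loads. The helper name decode_json_escapes is hypothetical, and the sketch assumes the fragment contains no unescaped double quotes or raw control characters, since those would make the wrapped text invalid JSON.

```python
import json


def decode_json_escapes(fragment: str) -> str:
    """Decode JSON-style escapes in a bare fragment (hypothetical helper)."""
    # Wrap the fragment in double quotes so it becomes a JSON string literal.
    # Assumption: the fragment has no unescaped '"' or raw control characters.
    return json.loads('"' + fragment + '"')


s1 = "1+1\\u003d2"  # the literal characters \u003d, as in the question
print(decode_json_escapes(s1))
```

If the data in the bucket is a complete JSON document, calling json.loads on the whole document is probably the simpler route, since the escapes are then decoded as part of normal parsing.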

edited Feb 26 '19 at 23:13

answered Feb 26 '19 at 22:57

juanpa.arrivillaga


I thought that `decode` required you to start with a byte string. I just tried it though and it worked either way... interesting. – Mark Ransom Feb 26 '19 at 23:43
@MarkRansom yeah I think it is a vestige of the old API. At least, it *returns* a `str`. They should add these "Unicode decode" functions in a separate namespace – juanpa.arrivillaga Feb 27 '19 at 00:57
Thanks for clarifying the json.loads, it also works with surrogate sequences – Leonard Saers Feb 28 '19 at 04:09

score 0 · Answer 2 · answered Feb 28 '19 at 04:07

I'm also adding a variant of juanpa.arrivillaga's solution that handles surrogate pairs as well.

>>> import codecs
>>> s1="A surrogate sequence \\ud808\\udf45"
>>> print(codecs.decode(s1, encoding='unicode_escape'))
Traceback (most recent call last):
 File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 21-22: surrogates not allowed
>>> print(codecs.decode(s1, encoding='unicode_escape', errors='surrogateescape').encode('utf-16', 'surrogatepass').decode('utf-16'))
A surrogate sequence 𒍅
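A possible alternative, sketched here rather than taken from either answer, is to replace only the \uXXXX sequences with a regular expression and then merge surrogate pairs with the same UTF-16 round trip used above. This avoids one known drawback of unicode_escape on a str, which is that non-ASCII characters already present in the input come out garbled. The names _U_ESCAPE and decode_u_escapes are hypothetical, and the sketch deliberately ignores other escapes such as \n and escaped backslashes.

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def decode_u_escapes(text: str) -> str:
    """Replace literal \\uXXXX sequences with their characters (hypothetical helper)."""
    # Step 1: turn each escape into the code unit it names; surrogate halves
    # become lone surrogates in the intermediate string.
    replaced = _U_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: round-trip through UTF-16 so adjacent high/low surrogates are
    # merged into one code point. An unpaired surrogate raises an error here.
    return replaced.encode("utf-16", "surrogatepass").decode("utf-16")


print(decode_u_escapes("1+1\\u003d2"))
decoded = decode_u_escapes("A surrogate sequence \\ud808\\udf45")
print(hex(ord(decoded[-1])))  # code point of the merged pair
```

Characters that are not part of an escape sequence pass through unchanged, so already-decoded text in the same string is left alone.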
