In Python, how do I decode strings whose literal content is UTF-8?

Question

I was trying to make an add-on for Anki that imports OPML notes from Mubu. The contents I needed were stored in a str object like the one below, and I was not able to decode them or convert them into bytes objects.

"\x3Cspan\x3E\xE6\x88\x91\xE5\x8F\x91\xE7\x8E\xB0\xE6\x88\x91\xE5\xB1\x85\xE7\x84\xB6\xE6\xB2\xA1\xE6\x9C\x89\xE6\xB5\x8B\xE8\xAF\x95\xE8\xBF\x87\xE4\xB8\xAD\xE6\x96\x87\xEF\xBC\x8C\xE8\xBF\x99\xE4\xB8\xAA\xE5\xB0\xB1\xE5\xA4\xAA\xE7\xA6\xBB\xE8\xB0\xB1\xE4\xBA\x86\xE3\x80\x82\x3C/span\x3E"

Previously, I tried to decode this string using the following method, but it does not support UTF-8:

text = text.encode().decode("unicode_escape")

I wonder if there is a way to turn str objects whose literal content is UTF-8 into bytes objects.

Does [this](https://stackoverflow.com/questions/21665709/python-convert-utf-8-string-to-byte-string) answer your question? — Ceres, Feb 22 '21 at 09:46

score 2 · Answer 1 · answered Feb 22 '21 at 09:51

In Python 3, this can be decoded as follows:

# put a b in front of the string to make it bytes
s = b"\x3Cspan\x3E\xE6\x88\x91\xE5\x8F\x91\xE7\x8E\xB0\xE6\x88\x91\xE5\xB1\x85\xE7\x84\xB6\xE6\xB2\xA1\xE6\x9C\x89\xE6\xB5\x8B\xE8\xAF\x95\xE8\xBF\x87\xE4\xB8\xAD\xE6\x96\x87\xEF\xBC\x8C\xE8\xBF\x99\xE4\xB8\xAA\xE5\xB0\xB1\xE5\xA4\xAA\xE7\xA6\xBB\xE8\xB0\xB1\xE4\xBA\x86\xE3\x80\x82\x3C/span\x3E"
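import chardet  # third-party package (pip install chardet)
# detect() returns a dict; its 'encoding' key holds the guessed encoding name
encoding = chardet.detect(s)
content = s.decode(encoding['encoding'])
print(content)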

It decodes to

<span>我发现我居然没有测试过中文，这个就太离谱了。</span>

The problem is that the string is stored in an str object, therefore this method does not work. I have found the working method, but thanks anyways:-) — Sushi Bear, Feb 23 '21 at 15:25

score 1 · Answer 2 · Maurice Meyer · 2021-02-22T10:03:30.607

If your data is a Python string, you need to convert to bytes (preserving backslashes) first, before decoding:

>>> variable = "\x3Cspan\x3E\xE6\x88\x91\xE5\x8F\x91\xE7\x8E\xB0\xE6\x88\x91\xE5\xB1\x85\xE7\x84\xB6\xE6\xB2\xA1\xE6\x9C\x89\xE6\xB5\x8B\xE8\xAF\x95\xE8\xBF\x87\xE4\xB8\xAD\xE6\x96\x87\xEF\xBC\x8C\xE8\xBF\x99\xE4\xB8\xAA\xE5\xB0\xB1\xE5\xA4\xAA\xE7\xA6\xBB\xE8\xB0\xB1\xE4\xBA\x86\xE3\x80\x82\x3C/span\x3E"
>>> y = variable.encode('raw_unicode_escape')
>>> print(y.decode('utf-8'))
<span>我发现我居然没有测试过中文，这个就太离谱了。</span>
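
Note that this works when the str already holds one character per byte (code points U+0000 to U+00FF). If the str instead contains literal backslashes (the characters `\`, `x`, `E`, `6`, and so on), which the `unicode_escape` attempt in the question suggests, one more step seems to be needed. A minimal sketch, assuming the text is otherwise plain ASCII; `decode_escaped_utf8` is a hypothetical helper name, not something from the question:

```python
def decode_escaped_utf8(text: str) -> str:
    """Turn a str holding literal "\\xNN" escapes of UTF-8 bytes into real text."""
    # Step 1: interpret the escapes. Each "\xNN" becomes ONE character U+00NN,
    # so the result is still mojibake at this point.
    unescaped = text.encode("latin-1").decode("unicode_escape")
    # Step 2: latin-1 maps code points 0-255 one-to-one onto bytes 0-255,
    # which recovers the original UTF-8 byte sequence.
    raw = unescaped.encode("latin-1")
    # Step 3: decode those bytes as UTF-8.
    return raw.decode("utf-8")


# A raw string literal keeps the backslashes as ordinary characters,
# imitating text read from a file (assumption about the input shape).
sample = r"\x3Cspan\x3E\xE4\xB8\xAD\xE6\x96\x87\x3C/span\x3E"
print(decode_escaped_utf8(sample))  # <span>中文</span>
```

If the input mixes escapes with characters outside Latin-1, the first `encode("latin-1")` raises `UnicodeEncodeError`, so this sketch only covers the all-ASCII case.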

edited Feb 22 '21 at 10:03

answered Feb 22 '21 at 09:54

It can be converted into a byte string by specifying b in front of the string. There is no need to encode the string at first. – revmatcher Feb 22 '21 at 09:55
@revmatcher: OP says the data **were stored in a str object**, so data needs to be converted to bytes first. – Maurice Meyer Feb 23 '21 at 09:47
Got it, thanks. (I hadn't used stack overflow before. Sorry :-) – Sushi Bear Feb 25 '21 at 02:06

score 0 · Accepted Answer · answered Feb 22 '21 at 12:47

This can also be solved with the unquote function from urllib.parse, which converts %XX escapes into text.
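
The answer does not show how the `\xNN` data becomes `%XX`, so the following is only a guess at the idea: replace each literal `\x` with `%` and let `unquote` do the UTF-8 decoding. `decode_via_unquote` is a hypothetical name, and the sketch assumes the str contains literal backslashes.

```python
from urllib.parse import unquote


def decode_via_unquote(text: str) -> str:
    """Decode literal "\\xNN" escapes by rewriting them as percent-escapes."""
    # "\\x" in source is the two characters backslash + x.
    percent_encoded = text.replace("\\x", "%")
    # unquote() turns runs of %XX into bytes and decodes them as UTF-8.
    return unquote(percent_encoded, encoding="utf-8")


sample = r"\x3Cspan\x3E\xE4\xB8\xAD\xE6\x96\x87\x3C/span\x3E"
print(decode_via_unquote(sample))  # <span>中文</span>
```

One caveat: any genuine `%` followed by two hex digits in the text would also be decoded, so this is only safe when the input has no percent signs of its own.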

Sushi Bear

score -1 · Answer 4 · revmatcher · 2021-02-22T10:00:08.200

Here is a simple solution. Add the letter b in front of the string to convert it into a byte string and then directly decode it. This should work.

encoded_str = b'\x3Cspan\x3E\xE6\x88\x91\xE5\x8F\x91\xE7\x8E\xB0\xE6\x88\x91\xE5\xB1\x85\xE7\x84\xB6\xE6\xB2\xA1\xE6\x9C\x89\xE6\xB5\x8B\xE8\xAF\x95\xE8\xBF\x87\xE4\xB8\xAD\xE6\x96\x87\xEF\xBC\x8C\xE8\xBF\x99\xE4\xB8\xAA\xE5\xB0\xB1\xE5\xA4\xAA\xE7\xA6\xBB\xE8\xB0\xB1\xE4\xBA\x86\xE3\x80\x82\x3C/span\x3E'
print(encoded_str.decode('utf-8'))

This gives

<span>我发现我居然没有测试过中文，这个就太离谱了。</span>
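
The `b` prefix only applies to a literal typed into source code. For a value that is already a str at runtime, a rough equivalent is sketched below, assuming every character in the str is in the range U+0000 to U+00FF (that is, each character stands for one byte); the variable names are made up for illustration.

```python
# Without the b prefix, this literal is a str of 6 characters, not 6 bytes.
mojibake = "\xE4\xB8\xAD\xE6\x96\x87"
print(len(mojibake))  # 6

# latin-1 maps each code point 0-255 to the byte with the same value,
# so this recovers the bytes that the b"..." literal would have produced.
as_bytes = mojibake.encode("latin-1")

decoded = as_bytes.decode("utf-8")
print(decoded)  # 中文
```

If any character is above U+00FF, `encode("latin-1")` raises `UnicodeEncodeError`, which is a useful signal that the str is not in this shape.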

edited Feb 22 '21 at 10:00

answered Feb 22 '21 at 09:54

The problem is that the string is stored in an str object, therefore this method does not work. I have found the working method, but thanks anyways:-) – Sushi Bear Feb 23 '21 at 15:25
