I was trying to make an add-on for Anki that imports OPML notes from Mubu. The contents I needed were stored in a str object like the one below, and I was not able to decode them or convert them into bytes objects.

"\x3Cspan\x3E\xE6\x88\x91\xE5\x8F\x91\xE7\x8E\xB0\xE6\x88\x91\xE5\xB1\x85\xE7\x84\xB6\xE6\xB2\xA1\xE6\x9C\x89\xE6\xB5\x8B\xE8\xAF\x95\xE8\xBF\x87\xE4\xB8\xAD\xE6\x96\x87\xEF\xBC\x8C\xE8\xBF\x99\xE4\xB8\xAA\xE5\xB0\xB1\xE5\xA4\xAA\xE7\xA6\xBB\xE8\xB0\xB1\xE4\xBA\x86\xE3\x80\x82\x3C/span\x3E"

Previously, I was able to decode strings like this using the following method, but it does not handle UTF-8:

text = text.encode().decode("unicode_escape")

Is there a way to turn a str object whose literal content is UTF-8 byte escapes into a bytes object?

Sushi Bear
  • Does [this](https://stackoverflow.com/questions/21665709/python-convert-utf-8-string-to-byte-string) answer your question? – Ceres Feb 22 '21 at 09:46

4 Answers

In Python 3 this can be decoded as follows:

# put a b in front of the string to make it bytes
s = b"\x3Cspan\x3E\xE6\x88\x91\xE5\x8F\x91\xE7\x8E\xB0\xE6\x88\x91\xE5\xB1\x85\xE7\x84\xB6\xE6\xB2\xA1\xE6\x9C\x89\xE6\xB5\x8B\xE8\xAF\x95\xE8\xBF\x87\xE4\xB8\xAD\xE6\x96\x87\xEF\xBC\x8C\xE8\xBF\x99\xE4\xB8\xAA\xE5\xB0\xB1\xE5\xA4\xAA\xE7\xA6\xBB\xE8\xB0\xB1\xE4\xBA\x86\xE3\x80\x82\x3C/span\x3E"
import chardet
encoding = chardet.detect(s)
content = s.decode(encoding['encoding'])
print(content)

It decodes to

<span>我发现我居然没有测试过中文,这个就太离谱了。</span>
forgetso
  • The problem is that the string is stored in an str object, therefore this method does not work. I have found the working method, but thanks anyways:-) – Sushi Bear Feb 23 '21 at 15:25

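If your data is a Python str, you need to convert it to bytes first and then decode those bytes as UTF-8. The raw_unicode_escape codec maps each character below U+0100 to the byte with the same value and leaves any existing backslashes unescaped: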

>>> variable = "\x3Cspan\x3E\xE6\x88\x91\xE5\x8F\x91\xE7\x8E\xB0\xE6\x88\x91\xE5\xB1\x85\xE7\x84\xB6\xE6\xB2\xA1\xE6\x9C\x89\xE6\xB5\x8B\xE8\xAF\x95\xE8\xBF\x87\xE4\xB8\xAD\xE6\x96\x87\xEF\xBC\x8C\xE8\xBF\x99\xE4\xB8\xAA\xE5\xB0\xB1\xE5\xA4\xAA\xE7\xA6\xBB\xE8\xB0\xB1\xE4\xBA\x86\xE3\x80\x82\x3C/span\x3E"
>>> y = variable.encode('raw_unicode_escape')
>>> print(y.decode('utf-8'))
<span>我发现我居然没有测试过中文,这个就太离谱了。</span>
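
Which conversion applies depends on what the str actually holds. The following is a minimal sketch of the two likely cases, not a definitive implementation. The helper names `fix_mojibake` and `decode_literal_escapes` are hypothetical and not part of any library. The second helper assumes the text is pure ASCII apart from the escapes.

```python
def fix_mojibake(text: str) -> str:
    """Case 1: every character is a code point below 256 standing for one UTF-8 byte."""
    # latin-1 maps code points 0-255 to bytes with the same values
    return text.encode("latin-1").decode("utf-8")


def decode_literal_escapes(text: str) -> str:
    """Case 2: the text holds a literal backslash, 'x', and two hex digits per byte."""
    # unicode_escape turns each \xNN sequence into the character U+00NN,
    # which leaves the same mojibake as case 1
    unescaped = text.encode("ascii").decode("unicode_escape")
    return fix_mojibake(unescaped)


# A str literal: Python has already turned \xE6 into the character U+00E6
print(fix_mojibake("\x3Cspan\x3E\xE6\x88\x91\x3C/span\x3E"))

# A raw literal: the backslashes are real characters, as when read from a file
print(decode_literal_escapes(r"\x3Cspan\x3E\xE6\x88\x91\x3C/span\x3E"))
```

The second case also suggests why `text.encode().decode("unicode_escape")` on its own appears not to support UTF-8: it stops after producing one character per byte, so a final `encode("latin-1").decode("utf-8")` step is still needed.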
Maurice Meyer

This can also be solved with the unquote function from urllib.parse, which converts %XX escapes into text.
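
A minimal sketch of how that might look, assuming the str contains literal `\xNN` sequences that are first rewritten as `%NN`. The helper name `decode_with_unquote` is hypothetical, and the replacement step is an assumption rather than something stated in this answer.

```python
from urllib.parse import unquote


def decode_with_unquote(text: str) -> str:
    # Rewrite each literal backslash-x prefix as a percent sign,
    # so \xE6\x88\x91 becomes %E6%88%91
    percent_encoded = text.replace("\\x", "%")
    # unquote decodes %XX sequences as UTF-8 by default
    return unquote(percent_encoded)


print(decode_with_unquote(r"\x3Cspan\x3E\xE6\x88\x91\x3C/span\x3E"))
```

One caveat: a percent sign already present in the text and followed by two hex digits would also be decoded, so this approach is only safe when the input contains no such sequences.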

Sushi Bear

Here is a simple solution. Add the letter b in front of the string literal to make it a bytes object, and then decode it directly as UTF-8.

encoded_str = b'\x3Cspan\x3E\xE6\x88\x91\xE5\x8F\x91\xE7\x8E\xB0\xE6\x88\x91\xE5\xB1\x85\xE7\x84\xB6\xE6\xB2\xA1\xE6\x9C\x89\xE6\xB5\x8B\xE8\xAF\x95\xE8\xBF\x87\xE4\xB8\xAD\xE6\x96\x87\xEF\xBC\x8C\xE8\xBF\x99\xE4\xB8\xAA\xE5\xB0\xB1\xE5\xA4\xAA\xE7\xA6\xBB\xE8\xB0\xB1\xE4\xBA\x86\xE3\x80\x82\x3C/span\x3E'
print(encoded_str.decode('utf-8'))

This gives

<span>我发现我居然没有测试过中文,这个就太离谱了。</span>
revmatcher
  • The problem is that the string is stored in an str object, therefore this method does not work. I have found the working method, but thanks anyways:-) – Sushi Bear Feb 23 '21 at 15:25