Since you were kind enough to mention your actual problem in a comment, I have updated my answer once more to address it directly. The original answer is kept below.
It's the string I post to the GitHub Markdown API. This is the only way that Unicode character can be accepted. I got the rendered HTML with the original character dada大大
The GitHub Markdown API requires you to send the data as JSON. JSON borrows its string escaping from JavaScript, so this character would be written as \u5927. When you use the json module, however, you don’t need to worry about that at all:
from urllib.request import urlopen
import json
text = 'dada大大'
# Passing a data argument makes urlopen send a POST request
data = json.dumps({ 'mode': 'markdown', 'text': text }).encode()
r = urlopen('https://api.github.com/markdown', data)
print(r.read().decode()) # <p>dada大大</p>
As you can see, the API accepts the encoded text without problems and produces the correct output, without you having to worry about the encoding.
Or, when using the raw API with the requests library:
import requests
h = { 'Content-Type': 'text/plain' }
# The raw endpoint takes the Markdown text itself as the request body
r = requests.post('https://api.github.com/markdown/raw', data=text.encode('utf-8'), headers=h)
print(r.content.decode()) # <p>dada大大</p>
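To see what the json module does with the character without making a network request, here is a small offline sketch. It only uses the standard library; the variable names payload and raw_payload are just for illustration.

```python
import json

text = 'dada大大'

# By default json.dumps escapes non-ASCII characters as \uXXXX sequences.
payload = json.dumps({'mode': 'markdown', 'text': text})
print(payload)  # {"mode": "markdown", "text": "dada\u5927\u5927"}

# Parsing the payload again restores the original characters unchanged.
assert json.loads(payload)['text'] == text

# With ensure_ascii=False the characters are kept as they are.
raw_payload = json.dumps({'mode': 'markdown', 'text': text}, ensure_ascii=False)
print(raw_payload)  # {"mode": "markdown", "text": "dada大大"}
```

Both forms are valid JSON and decode to the same string, so the escaping is purely a detail of the transport format.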
Original answer
>>> a = 'dada大大'.encode('utf-8')
>>> a
b'dada\xe5\xa4\xa7\xe5\xa4\xa7'
>>> str(a)
"b'dada\\xe5\\xa4\\xa7\\xe5\\xa4\\xa7'"
>>> str(a)[2:-1]
'dada\\xe5\\xa4\\xa7\\xe5\\xa4\\xa7'
>>> print(_)
dada\xe5\xa4\xa7\xe5\xa4\xa7
When you just do str(a), you will get the string representation of the bytes string. Of course, when you just use it like that in the interpreter, the interpreter will actually call repr on it to display it. And a string that contains backslashes will have them escaped as \\. That’s where those came from.
And finally, you have to strip off the leading b' and the trailing ' to get just the content of the string representation of the bytes string.
Side note: str() and repr() will produce the same result when used on bytes objects.
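As a quick check of that side note, the following sketch compares the two calls on the same bytes object and repeats the slicing from above as a script rather than a REPL session.

```python
a = 'dada大大'.encode('utf-8')

# For bytes objects, str() and repr() give the same text.
assert str(a) == repr(a)

# Slicing off the leading b' and the trailing ' leaves the escaped content.
s = str(a)[2:-1]
print(s)  # dada\xe5\xa4\xa7\xe5\xa4\xa7
```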
According to Poke's answer, what I need is preventing autoescaping of repr.
No, you don’t. There are no double backslashes in the final string. They only appear because, when you enter something in your REPL, it prints the return value to the console after calling repr on it. But that does not mean that the actual string suddenly changed:
>>> s = str(a)[2:-1]
>>> len(s)
28
>>> list(s)
['d', 'a', 'd', 'a', '\\', 'x', 'e', '5', '\\', 'x', 'a', '4', '\\', 'x', 'a', '7', '\\', 'x', 'e', '5', '\\', 'x', 'a', '4', '\\', 'x', 'a', '7']
As you can see, there are no double backslashes in the string. Yes, you can see them again, but that’s again only because the return value of list(s) is being printed by the REPL. Each item of the list is a single character though, including the backslashes. They are just escaped again because '\' wouldn’t be a valid string literal:
>>> '\'
SyntaxError: EOL while scanning string literal
>>> '\\'
'\\'
>>> len('\\')
1
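If you want to convince yourself outside the REPL, a short script along these lines shows the difference between what repr displays and what the string actually contains. It reuses the names a and s from the sessions above.

```python
a = 'dada大大'.encode('utf-8')
s = str(a)[2:-1]

# repr() gives the escaped form the REPL echoes; print() shows the real characters.
print(repr(s))  # 'dada\\xe5\\xa4\\xa7\\xe5\\xa4\\xa7'
print(s)        # dada\xe5\xa4\xa7\xe5\xa4\xa7

# Six single backslashes, one per escaped byte, in a 28-character string.
assert s.count('\\') == 6
assert len(s) == 28
```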