0

There is a similar question, but its solution doesn't seem to work.

Say I've encoded a string:

>>> a = 'dada大大'.encode('utf-8')
>>> type(a)
<class 'bytes'>
>>> a
b'dada\xe5\xa4\xa7\xe5\xa4\xa7'

What I want is something like this:

dada\xe5\xa4\xa7\xe5\xa4\xa7

str(a) doesn't work:

>>> str(a)
"b'dada\\xe5\\xa4\\xa7\\xe5\\xa4\\xa7'"

I've tried redirecting stdout to a variable, but I still got "b'dada\\xe5\\xa4\\xa7\\xe5\\xa4\\xa7'".

I can handle this with a regular expression and get what I want, but I'm looking for a more Pythonic way to do it. Any suggestions?

laike9m
  • "the solution doesn't seem to work" - be more specific! Do you get errors (if so, provide full traceback)? Odd outputs (provide inputs, expected outputs, actual outputs)? – jonrsharpe Jan 09 '14 at 14:42
  • What output are you looking for? The string representation looks like what you want (of course you are seeing the doubled \\ because backslashes need to be escaped)... – Corley Brigman Jan 09 '14 at 14:42
  • Is there any particular reason you want to strip off the `b` and the single quotes? `str(a)` has pretty close to what you want. You're seeing the `repr` of the string, which has extra escaping. If you `print(str(a))`, you'll see the string's contents rather than an expression that evaluates to the string. – user2357112 Jan 09 '14 at 14:43
  • @user2357112 Yes, indeed. I just want the string literal. – laike9m Jan 09 '14 at 14:47
  • Would simply `repr(a)[2:-1]` work, that is, the string representation but without the `b''`? – RemcoGerlich Jan 09 '14 at 14:50
  • Wild guess: do you maybe want the unicode codepoints which match the bytes value (which would be latin1 decoding, because the first 255 unicode codepoints are equal to latin1)? This means, given a bytes object like ``b'\x01foo\x61'``, you want a string with ``"\x01fooa"``? – Jonas Schäfer Jan 09 '14 at 14:57
  • @RemcoGerlich The goal is to get rid of `b''` as well as to convert \\ to \ – laike9m Jan 09 '14 at 15:11

3 Answers

5

Since you were kind enough to mention your actual problem in a comment, I’ll update my answer once more to address it. The original answer can be seen below.

It's the string I post to Github Markdown API. This is the only way that unicode character can be accepted. I got the rendered html with the original character dada大大

The GitHub Markdown API requires you to send the data as JSON. JSON itself borrows the string escaping from JavaScript, which would be \u5927 for this character. When using the json module however, you don’t need to worry about that at all:

from urllib.request import urlopen  # Python 3 location of urlopen
import json

text = 'dada大大'
# json.dumps takes care of escaping the non-ASCII characters
data = json.dumps({'mode': 'markdown', 'text': text}).encode()
r = urlopen('https://api.github.com/markdown', data)

print(r.read().decode()) # <p>dada大大</p>

As you can see, the API accepts the encoded text and produces the correct output, without you having to worry about the encoding.

Or when using the raw API with the requests library:

import requests

# `text` is the same string as in the previous example
h = {'Content-Type': 'text/plain'}
r = requests.post('https://api.github.com/markdown/raw', text.encode(), headers=h)

print(r.content.decode()) # <p>dada大大</p>
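
The JSON escaping described above can be checked offline, without calling the API. This is only a small sketch using the standard `json` module; the variable names `payload` and `body` are chosen here for illustration.

```python
import json

text = 'dada大大'

# By default json.dumps escapes non-ASCII characters as \uXXXX.
payload = json.dumps({'mode': 'markdown', 'text': text})
print(payload)  # {"mode": "markdown", "text": "dada\u5927\u5927"}

# The escaped payload is pure ASCII, so encoding it to bytes loses nothing.
body = payload.encode()

# Parsing the JSON again gives back the original text.
print(json.loads(body.decode())['text'] == text)  # True
```

So the `\u5927` form only exists on the wire; neither side has to build or strip it by hand.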

Original answer

>>> a = 'dada大大'.encode('utf-8')
>>> a
b'dada\xe5\xa4\xa7\xe5\xa4\xa7'
>>> str(a)
"b'dada\\xe5\\xa4\\xa7\\xe5\\xa4\\xa7'"
>>> str(a)[2:-1]
'dada\\xe5\\xa4\\xa7\\xe5\\xa4\\xa7'
>>> print(_)
dada\xe5\xa4\xa7\xe5\xa4\xa7

When you just do str(a) you will get the string representation of the bytes string. Of course, when you just use it like that in the interpreter, the interpreter will actually call repr on it to display it. And a string that contains backslashes will have them escaped as \\. That’s where those came from.

And finally, you have to strip off the b' and the trailing ' to get just the content of the string representation of the bytes string.

Side note: str() and repr() will produce the same result when used on bytes objects.
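
As a quick check of the side note and of the slicing step, here is a minimal sketch. The helper name `escaped_text` is made up for this example and is not part of any library.

```python
a = 'dada大大'.encode('utf-8')

def escaped_text(data: bytes) -> str:
    # Hypothetical helper: the repr of a bytes object without
    # the leading b' and the trailing quote.
    return repr(data)[2:-1]

s = escaped_text(a)
print(str(a) == repr(a))  # True
print(s)                  # dada\xe5\xa4\xa7\xe5\xa4\xa7
print(len(s))             # 28
```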


According to Poke's answer, what I need is preventing autoescaping of repr.

No, you don’t. There are no double backslashes in the final string. They only appear because the REPL calls repr on every return value before printing it to the console. That does not mean that the actual string suddenly changed:

>>> s = str(a)[2:-1]
>>> len(s)
28
>>> list(s)
['d', 'a', 'd', 'a', '\\', 'x', 'e', '5', '\\', 'x', 'a', '4', '\\', 'x', 'a', '7', '\\', 'x', 'e', '5', '\\', 'x', 'a', '4', '\\', 'x', 'a', '7']

As you can see, there are no double backslashes in the string. Yes, you can see them again, but that’s again only because the return value of list(s) is being printed by the REPL. Each item of the list is a single character though, including the backslashes. They are just escaped again because '\' wouldn’t be a valid string.

>>> '\'
SyntaxError: EOL while scanning string literal
>>> '\\'
'\\'
>>> len('\\')
1
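
One way to see the difference between the string itself and its repr is to count the backslashes in each. This is just an illustrative sketch:

```python
s = str('dada大大'.encode('utf-8'))[2:-1]

print(s.count('\\'))        # 6: one real backslash per escaped byte
print(repr(s).count('\\'))  # 12: repr doubles each backslash for display
print(s)                    # dada\xe5\xa4\xa7\xe5\xa4\xa7
```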
poke
  • @laike9m As I explained, there is no auto escaping. `str(a)[2:-1]` is a string with only single backslashes, not double backslashes. But when the *interpreter* outputs the return value it automatically calls `repr` on it, which creates a *string* representation to be used as if you wrote that string in Python itself. And then you would have to escape those backslashes, which is why you see them. But as you can see, when you print the string or output it *anywhere*, there are no escaped backslashes. – poke Jan 09 '14 at 15:18
4

bytes is really an array of integers:

>>> a = 'dada大大'.encode() # 'utf-8' by default
>>> list(a)
[100, 97, 100, 97, 229, 164, 167, 229, 164, 167]

You can get the hex values of each of these using

>>> list(map(hex, a))
['0x64', '0x61', '0x64', '0x61', '0xe5', '0xa4', '0xa7', '0xe5', '0xa4', '0xa7']

And therefore

>>> list(chr(x) if x < 128 else hex(x) for x in a)
['d', 'a', 'd', 'a', '0xe5', '0xa4', '0xa7', '0xe5', '0xa4', '0xa7']

>>> print("".join(chr(x) if x < 128 else hex(x).replace("0x", "\\x") for x in a))
dada\xe5\xa4\xa7\xe5\xa4\xa7
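
A variant of the same idea formats every byte with exactly two hex digits, so bytes whose hex form contains a 0 (such as 0xa0) or needs padding (such as 0x00) also come out right. The function name `escape_bytes` is hypothetical, and unlike `repr` this sketch does not escape a literal backslash in the input.

```python
def escape_bytes(data: bytes) -> str:
    # Hypothetical helper: keep printable ASCII as-is and write
    # every other byte as a backslash, 'x', and two hex digits.
    return "".join(
        chr(x) if 32 <= x < 127 else "\\x{:02x}".format(x)
        for x in data
    )

a = 'dada大大'.encode()
print(escape_bytes(a))            # dada\xe5\xa4\xa7\xe5\xa4\xa7
print(escape_bytes(b'\xa0\x00'))  # \xa0\x00
```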
jonrsharpe
  • Why go the overhead of converting the integers manually to hex, when calling `str()` on a bytes object already does all that? – poke Jan 09 '14 at 15:51
0

OK, I finally found the solution. It comes from the question "Python Replace \\ with \":

a = 'dada大大'.encode('utf-8')
b = str(a)[2:-1].encode('utf-8').decode('unicode_escape')

Maybe I should have explained what I want more clearly.
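
To make clear what this round trip actually produces, here is a small check. For this sample it gives the same string as decoding the bytes as Latin-1, that is, one character per byte; this is an observation about this input, not a general guarantee.

```python
a = 'dada大大'.encode('utf-8')
b = str(a)[2:-1].encode('utf-8').decode('unicode_escape')

# Each \xNN escape becomes the code point U+00NN, which is what
# a Latin-1 decode of the original bytes yields as well.
print(b == a.decode('latin-1'))  # True
print(b.encode('latin-1') == a)  # True
print(len(b))                    # 10
```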

Edit - My test result

>>> import requests
>>> text = 'dada大大'
>>> h = {'Content-Type': 'text/plain'}
>>> r = requests.post('https://api.github.com/markdown/raw', text.encode(), headers=h)
>>> print(r.content.decode())
{"message":"Invalid request media type (expecting 'text/plain')","documentation_url":"http://developer.github.com/v3/markdown/#render-a-markdown-document-in-raw-mode"}
>>> print(r.content.decode('utf-8'))
{"message":"Invalid request media type (expecting 'text/plain')","documentation_url":"http://developer.github.com/v3/markdown/#render-a-markdown-document-in-raw-mode"}
>>> r = requests.post('https://api.github.com/markdown/raw', text.encode('utf-8'), headers=h)
>>> print(r.content.decode('utf-8'))
{"message":"Invalid request media type (expecting 'text/plain')","documentation_url":"http://developer.github.com/v3/markdown/#render-a-markdown-document-in-raw-mode"}
laike9m
  • That makes no sense. Did you actually look at the result? – poke Jan 09 '14 at 16:42
  • `dada大大`? How does that even remotely look like the `dada\xe5\xa4\xa7\xe5\xa4\xa7` you mentioned in the question as what you wanted? – poke Jan 09 '14 at 16:49
  • @poke it's the string I post to Github Markdown API. This is the only way that unicode character can be accepted. I got the rendered html with the original character `dada大大`. – laike9m Jan 09 '14 at 16:54
  • You really should have mentioned that in your question. Because it has nothing to do with what we all tried here. See my updated answer. – poke Jan 09 '14 at 17:08
  • @poke I don't want to send json. `data = str(self.body)[2:-1].encode('utf-8').decode('unicode_escape'); headers = {'Content-Type': 'text/plain'}; r = requests.post('https://api.github.com/markdown/raw', headers=headers, data=data)` is enough – laike9m Jan 09 '14 at 17:13
  • @poke That's interesting. I had tried posting encoded text before asking the question. I've tested it again, see my edit. – laike9m Jan 10 '14 at 14:15
  • That’s very odd. It works for me exactly like that when I copy and paste your code. I’m using Python 3.3.3 and requests 2.1.0. – poke Jan 10 '14 at 16:00
  • @poke OK everything solved. I was using requests 1.2.3 and after an upgrade to 2.2.0 it works as expected. – laike9m Jan 10 '14 at 16:06
  • Don’t worry about it, I’m just happy it’s over xD Just a tip for next time: mention what you’re actually trying to do in the question, to avoid asking about an [XY problem](http://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) :) – poke Jan 10 '14 at 17:30