How to encode Chinese characters as 'gbk' in JSON, to format a URL request parameter string?

Question

I want to dump a dict as a JSON string that contains some Chinese characters, and use it to build a URL request parameter.

Here is my Python code:

import httplib
import simplejson as json
import urllib

d={
  "key":"上海",
  "num":1
}

jsonStr = json.dumps(d,encoding='gbk')
url_encode=urllib.quote_plus(jsonStr)

conn = httplib.HTTPConnection("localhost",port=8885)
conn.request("GET","/?json="+url_encode)
res = conn.getresponse()

The request string I expected is this:

GET /?json=%7B%22num%22%3A+1%2C+%22key%22%3A+%22%C9%CF%BA%A3%22%7D
                                                ------------
                                                     |
                                                     V
                       "%C9%CF%BA%A3" represents "上海" encoded as 'gbk' in the URL.

But what I got is this:

GET /?json=%7B%22num%22%3A+1%2C+%22key%22%3A+%22%5Cu6d93%5Cu5a43%5Cu6363%22%7D
                                                ------------------------
                                                         |
                                                         v
           %5Cu6d93%5Cu5a43%5Cu6363  is 'some' encoding of the Chinese characters "上海"

I also tried dumping the JSON with the ensure_ascii=False option:

jsonStr = json.dumps(d,ensure_ascii=False,encoding='gbk')

but had no luck.

So, how can I make this work? Thanks.

score 2 · Answer 1 · edited May 23 '17 at 11:58

You almost got it with ensure_ascii=False. This works:

jsonStr = json.dumps(d, encoding='gbk', ensure_ascii=False).encode('gbk')

You need to tell json.dumps() that the strings it will read are GBK, and that it should not try to ASCII-fy them. Then you must re-specify the output encoding, because json.dumps() has no separate option for that.

This solution is similar to another answer here: https://stackoverflow.com/a/18337754/4323

So this does what you want, though I should note that the standard for URIs seems to say that they should be in UTF-8 whenever possible. For more on this, see here: https://stackoverflow.com/a/14001296/4323
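For readers on Python 3, here is a rough sketch of the same idea. It assumes Python 3, where str is already Unicode, json.dumps() has no encoding parameter, and urllib.parse.quote_plus() takes the byte encoding directly; sort_keys=True is only there to make the output order predictable.

```python
import json
from urllib.parse import quote_plus

d = {"key": "上海", "num": 1}

# ensure_ascii=False keeps the Chinese characters literal instead of \uXXXX escapes.
json_text = json.dumps(d, ensure_ascii=False, sort_keys=True)

# quote_plus() encodes the text with the given codec, then percent-escapes the bytes.
gbk_param = quote_plus(json_text, encoding="gbk")
utf8_param = quote_plus(json_text, encoding="utf-8")

print(gbk_param)   # the "上海" part is %C9%CF%BA%A3
print(utf8_param)  # the "上海" part is %E4%B8%8A%E6%B5%B7
```

Only the codec passed to quote_plus() changes between the GBK and UTF-8 variants, so switching to UTF-8 later should be a one-word change.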

thanks john, for explaining all the stuff clearly. really helpful. — armnotstrong, Oct 09 '14 at 07:04

score 2 · Answer 2 · answered Oct 09 '14 at 08:04


"key":"上海",

You saved your source code as UTF-8, so this is the byte string '\xe4\xb8\x8a\xe6\xb5\xb7'.

jsonStr = json.dumps(d,encoding='gbk')

The JSON format supports only Unicode strings. The encoding parameter can be used to force json.dumps into allowing byte strings, automatically decoding them to Unicode using the given encoding.

However, the byte string's encoding is actually UTF-8 not 'gbk', so json.dumps decodes incorrectly, giving u'涓婃捣'. It then produces the incorrect JSON output "\u6d93\u5a43\u6363", which gets URL-encoded to %22%5Cu6d93%5Cu5a43%5Cu6363%22.
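The mis-decoding can be reproduced in isolation. This is a small sketch assuming Python 3, where the byte string has to be written out explicitly as a bytes literal:

```python
import json

# The UTF-8 bytes that the source file actually contains for "上海".
utf8_bytes = b"\xe4\xb8\x8a\xe6\xb5\xb7"

# Decoding them with the wrong codec yields three unrelated characters.
wrong_text = utf8_bytes.decode("gbk")
# Decoding with the codec they were written in recovers the original text.
right_text = utf8_bytes.decode("utf-8")

print(json.dumps(wrong_text))  # "\u6d93\u5a43\u6363"
print(json.dumps(right_text))  # "\u4e0a\u6d77"
```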

To fix this you should make the input to json.dumps a proper Unicode (u'') string:

# coding: utf-8

d = {
    "key": u"上海",  # or u'\u4e0a\u6d77' if you don't want to rely on the coding decl
    "num":1
}
jsonStr = json.dumps(d)
...

This will get you JSON "\u4e0a\u6d77", encoding to URL %22%5Cu4e0a%5Cu6d77%22.
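In Python 3 terms (an assumption, since the code above is Python 2), every str is already a Unicode string, so the equivalent could look roughly like this:

```python
import json
from urllib.parse import quote_plus

d = {"key": "上海", "num": 1}

# Default ensure_ascii=True: non-ASCII characters become \uXXXX escapes,
# so the JSON text is pure ASCII and no byte encoding has to be chosen.
json_text = json.dumps(d, sort_keys=True)
param = quote_plus(json_text)

print(json_text)  # {"key": "\u4e0a\u6d77", "num": 1}
print(param)      # contains %22%5Cu4e0a%5Cu6d77%22
```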

If you really don't want the \u escapes in your JSON, you can indeed use ensure_ascii=False and then .encode() the output before URL-encoding. But I wouldn't recommend it, as you would then have to worry about which encoding the target application expects in its URL parameters, which is a source of some pain. The \u version is accepted by all JSON parsers, and is not typically much longer once URL-encoded.
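To illustrate the receiving side, here is a hedged Python 3 sketch. The "receiver" is just a pair of unquote_plus() calls standing in for whatever the target application really does, which is an assumption:

```python
import json
from urllib.parse import quote_plus, unquote_plus

payload = {"key": "上海", "num": 1}

# Variant A: \u escapes, pure ASCII on the wire.
escaped = quote_plus(json.dumps(payload))
# Variant B: raw characters, percent-encoded as GBK bytes.
raw_gbk = quote_plus(json.dumps(payload, ensure_ascii=False), encoding="gbk")

# Variant A parses correctly whichever codec the receiver assumes.
from_escaped_utf8 = json.loads(unquote_plus(escaped, encoding="utf-8"))
from_escaped_gbk = json.loads(unquote_plus(escaped, encoding="gbk"))

# Variant B only works if the receiver guesses GBK.
right_guess = json.loads(unquote_plus(raw_gbk, encoding="gbk"))
wrong_guess = unquote_plus(raw_gbk, encoding="utf-8", errors="replace")

print(from_escaped_utf8 == from_escaped_gbk == right_guess == payload)
print("上海" in wrong_guess)
```

The escaped variant survives either guess because the percent-decoded text contains only ASCII, which GBK and UTF-8 decode identically.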

answered Oct 09 '14 at 08:04

bobince


Thanks for pointing out my misunderstanding of the `encoding` parameter of `json.dumps()`. But something still confuses me. If I set `key=u'上海'`, I get `key` as unicode, right? So when I type `key` in the console I get `u'\u4e0a\u6d77'`, but `key.encode('utf8')` produces `'\xe4\xb8\x8a\xe6\xb5\xb7'`. I know Python uses unicode as the default encoding; are unicode and utf8 the same, or is utf8 just another encoding format? What is the relationship between `u'\u4e0a\u6d77'` and `'\xe4\xb8\x8a\xe6\xb5\xb7'`, and how is the data for '上海' actually stored in Python's memory? Really confused. – armnotstrong Oct 09 '14 at 16:24
Unicode strings are a sequence of characters, numbered 0x000000–0x10FFFF. Byte strings are a sequence of bytes, numbered 0x00–0xFF. There are many encodings that map some or all of the characters onto one or a sequence of bytes. Many but not all encodings are ASCII-compatible in that each of the bytes 0x00-0x7F maps directly to the Unicode character with the same number. An encoding that maps all of the characters to byte sequences is called a UTF. UTF-8 is generally considered preferable as it is the most compact ASCII-compatible UTF. – bobince Oct 10 '14 at 08:33
Unicode is not an encoding, but Microsoft confuses people by calling the UTF-16LE encoding “Unicode” in the interfaces of apps like Notepad. UTF-16LE is not ASCII-compatible and is generally problematic as an exchange format, but many systems use it as an internal storage format, including Python 2.x under Windows. In other environments Python may use other encodings. From Python 3.3 on, on all platforms, it switches between multiple encodings. As a Python script author you generally don't need to care what the storage format behind the scenes is. – bobince Oct 10 '14 at 08:39
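The distinction between characters and bytes in these comments can be seen directly. This is a minimal sketch assuming Python 3, where str is the Unicode string type and bytes is the byte string type:

```python
text = "\u4e0a\u6d77"  # the two characters of "上海", written as code points

# A Unicode string is a sequence of numbered characters.
code_points = [hex(ord(ch)) for ch in text]

# Each encoding maps those same characters to a different byte sequence.
as_utf8 = text.encode("utf-8")
as_gbk = text.encode("gbk")
as_utf16le = text.encode("utf-16-le")

print(code_points)  # ['0x4e0a', '0x6d77']
print(as_utf8)      # b'\xe4\xb8\x8a\xe6\xb5\xb7'
print(as_gbk)       # b'\xc9\xcf\xba\xa3'

# ASCII compatibility: "A" is the single byte 0x41 in UTF-8 and GBK, but not in UTF-16LE.
print("A".encode("utf-8"), "A".encode("gbk"), "A".encode("utf-16-le"))
```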
