decode json encoded as GB2312

Question

Via a GET request, I pull json from the Google geocode API:

import urllib, urllib2

url = "http://maps.googleapis.com/maps/api/geocode/json"
params = {'address': 'ivory coast', 'sensor': 'false'}
request = urllib2.Request(url + "?" + urllib.urlencode(params))
response = urllib2.urlopen(request)
st = response.read()

What comes out looks like:

{
   "results" : [
      {
         "address_components" : [
            {
               "long_name" : "CÃ´te d'Ivoire",
               "short_name" : "CI",
               "types" : [ "country", "political" ]
            }
         ],
         "formatted_address" : "CÃ´te d'Ivoire",
         "geometry" : { ... # rest snipped

As you see, the country name has some encoding issues. I tried to guess the encoding like this:

import chardet
encoding = chardet.detect(st)
print "String is encoded in {0} (with {1}% confidence).".format(encoding['encoding'], encoding['confidence']*100)

Which returns:

String is encoded in GB2312 (with 99.0% confidence).

What I would like to know is how I can convert this into a dictionary with an encoding where the ô (o with circumflex) is properly displayed.

I tried:

st = st.decode(encoding['encoding']).encode('utf-8')

But then I get:

{
   "results" : [
      {
         "address_components" : [
            {
               "long_name" : "Cä¹ˆte d'Ivoire",
               "short_name" : "CI",
               "types" : [ "country", "political" ]
            }
         ],
         "formatted_address" : "Cä¹ˆte d'Ivoire",
         "geometry" : { ... # rest snipped

I don't think the charset is GB2312; `chardet` may have returned a wrong result. – lucemia Dec 19 '12 at 17:58

score 3 · Accepted Answer · answered Dec 19 '12 at 18:08

The Google API results are always encoded in UTF-8; you can even confirm this yourself from the Content-Type header of the HTTP response:

[screenshot of the response headers]
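Since the screenshot is not reproduced here, the same check can be sketched in code. This is a minimal illustration, not the answerer's method: `charset_from_content_type` is a hypothetical helper, and the header values in the usage lines are examples rather than output captured from the API.

```python
def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset parameter of a Content-Type header value.

    Falls back to `default` when the header carries no charset parameter.
    """
    # Parameters follow the media type and are separated by semicolons.
    for part in content_type.split(";")[1:]:
        name, _, value = part.strip().partition("=")
        if name.strip().lower() == "charset" and value:
            # Values may be quoted and are case-insensitive.
            return value.strip().strip('"').lower()
    return default


# Example header values (assumed, not captured from a real response).
with_charset = charset_from_content_type("application/json; charset=UTF-8")
without_charset = charset_from_content_type("application/json")
```

With the `urllib2` response from the question, the header value would come from `response.info().getheader('Content-Type')`; Python 2 also offers `response.info().getparam('charset')`, and Python 3 offers `response.headers.get_content_charset()`, which make a hand-written parser unnecessary.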

Esailija

This is because [RFC 4627](http://www.ietf.org/rfc/rfc4627.txt) says "3. Encoding : JSON text SHALL be encoded in Unicode. The default encoding is UTF-8." – Mike Samuel Dec 19 '12 at 18:17
@MikeSamuel but it can also be UTF-16 or UTF-32. – Esailija Dec 19 '12 at 18:18
OK, so `chardet` is wrong. But 1) `u"Côte d'Ivoire".encode('UTF-8')` returns `"C\xc3\xb4te d'Ivoire"`, so why do I get `"CÃ´te d'Ivoire"`? 2) My main question remains: how do I turn the JSON I received into something that properly displays the `ô`? – BioGeek Dec 19 '12 at 18:19
@BioGeek `string = rawresponse.decode("utf-8")` Then, json decode the `string` variable. – Esailija Dec 19 '12 at 18:20
@Esailija, sure, or any other encoding, but the reason Google chose that is that UTF-8 is the default specified by the standard, and the internally mandated encoding for text in protocol buffers and strings in C++. Any service that produces JSON in an encoding other than UTF-8 is choosing to do something non-default. – Mike Samuel Dec 19 '12 at 18:25
@BioGeek: It could be that the bit that uploaded or stored the data has misencoded it, and your data needs to be scrubbed. – Ignacio Vazquez-Abrams Dec 19 '12 at 18:34
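
On BioGeek's first point, a likely explanation (an assumption; the thread does not confirm it) is that the UTF-8 bytes were displayed by something that interpreted them as Latin-1 or cp1252. The sketch below reproduces both garbled strings from the question using only the byte values quoted above; the variable names are illustrative.

```python
# UTF-8 bytes for "Cote d'Ivoire" with o-circumflex, as quoted in the comment above.
raw = b"C\xc3\xb4te d'Ivoire"

# Correct interpretation: the two bytes C3 B4 form one character, U+00F4.
as_utf8 = raw.decode("utf-8")

# Wrong interpretation 1: each byte read as its own Latin-1 character,
# giving A-tilde (U+00C3) followed by an acute accent (U+00B4).
as_latin1 = raw.decode("latin-1")

# Wrong interpretation 2: following chardet's guess, C3 B4 is read as a
# single GB2312 character (U+4E48).
via_gb2312 = raw.decode("gb2312")

# Re-encoding that character to UTF-8 yields three bytes (E4 B9 88), which a
# cp1252 display shows as three separate characters.
shown_after_reencode = via_gb2312.encode("utf-8").decode("cp1252")
```

Under that assumption, `as_latin1` matches the first output in the question and `shown_after_reencode` matches the output after the GB2312 round trip, so the bytes received were valid UTF-8 all along.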

score 2 · Answer 2 · answered Dec 19 '12 at 18:14

Once you've (properly) decoded it, don't re-encode it; json can work with unicode perfectly well.

>>> import json
>>> json.loads(u"[\"C\xf4te d'Ivoire\"]")
[u"C\xf4te d'Ivoire"]
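
Putting this together with the decode step suggested in the comments on the other answer, the whole path from raw bytes to a dictionary might look like the sketch below. The `raw` value is a hypothetical, trimmed-down stand-in for what `response.read()` returns, not actual API output.

```python
import json

# Hypothetical, trimmed-down response body: UTF-8 bytes, as read() would return.
raw = b'{"results": [{"formatted_address": "C\xc3\xb4te d\'Ivoire"}]}'

# Decode the bytes once, using the charset the server declares (UTF-8).
text = raw.decode("utf-8")

# Parse the unicode text; string values in the result stay unicode.
data = json.loads(text)
name = data["results"][0]["formatted_address"]
```

`name` is now a unicode string containing a single `ô` character. Encoding it back to bytes is only needed at an output boundary, such as writing to a file.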

Ignacio Vazquez-Abrams
