1

As an exercise I built a little script that queries the Google Suggest JSON API. The code is quite simple:

import json
import urllib

query = 'a'
url = "http://clients1.google.co.jp/complete/search?hl=ja&q=%s&json=t" % query
response = urllib.urlopen(url)
result = json.load(response)  # this call raises the error below

UnicodeDecodeError: 'utf8' codec can't decode byte 0x83 in position 0: invalid start byte

If I read() the response object instead, this is what I get:

'["a",["amazon","ana","au","apple","adobe","alc","\x83A\x83}\x83]\x83\x93","\x83A\x83\x81\x83u\x83\x8d","\x83A\x83X\x83N\x83\x8b","\x83A\x83\x8b\x83N"],["","","","","","","","","",""]]'

So it seems that the error is raised when Python tries to decode the string. This only happens with google.co.jp and the Japanese language. I tried the same code with different countries/languages and I do not get the same issue: when I deserialize the object everything works OK.

  • I checked the response headers and they always specify utf-8 as the response encoding.
  • I checked the JSON string with an online parser (http://json.parser.online.fr/) and again everything seems OK.

Any ideas on how to solve this problem? What makes the JSON load() function choke?

Thanks in advance.

raben

2 Answers

3

The response headers (print response.info()) contain the following information:

Content-Type: text/javascript; charset=Shift_JIS

Note the charset.

If you specify this encoding in json.load it will work:

result = json.load(response, encoding='shift_jis')
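On Python 3 the `encoding` argument to `json.load` is no longer available, so the equivalent step is to decode the bytes yourself and then parse the text. The following is a minimal sketch, assuming Python 3; `parse_suggest` is a hypothetical helper name, and `raw` is the payload quoted in the question rather than a live response.

```python
import json

# Payload copied from the question (bytes as returned by read()).
raw = (b'["a",["amazon","ana","au","apple","adobe","alc",'
       b'"\x83A\x83}\x83]\x83\x93","\x83A\x83\x81\x83u\x83\x8d",'
       b'"\x83A\x83X\x83N\x83\x8b","\x83A\x83\x8b\x83N"],'
       b'["","","","","","","","","",""]]')


def parse_suggest(raw_bytes, charset="shift_jis"):
    """Decode the raw bytes with the given charset, then parse the JSON text."""
    # Decoding must happen before parsing: Shift_JIS trail bytes can collide
    # with ASCII characters such as ']' and '}'.
    text = raw_bytes.decode(charset)
    return json.loads(text)


result = parse_suggest(raw)
print(result[1][6])
```

Decoding first matters here because some Shift_JIS two-byte sequences end in a byte that looks like JSON punctuation when read as ASCII.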
Gary Kerr
  • There's your problem: JSON is never supposed to be transferred in Shift_JIS: the only valid encodings for JSON are UTF-*. Furthermore, the actual content type of `text/javascript` is weird. Additionally, when I open that URL with Firefox, I get the same response, but in UTF-8. – Thanatos Dec 07 '10 at 15:34
  • @Thanatos It seems that Google checks the User-Agent string. If you request that URL with a common browser User-Agent string (tested with Internet Explorer 6, Firefox 3, Safari...) you get the response encoded in UTF-8. I don't know why. – raben Dec 07 '10 at 18:03
  • @Raben: I was thinking it might actually be the "Accept-Charset" header - Firefox seems to explicitly request the result in either ISO-8859-1 or UTF-8, which urllib does not. Regardless, the answer given to urllib is wrong. – Thanatos Dec 07 '10 at 19:02
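The User-Agent observation in the comments could be tried with something like the sketch below. It assumes Python 3's `urllib.request`; the `BROWSER_UA` string and the helper names `build_request` and `fetch_suggestions` are illustrative, and there is no guarantee the endpoint still behaves as described. Reading the charset from the response headers keeps the code working whichever encoding the server picks.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://clients1.google.co.jp/complete/search"
# Illustrative browser-like User-Agent; any common browser string may do.
BROWSER_UA = "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"


def build_request(query):
    """Build the request with a browser-like User-Agent header."""
    params = urllib.parse.urlencode({"hl": "ja", "q": query, "json": "t"})
    return urllib.request.Request(
        BASE_URL + "?" + params,
        headers={"User-Agent": BROWSER_UA},
    )


def fetch_suggestions(query):
    """Fetch and parse suggestions, trusting the charset the server declares."""
    with urllib.request.urlopen(build_request(query)) as response:
        # Fall back to UTF-8 only if the Content-Type header names no charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return json.loads(response.read().decode(charset))
```

Calling `fetch_suggestions("a")` performs the network request; `build_request("a")` alone only constructs it.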
0

Regardless of what the spec says, the string "\x83A\x83}\x83]\x83\x93" is not UTF-8.

At a guess, it is one of [ "cp932", "shift_jis", "shift_jis_2004", "shift_jisx0213" ]; try decoding as one of these.
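Trying the candidates in turn might look like the following rough sketch, assuming Python 3. The helper name `decode_first_match` is made up for illustration, and UTF-8 is placed first only to show that it is rejected. Note that the first codec that decodes without error is not necessarily the right one, so the output should still be checked by eye.

```python
CANDIDATES = ["utf-8", "cp932", "shift_jis", "shift_jis_2004", "shift_jisx0213"]


def decode_first_match(raw_bytes, encodings=CANDIDATES):
    """Return (codec_name, text) for the first codec that decodes cleanly."""
    for name in encodings:
        try:
            return name, raw_bytes.decode(name)
        except UnicodeDecodeError:
            continue  # this codec rejects the bytes, try the next one
    raise ValueError("none of the candidate encodings matched")


# The byte string from the question.
sample = b"\x83A\x83}\x83]\x83\x93"
name, text = decode_first_match(sample)
print(name, text)
```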

Hugh Bothwell