Dump JSON from string in unknown character encoding

Question

I'm trying to dump HTML from websites into JSON, and I need a way to handle the different character encodings.

I've read that if it isn't utf-8, it's probably ISO-8859-1, so what I'm doing now is:

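import simplejson

json = None  # stays None if no candidate encoding works
for possible_encoding in ["utf-8", "ISO-8859-1"]:
   try:
      # post_dict contains, among other things, website html retrieved
      # with urllib2
      json = simplejson.dumps(post_dict, encoding=possible_encoding)
      break
   except UnicodeDecodeError:
      pass
if json is None:
   # UnicodeDecodeError cannot be raised without its five arguments
   raise ValueError("could not decode post_dict with any known encoding")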

This will of course fail if I come across any other encodings, so I'm wondering if there is a way to solve this problem in the general case.

The reason I'm trying to serialize the HTML in the first place is because I need to send it in a POST request to our NodeJS server. So, if someone has a different solution that allows me to do that (maybe without serializing to JSON at all), I'd be happy to hear that as well.

Does the HTML contain any `<meta>` tags? If so, you could check them to see if any of them are `<meta http-equiv="Content-Type">` and see if the `content` attribute tells you the character encoding. Alternatively, when retrieving the HTML in the first place, you may want to see if the `Content-Type` header includes an encoding. — Niet the Dark Absol, Jan 29 '13 at 20:23
In python, `"ISO-8859-1"` actually means ISO-8859-1. In web pages, ISO-8859-1 means Windows-1252 (`cp1252` in python). Browsers actually use Windows-1252 to decode claimed ISO-8859-1 and this is specified in the html5 draft. So you want to specify `["utf-8", "cp1252"]`. — Esailija, Jan 30 '13 at 13:27
See the replacement encodings here http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character-encodings-0 — Esailija, Jan 30 '13 at 13:34
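
To illustrate the ISO-8859-1 versus Windows-1252 point from the comments above, here is a small sketch, assuming Python 3. The sample bytes are made up for the example. Bytes in the 0x80-0x9F range decode to invisible control characters under `iso-8859-1`, but to printable characters such as curly quotes under `cp1252`.

```python
# Hypothetical sample: the word "quoted" wrapped in bytes 0x93 and 0x94,
# which Windows-1252 uses for curly double quotes.
raw = b"\x93quoted\x94"

# ISO-8859-1 maps every byte to the code point with the same number,
# so 0x93 and 0x94 become C1 control characters.
as_latin1 = raw.decode("iso-8859-1")

# cp1252 maps the same bytes to U+201C and U+201D (curly quotes).
as_cp1252 = raw.decode("cp1252")

print(repr(as_latin1))
print(repr(as_cp1252))
```

Note that `iso-8859-1` never raises `UnicodeDecodeError` in Python, because every byte value is valid in it, so it only makes sense as the last candidate in a list of fallbacks.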

score 1 · Accepted Answer · edited May 23 '17 at 12:04

You should know the character encoding regardless of the media type you use to send the POST request (unless you want to send binary blobs). To get the character encoding of your HTML content, see "A good way to get the charset/encoding of an HTTP response in Python".
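
A minimal sketch of reading the charset from a `Content-Type` header value, assuming Python 3 and only the standard library. The function name `charset_from_content_type` is hypothetical, and the header string is assumed to come from whichever HTTP client fetched the page.

```python
from email.message import Message


def charset_from_content_type(header_value, default=None):
    """Return the charset parameter of a Content-Type header, or default.

    Hypothetical helper: it reuses the stdlib email header parser rather
    than splitting the string by hand.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value it finds and returns
    # failobj when the header carries no charset parameter.
    return msg.get_content_charset(failobj=default)


declared = charset_from_content_type("text/html; charset=ISO-8859-1")
missing = charset_from_content_type("text/html", default="utf-8")
```

Here `declared` is `"iso-8859-1"` and `missing` falls back to `"utf-8"`. A header without a charset is common, so the caller still needs a fallback such as inspecting the document itself.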

To send post_dict as JSON, make sure all strings in it are Unicode (convert the HTML to Unicode as soon as you receive it) and don't pass the encoding parameter to json.dumps(). The parameter won't help you anyway if the different websites you fetch HTML from use different encodings.
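
The decode-first approach could look roughly like this. It is a sketch assuming Python 3, where `json.dumps()` has no `encoding` parameter at all. The helper name `decode_html`, the fallback order, and the sample `post_dict` contents are assumptions for illustration, not part of the original code.

```python
import json


def decode_html(raw_bytes, declared_charset=None):
    """Decode fetched bytes to a Unicode string (hypothetical helper).

    Tries the declared charset first, then utf-8, then cp1252, and finally
    decodes as utf-8 with replacement characters so it never raises.
    """
    candidates = []
    if declared_charset:
        candidates.append(declared_charset)
    candidates += ["utf-8", "cp1252"]
    for encoding in candidates:
        try:
            return raw_bytes.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers charset names Python does not know.
            continue
    return raw_bytes.decode("utf-8", errors="replace")


# Made-up sample: "café" as sent by a cp1252 site with no declared charset.
post_dict = {"url": "http://example.com/", "html": decode_html(b"caf\xe9")}

# Every string is already Unicode, so no encoding argument is needed.
payload = json.dumps(post_dict)
```

With the default `ensure_ascii=True`, `payload` contains only ASCII characters, with non-ASCII text written as `\uXXXX` escapes, so it can be sent in the POST body without further encoding concerns.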
