
I'm a bit surprised that it's so complicated to get the charset of a webpage with Python. Am I missing a way? The HTTPMessage has loads of functions, but not this.

>>> google = urllib2.urlopen('http://www.google.com/')
>>> google.headers.gettype()
'text/html'
>>> google.headers.getencoding()
'7bit'
>>> google.headers.getcharset()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: HTTPMessage instance has no attribute 'getcharset'

So you have to get the header, and split it. Twice.

>>> google = urllib2.urlopen('http://www.google.com/')
>>> charset = 'ISO-8859-1'
>>> contenttype = google.headers.getheader('Content-Type', '')
>>> if ';' in contenttype:
...     charset = contenttype.split(';')[1].split('=')[1]
>>> charset
'ISO-8859-1'

That's a surprising number of steps for such a basic function. Am I missing something?

Lennart Regebro
  • From RFC 2616 (HTTP/1.1) `The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.`, as a side-note to your default being ASCII. – plundra Dec 22 '10 at 15:05
  • @plundra: Well, ISO-8859-1 is a superset of ASCII, but you're correct - it's a different encoding. – Piskvor left the building Dec 22 '10 at 15:07
  • @Piskvor: And if one were to use the `charset` from above with s.decode() for example, things will break (with pages sending iso-8859-1 and relying on implicit) – plundra Dec 22 '10 at 15:11
  • Ah, so I should check for the type, and if it's text it should default to latin-1, and otherwise it's presumably binary and shouldn't be decoded at all. :) Yet another step of complexity. – Lennart Regebro Dec 22 '10 at 15:53

4 Answers


Have you checked this?

How to download any(!) webpage with correct charset in python?

Leniel Maccaferri

I did some research and came up with this solution:

response = urllib.request.urlopen(url)
encoding = response.headers.get_content_charset()

This is how I would do it in Python 3. I haven't tested it in Python 2, where urlopen lives in urllib2 (urllib2.urlopen) rather than in urllib.request.

Here is how it works, since the official Python documentation doesn't explain it very well. The result of urlopen is an http.client.HTTPResponse object. Its headers property is an http.client.HTTPMessage object which, according to the documentation, "is implemented using the email.message.Message class". That class has a method called get_content_charset, which tries to determine and return the character set of the response.

By default, this method returns None if it is unable to determine the character set, but you can override this behavior by passing a failobj parameter:

encoding = response.headers.get_content_charset(failobj="utf-8")
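The same lookup can be tried without a network call. This is a minimal sketch using only the standard library: it builds an email.message.Message by hand instead of fetching a page. The helper name charset_of and the header values are made up for illustration.

```python
from email.message import Message


def charset_of(content_type, default=None):
    """Return the charset parameter of a Content-Type value, or `default`.

    `charset_of` is a hypothetical helper; it only wraps
    Message.get_content_charset, the method that HTTPMessage inherits.
    """
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(failobj=default)


# Charset declared in the header (the result is lowercased by the method).
print(charset_of("text/html; charset=ISO-8859-1"))
# No charset parameter: falls back to None, or to the failobj we pass in.
print(charset_of("text/html"))
print(charset_of("text/html", default="utf-8"))
```

Note that the method lowercases the charset it finds, so compare against lowercase names.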
Elias Zamaria
  • `get_content_charset` isn't available in Python 2. You should be able to use `headers.getparam("charset")` instead (Python 2 only; Python 3 renames it to `get_param`). – Josh Kelley Jul 11 '15 at 02:21

You're not missing anything. It's doing the right thing: the encoding of an HTTP response is a subpart of Content-Type.

Note also that some pages might send only Content-Type: text/html and then set the encoding via <meta http-equiv="Content-Type" content="text/html; charset=utf-8"> - that's an ugly hack though (on the part of the page author) and is not too common.
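For pages that only declare the encoding in a meta tag, one rough fallback is to scan the first bytes of the body. This is a sketch, not a full HTML parser: the regular expression, the helper name sniff_meta_charset and the 1024-byte limit are all assumptions made for illustration.

```python
import re

# Looks for charset=... inside a <meta> tag. This covers the http-equiv form
# shown above and the shorter <meta charset="..."> form.
META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?\s*([A-Za-z0-9_.:-]+)',
    re.IGNORECASE,
)


def sniff_meta_charset(body, limit=1024):
    """Return the charset named in a <meta> tag near the top of `body`.

    `body` is the raw response as bytes. Only the first `limit` bytes are
    searched (an arbitrary cut-off). Returns None if nothing is found.
    """
    match = META_CHARSET.search(body[:limit])
    if match is None:
        return None
    return match.group(1).decode("ascii").lower()


page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>'
print(sniff_meta_charset(page))
```

A charset from the Content-Type header, when present, should normally take priority over whatever this finds.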

Piskvor left the building

I would go with chardet Universal Encoding Detector.

>>> import urllib
>>> urlread = lambda url: urllib.urlopen(url).read()
>>> import chardet
>>> chardet.detect(urlread("http://google.cn/"))
{'encoding': 'GB2312', 'confidence': 0.99}

Your approach is correct, but it would fail for pages where the charset is declared in a meta tag or is not declared at all.
If you look closer at the chardet sources, it has charsetprober and charsetgroupprober modules that deal with this problem nicely.
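Since a detector only guesses, one possible arrangement is to trust a declared charset when there is one and fall back to detection otherwise. This is a sketch under assumptions: choose_encoding and min_confidence are hypothetical names, and the detector is passed in as a callable that returns a dict shaped like the chardet.detect output above.

```python
def choose_encoding(header_charset, body, detect=None, min_confidence=0.5):
    """Prefer the declared charset; otherwise use a detector's guess.

    `detect` is any callable that takes bytes and returns a dict with
    'encoding' and 'confidence' keys. If it is omitted, chardet.detect is
    used, which assumes the third-party chardet package is installed.
    """
    if header_charset:
        return header_charset
    if detect is None:
        import chardet  # third-party, imported lazily
        detect = chardet.detect
    guess = detect(body)
    if guess.get("encoding") and guess.get("confidence", 0) >= min_confidence:
        return guess["encoding"]
    return None
```

The min_confidence threshold of 0.5 is arbitrary; returning None leaves the choice of default to the caller.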

systempuntoout
  • For me, this is not a good answer: chardet is "guessing the encoding of [the HTML] file" (see https://github.com/erikrose/chardet). You should, of course, first start by looking in the headers if it's declared. See the question pointed to by Leniel. – lajarre Jul 11 '13 at 16:02