35

Problem

When screen-scraping a webpage with Python, you have to know the page's character encoding. If you get the character encoding wrong, your output will be messed up.

People usually use some rudimentary technique to detect the encoding: they take the charset from the HTTP header, or the charset defined in a meta tag, or they run an encoding detector (which cares about neither meta tags nor headers). Using only one of these techniques, you will sometimes not get the same result as a browser would.

Browsers do it this way:

  • A meta tag (or XML declaration) always takes precedence
  • The encoding defined in the header is used when there is no charset defined in a meta tag
  • If the encoding is not defined at all, then it is time for encoding detection.

(Well... at least that is the way I believe most browsers do it. Documentation is really scarce.)

What I'm looking for is a library that can decide the character set of a page the way a browser would. I'm sure I'm not the first who needs a proper solution to this problem.

Solution (I have not tried it yet...)

According to Beautiful Soup's documentation:

Beautiful Soup tries the following encodings, in order of priority, to turn your document into Unicode:

  • An encoding you pass in as the fromEncoding argument to the soup constructor.
  • An encoding discovered in the document itself: for instance, in an XML declaration or (for HTML documents) an http-equiv META tag. If Beautiful Soup finds this kind of encoding within the document, it parses the document again from the beginning and gives the new encoding a try. The only exception is if you explicitly specified an encoding, and that encoding actually worked: then it will ignore any encoding it finds in the document.
  • An encoding sniffed by looking at the first few bytes of the file. If an encoding is detected at this stage, it will be one of the UTF-* encodings, EBCDIC, or ASCII.
  • An encoding sniffed by the chardet library, if you have it installed.
  • UTF-8
  • Windows-1252
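
For example, something like this (an untested sketch against Beautiful Soup 3; the URL is just a placeholder): let the soup run through that priority list and then check originalEncoding to see what it settled on, or pre-empt the list with fromEncoding.

import urllib2
from BeautifulSoup import BeautifulSoup  # Beautiful Soup 3

data = urllib2.urlopen('http://example.com/').read()

# Let Beautiful Soup walk its priority list and report what it chose.
soup = BeautifulSoup(data)
print soup.originalEncoding

# Or short-circuit the list by passing an encoding explicitly.
soup = BeautifulSoup(data, fromEncoding='iso-8859-2')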
Tarnay Kálmán
  • 4
    You can't download "any" page with a correct character set. Browsers guess wrong all the time, when the correct charset isn't specified. I use the view->encoding menu in FF to fix incorrect guesses on a daily basis. You want to do as well as you can, but give up on guessing every page correctly. – Glenn Maynard Sep 30 '09 at 02:08
  • 7
    Guessing character sets is evil and has got us into this mess in the first place. If the browsers had never attempted to guess, developers would be forced to learn about HTTP headers and always specify the encoding properly. Guessing means sometimes you are going to get it wrong – John La Rooy Oct 04 '09 at 01:04
  • gnibbler, guessing is a last resort – Tarnay Kálmán Oct 09 '09 at 15:46
  • 1
    This may be helpful: http://stackoverflow.com/a/24372670/28324 – Elias Zamaria Jun 23 '14 at 18:19

7 Answers

37

When you download a file with urllib or urllib2, you can find out whether a charset header was transmitted:

fp = urllib2.urlopen(request)
charset = fp.headers.getparam('charset')

You can use BeautifulSoup to locate a meta element in the HTML:

import BeautifulSoup  # Beautiful Soup 3

soup = BeautifulSoup.BeautifulSoup(data)
meta = soup.findAll('meta', {'http-equiv': lambda v: v and v.lower() == 'content-type'})
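
The charset is still buried in that meta element's content attribute (e.g. "text/html; charset=iso-8859-2"); a rough, untested way to dig it out (the regex and variable name are mine, not part of either library):

import re

charset_from_meta = None
for tag in meta:
    # content typically looks like "text/html; charset=iso-8859-2"
    match = re.search(r'charset=([\w-]+)', tag.get('content', ''), re.I)
    if match:
        charset_from_meta = match.group(1)
        break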

If neither is available, browsers typically fall back to user configuration, combined with auto-detection. As rajax proposes, you could use the chardet module. If you have user configuration available telling you that the page should be Chinese (say), you may be able to do better.

Martin v. Löwis
  • 5
    @kaizer.se: right; it's `get_param` in 3.x (but then, it's also urllib.request) – Martin v. Löwis Oct 07 '09 at 18:25
  • Unfortunately (at least in Python 2.7) urllib2 doesn't parse out charset from the Content-Type header, so you'll need to do something like the answer in http://stackoverflow.com/a/1020931/69707 – Ken Arnold Nov 29 '11 at 21:35
  • It is close, but it still has a few pieces missing: BOM marks are not taken into account, it does not say how to resolve the ambiguity between the HTTP header and the meta tag, and encoding names defined in HTTP headers and meta tags don't match the names supported by the Python stdlib. Using a library function which does all of that (like w3lib.encoding.html_to_unicode) instead of trying to get it right manually is usually a better idea. – Mikhail Korobov May 17 '17 at 10:39
15

Use the Universal Encoding Detector:

>>> import urllib2
>>> import chardet
>>> chardet.detect(urllib2.urlopen('http://google.cn/').read())
{'encoding': 'GB2312', 'confidence': 0.99}

The other option would be to just use wget:

  import os
  h = os.popen('wget -q -O foo1.txt http://foo.html')
  h.close()
  s = open('foo1.txt').read()
rajax
  • This is no good as it fails sometimes. Also see: http://chardet.feedparser.org/docs/faq.html#faq.yippie (Yippie!) – Tarnay Kálmán Sep 30 '09 at 00:48
  • The main problem with this approach that you ignore the page's explicitly specified character encoding. – Tarnay Kálmán Sep 30 '09 at 00:49
  • 2
    Ok, then there isn't a silver bullet here I'm afraid - so write it yourself. :) – rajax Sep 30 '09 at 00:50
  • For example chardet detects origo.hu as ISO-8859-8 while that page is actually ISO-8859-2 as defined by a meta tag. That would mess things up badly. – Tarnay Kálmán Sep 30 '09 at 00:57
  • Well... and how is foo1.txt encoded? :) The same way the webpage was encoded, and we still don't know what the encoding is. – Tarnay Kálmán Sep 30 '09 at 01:04
  • And you've lost data, since the very first place to look is the Content-Type HTTP header, which was thrown away. – Glenn Maynard Sep 30 '09 at 01:49
  • 2
    @Kalmi: You link to the chardet faq; less than 10 lines down, he links to feedparser, which does what you want: http://code.google.com/p/feedparser/source/browse/trunk/feedparser/feedparser.py#3133 (Granted, he only handles xml files, but 90% of the machinery you need is in there...) – Stobor Oct 08 '09 at 02:09
  • 1
    @Kalmi - There simply doesn't exist a solution that works every time, since many byte sequences can appear in many encodings. – Jonathan Feinberg Oct 09 '09 at 01:27
  • Stobor, your answer (um... comment...) is the best so far. :) – Tarnay Kálmán Oct 09 '09 at 19:34
4

It seems like you need a hybrid of the answers presented:

  1. Fetch the page using urllib
  2. Find the <meta> tags using Beautiful Soup or some other method
  3. If no meta tags exist, check the headers returned by urllib
  4. If that still doesn't give you an answer, use the universal encoding detector.

I honestly don't believe you're going to find anything better than that.

In fact, if you read further into the FAQ you linked to in the comments on the other answer, that's what the author of the detector library advocates.

If you believe the FAQ, this is what browsers do (as requested in your original question), as the detector is a port of the Firefox sniffing code.
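
A hedged sketch of those four steps glued together (Python 2 with urllib2, Beautiful Soup 3 and chardet assumed; the function name and the regex are mine, not from any of those libraries):

import re
import urllib2
import chardet
from BeautifulSoup import BeautifulSoup

def sniff_encoding(url):
    # 1. fetch the page
    fp = urllib2.urlopen(url)
    data = fp.read()

    # 2. look for a content-type meta tag first, as in the question
    soup = BeautifulSoup(data)
    metas = soup.findAll('meta',
        {'http-equiv': lambda v: v and v.lower() == 'content-type'})
    for tag in metas:
        match = re.search(r'charset=([\w-]+)', tag.get('content', ''), re.I)
        if match:
            return match.group(1)

    # 3. no meta tag: fall back to the HTTP header
    charset = fp.headers.getparam('charset')
    if charset:
        return charset

    # 4. last resort: content-based detection
    return chardet.detect(data)['encoding']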

Gareth Simpson
  • What I find odd is that there is no existing library/snippet for this. – Tarnay Kálmán Oct 09 '09 at 19:27
  • Stobor pointed out the existence of feedparser.py (which unfortunately handles only XML, but contains most of the things I need). – Tarnay Kálmán Oct 09 '09 at 19:35
  • The algorithm is not correct, as HTTP headers should take precedence over meta tags. It also misses BOM marks and an encoding normalisation step (encoding names in HTML/HTTP are not the same as names provided by Python). – Mikhail Korobov May 17 '17 at 10:27
3

I would use html5lib for this.
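
Something along these lines, for instance (an untested sketch; documentEncoding is the attribute described in the encoding-discovery documentation linked in the comment below, and the URL is a placeholder):

import urllib2
import html5lib

data = urllib2.urlopen('http://example.com/').read()

# Feed raw bytes to html5lib and let it do the BOM / meta sniffing itself.
parser = html5lib.HTMLParser()
doc = parser.parse(data)
print parser.documentEncoding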

vossad01
Tobu
  • 2
    This looks really nice. Documentation about how it does its encoding discovery: http://html5lib.readthedocs.org/en/latest/movingparts.html#encoding-discovery – Tarnay Kálmán Dec 11 '13 at 18:53
2

Scrapy downloads a page and detects the correct encoding for it, unlike requests.get(url).text or urlopen. To do so it tries to follow browser-like rules - this is the best one can do, because website owners have an incentive to make their websites work in a browser. Scrapy needs to take HTTP headers, <meta> tags, BOM marks and differences in encoding names into account.

Content-based guessing (chardet, UnicodeDammit) on its own is not a correct solution, as it may fail; it should only be used as a last resort, when headers, <meta> tags and BOM marks are not available or provide no information.

You don't have to use Scrapy to get its encoding detection functions; they are released (along with some other things) in a separate library called w3lib: https://github.com/scrapy/w3lib.

To get the page encoding and the unicode body, use the w3lib.encoding.html_to_unicode function with a content-based guessing fallback:

import chardet
from w3lib.encoding import html_to_unicode

def _guess_encoding(data):
    return chardet.detect(data).get('encoding')

# content_type_header: value of the Content-Type HTTP header (may be None)
# html_content_bytes: the raw, undecoded response body
detected_encoding, html_content_unicode = html_to_unicode(
    content_type_header,
    html_content_bytes,
    default_encoding='utf8',
    auto_detect_fun=_guess_encoding,
)
Mikhail Korobov
1

Instead of trying to get a page and then figuring out the charset the browser would use, why not just use a browser to fetch the page and check what charset it uses?

from win32com.client import DispatchWithEvents
import pythoncom
import threading


stopEvent=threading.Event()

class EventHandler(object):
    def OnDownloadBegin(self):
        pass

def waitUntilReady(ie):
    """
    copypasted from
    http://mail.python.org/pipermail/python-win32/2004-June/002040.html
    """
    if ie.ReadyState!=4:
        while 1:
            print "waiting"
            pythoncom.PumpWaitingMessages()
            stopEvent.wait(.2)
            if stopEvent.isSet() or ie.ReadyState==4:
                stopEvent.clear()
                break

ie = DispatchWithEvents("InternetExplorer.Application", EventHandler)
ie.Visible = 0
ie.Navigate('http://kskky.info')
waitUntilReady(ie)
d = ie.Document
print d.CharSet
Ravi
  • just tested this on origo.hu and it works, albeit incredibly slowly - maybe try with the firefox activex component instead – Ravi Sep 30 '09 at 18:45
1

BeautifulSoup does this with UnicodeDammit: Unicode, Dammit
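
A minimal sketch with BeautifulSoup 3 (raw_bytes stands for the undecoded page body; with BeautifulSoup 4 the import and attribute names differ slightly):

from BeautifulSoup import UnicodeDammit

dammit = UnicodeDammit(raw_bytes)
print dammit.originalEncoding   # the encoding it settled on
text = dammit.unicode           # the document decoded to unicode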

AlexCV