
I'm scraping news articles from various sites, using GAE and Python.

My code, which scrapes one article URL at a time, raises the following error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8858: ordinal not in range(128)

Here's my code in its simplest form:

from google.appengine.api import urlfetch

def fetch(url):
    headers = {'User-Agent' : "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        return result.content

Here is another variant I have tried, with the same result:

def fetch(url):
    headers = {'User-Agent' : "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content
        s = s.decode('utf-8')
        s = s.encode('utf-8')
        s = unicode(s,'utf-8')
        return s

Here's the ugly, brittle one, which also doesn't work:

def fetch(url):
    headers = {'User-Agent' : "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        s = result.content

        try:
            s = s.decode('iso-8859-1')
        except:
            pass
        try:
            s = s.decode('ascii')
        except: 
            pass
        try:
            s = s.decode('GB2312')
        except:
            pass
        try:
            s = s.decode('Windows-1251')
        except:
            pass
        try:
            s = s.decode('Windows-1252')
        except:
            s = "did not work"

        s = s.encode('utf-8')
        s = unicode(s,'utf-8')
        return s

The last variant returns s as the string "did not work" from the last except.

So, am I going to have to expand my clumsy try/except construction to encompass all possible encodings (will that even work?), or is there an easier way?

Why have I decided to scrape the entire HTML rather than just extract the parts I need with BeautifulSoup? Because I want to do the soupifying later, to avoid a DeadlineExceededError in GAE.

Have I read all the excellent articles about Unicode, and how it should be done? Yes. However, I have failed to find a solution that does not assume I know the incoming encoding, which I don't, since I'm scraping different sites every day.

memius
  • The `encode`-`unicode` cycle in both your examples does nothing at all. The second example is broken because having decoded a string using an arbitrary encoding, it repeatedly tries to `decode` it again. `decode`ing an already-`unicode` string is senseless. Ultimately you are guessing encodings, which is an area of considerable historical browser inconsistency, but you will do much better by trying the page's stated encoding first, as in moliware's answer. – bobince Aug 16 '13 at 08:19

2 Answers


I had the same problem some time ago, and nothing is 100% accurate. What I did was:

  • Get encoding from Content-Type
  • Get encoding from meta tags
  • Detect encoding with chardet Python module
  • Decode text from the most common encoding to Unicode
  • Process the text/html
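
In code, that sequence might look roughly like the sketch below. It is only a sketch: the decode_html helper and the exact fallback order are illustrative, and it assumes the chardet package is uploaded alongside the GAE app.

import re
import chardet  # third-party module; bundle it with the GAE app

def decode_html(content, content_type=None):
    # Collect candidate encodings, most trustworthy first
    candidates = []

    # 1. charset declared in the Content-Type response header
    if content_type:
        m = re.search(r'charset=([\w-]+)', content_type, re.I)
        if m:
            candidates.append(m.group(1))

    # 2. charset declared in a <meta> tag inside the document
    m = re.search(r'<meta[^>]+charset=["\']?([\w-]+)', content, re.I)
    if m:
        candidates.append(m.group(1))

    # 3. encoding guessed by chardet from the raw bytes
    guess = chardet.detect(content)
    if guess.get('encoding'):
        candidates.append(guess['encoding'])

    # 4. fall back to the most common encoding on the web
    candidates.append('utf-8')

    for enc in candidates:
        try:
            return content.decode(enc)
        except (LookupError, UnicodeDecodeError):
            continue

    # Last resort: keep going with replacement characters instead of failing
    return content.decode('utf-8', 'replace')

Your fetch function would then just pass the raw bytes and the response header through, something like:

def fetch(url):
    headers = {'User-Agent': "Chrome/11.0.696.16"}
    result = urlfetch.fetch(url, headers=headers)
    if result.status_code == 200:
        # header-key casing can vary, so check both spellings
        content_type = result.headers.get('content-type') or result.headers.get('Content-Type')
        return decode_html(result.content, content_type)
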
moliware

It is better to simply read the Content-Type from the meta tags or the response headers. Note that Chrome (unlike Opera) does not guess the encoding: if UTF-8 or anything else is not declared in either place, it treats the site as using the Windows default encoding. So only really badly behaved sites fail to declare it.
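
As a rough illustration of picking up that declaration (the regex is mine and only covers the common header and meta forms):

import re

# The encoding can be declared in either place:
#   HTTP header:  Content-Type: text/html; charset=ISO-8859-1
#   HTML5 meta:   <meta charset="utf-8">
#   HTML4 meta:   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
CHARSET_RE = re.compile(r'charset=["\']?([\w-]+)', re.I)

def declared_encoding(content_type_header, html):
    m = CHARSET_RE.search(content_type_header or '')
    if m:
        return m.group(1)
    m = CHARSET_RE.search(html)  # matches both <meta> forms above
    return m.group(1) if m else None  # None: the site did not declare an encoding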

Flash Thunder