
According to the answer by @Jens Timmerman on this post: Extract the first paragraph from a Wikipedia article (Python)

I did this:

import urllib2
def getPage(url):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'Mozilla/5.0')] #wikipedia needs this

    resource = opener.open("http://en.wikipedia.org/wiki/" + url)
    data = resource.read()
    resource.close()
    return data

print getPage('Steve_Jobs')

Technically it should run properly and give me the source of the page, but here's what I get:

(screenshot of the output: a screenful of unreadable, garbled characters instead of the page's HTML source)

Any help would be appreciated.

tenstar
  • Why crawl Wikipedia if you can use their [API](http://www.mediawiki.org/wiki/API)? – NullUserException Nov 03 '13 at 02:30
  • @NullUserException, I'm sorry, but I hate comments like yours. The OP wants to do it using `python`, can we please just focus on helping him achieve that instead of suggesting alternative methods? – Maria Ines Parnisari Nov 03 '13 at 02:47
  • @l19: NullUserException is perfectly right; the Wikipedia APIs can be used from Python (actually, that's one of the most common scenarios), since they are just simple HTTP requests like the one we are talking about now. The difference is that they are typically more flexible and the data returned is normally in machine-readable format, which is typically a big plus for our script *and* for wikipedia servers, which don't have to waste time rendering MediaWiki markup. – Matteo Italia Nov 03 '13 at 02:52
  • @l19 APIs are specifically designed for this purpose, so you don't have to crawl the website. Like Matteo said, this benefits both wikipedia *and* you. Actually, some websites explicitly forbid you from crawling them while allowing you access through an API. I don't think Wikipedia is one of them, but their [robots.txt](http://en.wikipedia.org/robots.txt) shows they aren't exactly very fond of crawling. If you're accessing someone's application, respect the developers' wishes and do it the way they want it done, through the API. – NullUserException Nov 03 '13 at 07:18

1 Answer


After checking with wget and curl, I saw that it wasn't a problem specific to Python: they too got "strange" characters. A quick check with file told me that the response is simply gzip-compressed, so it seems that Wikipedia just sends gzipped data by default, without checking whether the client actually declared support for it in the request.
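
As a quick, self-contained check (not part of the original post), the same diagnosis can be done from Python itself: gzip streams always start with the magic bytes 0x1f 0x8b, so printing the first two bytes of the raw response is enough to see what is going on. The URL and User-agent below just mirror the question's code.

import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]  # same header as in the question

resource = opener.open("http://en.wikipedia.org/wiki/Steve_Jobs")
data = resource.read()
resource.close()

# gzip data always begins with '\x1f\x8b'; plain HTML would begin with '<'
print repr(data[:2])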

Fortunately, Python is capable of decompressing gzipped data; integrating your code with this answer, you get:

import urllib2
from StringIO import StringIO
import gzip

def getPage(url):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'MyTestScript/1.0 (contact at myscript@mysite.com)'), ('Accept-encoding', 'gzip')]
    resource = opener.open("http://en.wikipedia.org/wiki/" + url)
    if resource.info().get('Content-Encoding') == 'gzip':
        # the body is gzip-compressed: wrap it in an in-memory buffer and decompress it
        buf = StringIO(resource.read())
        f = gzip.GzipFile(fileobj=buf)
        return f.read()
    else:
        return resource.read()

print getPage('Steve_Jobs')

which works just fine on my machine.
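
As a side note on the design choice: the StringIO + gzip.GzipFile pair is just one way to do it; the standard zlib module can decompress a gzip stream directly from the bytes already read into memory. A minimal variant under the same assumptions (the helper name getPageZlib is mine, not from the original answer) might look like this:

import urllib2
import zlib

def getPageZlib(url):
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'MyTestScript/1.0 (contact at myscript@mysite.com)'), ('Accept-encoding', 'gzip')]
    resource = opener.open("http://en.wikipedia.org/wiki/" + url)
    data = resource.read()
    resource.close()
    if resource.info().get('Content-Encoding') == 'gzip':
        # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer
        data = zlib.decompress(data, 16 + zlib.MAX_WBITS)
    return data

print getPageZlib('Steve_Jobs')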

Still, as already pointed out in the comments, you should probably avoid this kind of brute-force crawling; if you want to access Wikipedia content programmatically, use their APIs.
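
For completeness, here is a rough sketch of what the API route could look like from Python 2. It queries the MediaWiki action API with the TextExtracts parameters (action=query, prop=extracts) to fetch just the plain-text introduction of a page; treat the exact parameter set as an assumption rather than the one canonical way to do it, and the helper name getExtract is made up for this example.

import json
import urllib
import urllib2

def getExtract(title):
    params = urllib.urlencode({
        'action': 'query',
        'prop': 'extracts',
        'exintro': '',      # only the lead section
        'explaintext': '',  # plain text instead of HTML
        'format': 'json',
        'titles': title,
    })
    opener = urllib2.build_opener()
    opener.addheaders = [('User-agent', 'MyTestScript/1.0 (contact at myscript@mysite.com)')]
    resource = opener.open("http://en.wikipedia.org/w/api.php?" + params)
    data = json.loads(resource.read())
    resource.close()
    # the result is keyed by internal page id, so take the first (and only) page object
    page = data['query']['pages'].values()[0]
    return page.get('extract', '')

print getExtract('Steve_Jobs')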

Matteo Italia