40

I am trying to fetch a Wikipedia article with Python's urllib:

import urllib

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
s = f.read()
f.close()

However, instead of the HTML page, I get the following response (an "Error - Wikimedia Foundation" page):

Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to ()
Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT 

Wikipedia seems to block requests which are not from a standard browser.

Anybody know how to work around this?

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
dkp
  • 825
  • 3
  • 9
  • 14
  • 3
    Wikipedia doesn't block requests that are not from a standard browser; it blocks requests that are from standard libraries without changing their user agent. – svick Aug 05 '12 at 08:09

10 Answers

50

You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.

Straight from the examples:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
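
On Python 3, urllib2 has been merged into urllib.request, so the same idea looks roughly like this (a minimal sketch; the User-Agent string is only a placeholder, and Wikimedia's policy asks for a descriptive one with contact information):

import urllib.request

# Python 3: urllib2's functionality lives in urllib.request
req = urllib.request.Request(
    'http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes',
    headers={'User-Agent': 'MyBot/0.1 (contact: you@example.com)'})  # placeholder UA
with urllib.request.urlopen(req) as infile:
    page = infile.read()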
octosquidopus
  • 3,517
  • 8
  • 35
  • 53
Florian Bösch
  • 27,420
  • 11
  • 48
  • 53
  • 7
    Wikipedia attempts to block screen scrapers for a reason. Their servers have to do a lot of work to convert wikicode to HTML, when there are easier ways to get the article content. http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler – Cerin Aug 12 '10 at 17:49
  • 2
    You shouldn't try to impersonate a browser by using a user agent like `Mozilla/5.0`. Instead, [you should use an informative user agent with some contact information](http://meta.wikimedia.org/wiki/User-Agent_policy). – svick Aug 05 '12 at 08:08
37

It is not a solution to the specific problem, but it might be interesting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be so much easier, especially since you will directly get the article contents, which removes the need for you to parse the HTML.

I have used it myself for two projects, and it works very well.
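
As a rough sketch of what that looks like (assuming a recent mwclient release; Site, pages and text() are taken from its documented interface):

import mwclient

site = mwclient.Site('en.wikipedia.org')   # the library talks to the MediaWiki API for you
page = site.pages['Albert Einstein']
wikitext = page.text()                     # article wikitext, no HTML parsing needed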

Hannes Ovrén
  • 21,229
  • 9
  • 65
  • 75
  • 4
    Using third party libraries for what can easily be done with built-in libraries in a couple of lines of code isn't good advice. – Florian Bösch Sep 23 '08 at 10:18
  • 17
    Since mwclient uses the mediawiki api it will require no parsing of the content. And I am guessing the original poster wants the content, and not the raw html with menus and all. – Hannes Ovrén Sep 23 '08 at 10:52
15

Rather than trying to trick Wikipedia, you should consider using their High-Level API.
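
For example, a single api.php query (the endpoint and parameters follow the examples in the comment below, switched to format=json; the User-Agent string is only a placeholder) returns the current wikitext, roughly like this:

import json
import urllib2

url = ('http://en.wikipedia.org/w/api.php?format=json&action=query'
       '&prop=revisions&rvprop=content&titles=Albert%20Einstein')
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'MyBot/0.1 (contact: you@example.com)')]  # placeholder UA
data = json.load(opener.open(url))   # parsed API response containing the page revisions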

sligocki
  • 6,246
  • 5
  • 38
  • 47
  • Which will, in turn, still block requests from `urllib` using the library's default user-agent header. So the OP will still have the very same problem, although the API may be an easier way to interface with the wiki content, depending on what the OP's goals are. – njsg Feb 16 '12 at 10:26
  • They work fine for me. Don't they work for you? Ex: http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=info&titles=Main%20Page or http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Main%20Page&prop=revisions&rvprop=content – sligocki Feb 22 '12 at 20:35
3

In case you are trying to access Wikipedia content (and don't need any specific information about the page itself), instead of using the API you should just call index.php with 'action=raw' in order to get the wikitext, as in:

'http://en.wikipedia.org/w/index.php?action=raw&title=Main_Page'

Or, if you want the HTML code, use 'action=render' like in:

'http://en.wikipedia.org/w/index.php?action=render&title=Main_Page'

You can also define a section to get just part of the content with something like 'section=3'.

You could then access it using the urllib2 module (as suggested in the accepted answer). However, if you need information about the page itself (such as revisions), you'll be better off using mwclient, as suggested above.

Refer to MediaWiki's FAQ if you need more information.
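
Combining this with the accepted answer's approach, a minimal sketch (the User-Agent string is only a placeholder):

import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'MyBot/0.1 (contact: you@example.com)')]  # placeholder UA
# action=raw returns the article wikitext instead of the rendered HTML page
wikitext = opener.open('http://en.wikipedia.org/w/index.php?action=raw&title=Main_Page').read()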

mathias
  • 31
  • 2
  • hello, if I do not know the section number (such as 3) but I know the section title to be 'Noun', how do I get that particular section? – Raj Feb 23 '11 at 14:06
2

The general solution I use for any site is to access the page using Firefox and, using an extension such as Firebug, record all details of the HTTP request including any cookies.

In your program (in this case in Python), you should try to send an HTTP request as similar as necessary to the one that worked from Firefox. This often includes setting the User-Agent, Referer and Cookie fields, but there may be others.
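
A sketch of what replaying such a recorded request might look like with urllib2 (all of the header values below are hypothetical placeholders; substitute whatever Firebug actually recorded):

import urllib2

# hypothetical values copied from a request recorded in Firebug
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Firefox/3.0.1',
    'Referer': 'http://en.wikipedia.org/wiki/Main_Page',
    'Cookie': 'name=value',   # only if the recorded request actually sent cookies
}
req = urllib2.Request('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes',
                      headers=headers)
page = urllib2.urlopen(req).read()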

Liam
  • 19,819
  • 24
  • 83
  • 123
2

requests is awesome!

Here is how you can get the HTML content with requests:

import requests
html = requests.get('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes').text

Done!
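
If you also want to follow Wikimedia's User-Agent policy mentioned in the comments above, requests accepts a headers dict (the UA string here is only a placeholder):

import requests

headers = {'User-Agent': 'MyBot/0.1 (contact: you@example.com)'}  # placeholder descriptive UA
html = requests.get('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes',
                    headers=headers).text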

Aziz Alto
  • 19,057
  • 5
  • 77
  • 60
1

Try changing the user agent header you are sending in your request to something like: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1 (Linux Mint)

Vasil
  • 36,468
  • 26
  • 90
  • 114
1

You don't need to impersonate a browser user-agent; any user-agent at all will work, just not a blank one.

Gurch
  • 49
  • 2
  • 4
    urllib and urllib2 both send a user agent – Teifion Sep 23 '08 at 09:58
  • 2
    `s/blank/blank or default/` — the idea is exactly that you should somehow identify your bot through the user-agent header. That's why they block the `urllib` default one. – njsg Feb 16 '12 at 10:29
1

Requesting the page with ?printable=yes gives you an entire, relatively clean HTML document. ?action=render gives you just the body HTML. Requesting to parse the page through the MediaWiki action API with action=parse likewise gives you just the body HTML, but would be good if you want finer control; see the parse API help.

If you just want the page HTML so you can render it, it's faster and better to use the new RESTBase API, which returns a cached HTML representation of the page. In this case, https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein.

As of November 2015, you don't have to set your user-agent, but it's strongly encouraged. Also, nearly all Wikimedia wikis require HTTPS, so avoid a 301 redirect and make HTTPS requests.
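
A minimal sketch of that RESTBase call (using requests; setting a descriptive User-Agent header, as discussed above, is still encouraged):

import requests

# RESTBase returns cached, ready-to-render HTML for the page
html = requests.get('https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein').text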

skierpage
  • 2,514
  • 21
  • 19
0
import urllib
s = urllib.urlopen('http://en.wikipedia.org/w/index.php?action=raw&title=Albert_Einstein').read()

This seems to work for me without changing the user agent. Without the "action=raw" it does not work for me.

Finn Årup Nielsen
  • 6,130
  • 1
  • 33
  • 43