40

I am trying to fetch a Wikipedia article with Python's urllib:

import urllib

f = urllib.urlopen("http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes")
s = f.read()
f.close()

However, instead of the HTML page, I get the following response (an "Error - Wikimedia Foundation" page):

Request: GET http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes, from 192.35.17.11 via knsq1.knams.wikimedia.org (squid/2.6.STABLE21) to ()
Error: ERR_ACCESS_DENIED, errno [No Error] at Tue, 23 Sep 2008 09:09:08 GMT 

Wikipedia seems to block requests which are not from a standard browser.

Anybody know how to work around this?

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
dkp
  • 825
  • 3
  • 9
  • 14
  • 3
    Wikipedia doesn't block requests that are not from a standard browser; it blocks requests that are from standard libraries without changing their user agent. – svick Aug 05 '12 at 08:09

10 Answers

50

You need to use urllib2, which supersedes urllib in the Python standard library, in order to change the user agent.

Straight from the examples:

import urllib2
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
infile = opener.open('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes')
page = infile.read()
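
On Python 3, urllib2 has been merged into urllib.request, so the same idea looks roughly like this (a minimal sketch; the User-Agent string is only a placeholder, and Wikimedia's policy asks for a descriptive one with contact information):

import urllib.request

# Python 3: urllib2's functionality lives in urllib.request
req = urllib.request.Request(
    'http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes',
    headers={'User-Agent': 'MyBot/0.1 (contact: you@example.com)'})  # placeholder UA
with urllib.request.urlopen(req) as infile:
    page = infile.read()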
octosquidopus
  • 3,517
  • 8
  • 35
  • 53
Florian Bösch
  • 27,420
  • 11
  • 48
  • 53
  • 7
    Wikipedia attempts to block screen scrapers for a reason. Their servers have to do a lot of work to convert wikicode to HTML, when there are easier ways to get the article content. http://en.wikipedia.org/wiki/Wikipedia:Database_download#Please_do_not_use_a_web_crawler – Cerin Aug 12 '10 at 17:49
  • 2
    You shouldn't try to impersonate a browser by using a user agent like `Mozilla/5.0`. Instead, [you should use an informative user agent with some contact information](http://meta.wikimedia.org/wiki/User-Agent_policy). – svick Aug 05 '12 at 08:08
37

It is not a solution to the specific problem, but it might be interesting for you to use the mwclient library (http://botwiki.sno.cc/wiki/Python:Mwclient) instead. That would be so much easier, especially since you will directly get the article contents, which removes the need for you to parse the HTML.

I have used it myself for two projects, and it works very well.
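
As a rough sketch of what that looks like (assuming a recent mwclient release; Site, pages and text() are taken from its documented interface):

import mwclient

site = mwclient.Site('en.wikipedia.org')   # the library talks to the MediaWiki API for you
page = site.pages['Albert Einstein']
wikitext = page.text()                     # article wikitext, no HTML parsing needed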

Hannes Ovrén
  • 21,229
  • 9
  • 65
  • 75
  • 4
    Using third party libraries for what can easily be done with built-in libraries in a couple of lines of code isn't good advice. – Florian Bösch Sep 23 '08 at 10:18
  • 17
    Since mwclient uses the mediawiki api it will require no parsing of the content. And I am guessing the original poster wants the content, and not the raw html with menus and all. – Hannes Ovrén Sep 23 '08 at 10:52
15

Rather than trying to trick Wikipedia, you should consider using their High-Level API.
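
For example, a single api.php query (the endpoint and parameters follow the examples in the comment below, switched to format=json; the User-Agent string is only a placeholder) returns the current wikitext, roughly like this:

import json
import urllib2

url = ('http://en.wikipedia.org/w/api.php?format=json&action=query'
       '&prop=revisions&rvprop=content&titles=Albert%20Einstein')
opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'MyBot/0.1 (contact: you@example.com)')]  # placeholder UA
data = json.load(opener.open(url))   # parsed API response containing the page revisions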

sligocki
  • 6,246
  • 5
  • 38
  • 47
  • Which will, in turn, still block requests from `urllib` using the library's default user-agent header. So the OP will still have the very same problem, although the API may be an easier way to interface with the wiki content, depending on what the OP's goals are. – njsg Feb 16 '12 at 10:26
  • They work fine for me. Don't they work for you? Ex: http://en.wikipedia.org/w/api.php?format=xml&action=query&prop=info&titles=Main%20Page or http://en.wikipedia.org/w/api.php?format=xml&action=query&titles=Main%20Page&prop=revisions&rvprop=content – sligocki Feb 22 '12 at 20:35
3

In case you are trying to access Wikipedia content (and don't need any specific information about the page itself), instead of using the API you should just call index.php with 'action=raw' in order to get the wikitext, as in:

'http://en.wikipedia.org/w/index.php?action=raw&title=Main_Page'

Or, if you want the HTML code, use 'action=render' like in:

'http://en.wikipedia.org/w/index.php?action=render&title=Main_Page'

You can also define a section to get just part of the content with something like 'section=3'.

You could then access it using the urllib2 module (as suggested in the accepted answer). However, if you need information about the page itself (such as revisions), you'll be better off using mwclient, as suggested above.

Refer to MediaWiki's FAQ if you need more information.
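
Combining this with the accepted answer's approach, a minimal sketch (the User-Agent string is only a placeholder):

import urllib2

opener = urllib2.build_opener()
opener.addheaders = [('User-agent', 'MyBot/0.1 (contact: you@example.com)')]  # placeholder UA
# action=raw returns the article wikitext instead of the rendered HTML page
wikitext = opener.open('http://en.wikipedia.org/w/index.php?action=raw&title=Main_Page').read()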

mathias
  • 31
  • 2
  • hello, if I do not know the section number (such as 3) but I know the section title to be 'Noun', how do I get that particular section? – Raj Feb 23 '11 at 14:06
2

The general solution I use for any site is to access the page using Firefox and, using an extension such as Firebug, record all details of the HTTP request including any cookies.

In your program (in this case in Python), you should try to send an HTTP request as similar as necessary to the one that worked from Firefox. This often includes setting the User-Agent, Referer and Cookie fields, but there may be others.
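
A sketch of what replaying such a recorded request might look like with urllib2 (all of the header values below are hypothetical placeholders; substitute whatever Firebug actually recorded):

import urllib2

# hypothetical values copied from a request recorded in Firebug
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Firefox/3.0.1',
    'Referer': 'http://en.wikipedia.org/wiki/Main_Page',
    'Cookie': 'name=value',   # only if the recorded request actually sent cookies
}
req = urllib2.Request('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes',
                      headers=headers)
page = urllib2.urlopen(req).read()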

Liam
  • 19,819
  • 24
  • 83
  • 123
2

requests is awesome!

Here is how you can get the HTML content with requests:

import requests
html = requests.get('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes').text

Done!
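
If you also want to follow Wikimedia's User-Agent policy mentioned in the comments above, requests accepts a headers dict (the UA string here is only a placeholder):

import requests

headers = {'User-Agent': 'MyBot/0.1 (contact: you@example.com)'}  # placeholder descriptive UA
html = requests.get('http://en.wikipedia.org/w/index.php?title=Albert_Einstein&printable=yes',
                    headers=headers).text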

Aziz Alto
  • 19,057
  • 5
  • 77
  • 60
1

Try changing the user agent header you are sending in your request to something like: User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008072820 Ubuntu/8.04 (hardy) Firefox/3.0.1 (Linux Mint)

Vasil
  • 36,468
  • 26
  • 90
  • 114
1

You don't need to impersonate a browser user-agent; any user-agent at all will work, just not a blank one.

Gurch
  • 49
  • 2
  • 4
    urllib and urllib2 both send a user agent – Teifion Sep 23 '08 at 09:58
  • 2
    `s/blank/blank or default/` — the idea is exactly that you should somehow identify your bot through the user-agent header. That's why they block the `urllib` default one. – njsg Feb 16 '12 at 10:29
1

Requesting the page with ?printable=yes gives you an entire, relatively clean HTML document. ?action=render gives you just the body HTML. Requesting to parse the page through the MediaWiki action API with action=parse likewise gives you just the body HTML, but would be good if you want finer control; see the parse API help.

If you just want the page HTML so you can render it, it's faster and better to use the new RESTBase API, which returns a cached HTML representation of the page. In this case, https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein.

As of November 2015, you don't have to set your user-agent, but it's strongly encouraged. Also, nearly all Wikimedia wikis require HTTPS, so avoid a 301 redirect and make HTTPS requests.
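
A minimal sketch of that RESTBase call (using requests; setting a descriptive User-Agent header, as discussed above, is still encouraged):

import requests

# RESTBase returns cached, ready-to-render HTML for the page
html = requests.get('https://en.wikipedia.org/api/rest_v1/page/html/Albert_Einstein').text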

skierpage
  • 2,514
  • 21
  • 19
0
import urllib
s = urllib.urlopen('http://en.wikipedia.org/w/index.php?action=raw&title=Albert_Einstein').read()

This seems to work for me without changing the user agent. Without the "action=raw" it does not work for me.

Finn Årup Nielsen
  • 6,130
  • 1
  • 33
  • 43