
I'd like to download all Wikipedia pages as HTML.

The Wikipedia API does an excellent job of fetching a wiki article in HTML, provided the search title is supplied.

I use the snippet below to fetch a wiki article given its title:

import urllib
import wikipedia

title = 'Barack Obama'
page = wikipedia.page(title)      # resolve the title to a page object
data = urllib.urlopen(page.url)   # fetch the article's URL (Python 2)
htmlSource = data.read()          # raw HTML of the rendered page

The above snippet gives me the HTML of the wiki page on Barack Obama.
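From what I can tell, the same HTML can also be fetched in a single request through the standard MediaWiki action=parse endpoint, skipping the extra urlopen round trip. A minimal sketch, assuming the requests library is installed (the fetch_html name is just mine):

import requests

API = 'https://en.wikipedia.org/w/api.php'

def fetch_html(title):
    # Ask the API to render the article and hand back its HTML body.
    params = {'action': 'parse', 'page': title, 'prop': 'text',
              'format': 'json', 'formatversion': 2}
    resp = requests.get(API, params=params)
    resp.raise_for_status()
    return resp.json()['parse']['text']

htmlSource = fetch_html('Barack Obama')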

I need the HTML file specifically because I have written some regexes to extract relevant information from the page.
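For illustration only (this is a made-up pattern, not one of my actual regexes), a pass over the fetched page might look like:

import re

# Hypothetical example: capture the inner HTML of every <h2> section
# heading in the htmlSource fetched above.
headings = re.findall(r'<h2[^>]*>(.*?)</h2>', htmlSource, re.DOTALL)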

I'd be glad if anybody could help me accomplish this task.

  • You're using the (incredibly confusingly named) [wikipedia library](https://pypi.python.org/pypi/wikipedia/), right? – svick Apr 15 '16 at 13:11
  • Also, you'd be much better off downloading the single XML dump, instead of downloading the HTML page by page. – svick Apr 15 '16 at 13:16
  • Yes, exactly. But as I mentioned, I have some regexes that help me parse patterns in the HTML file and retrieve relevant information. I'd be glad if you could tell me how to get a complete HTML dump (a rough page-by-page sketch follows these comments) – Sam Apr 15 '16 at 13:50
  • 1
    [I believe there is no simple way to get that.](https://www.mediawiki.org/wiki/Extension:DumpHTML#Beware.2C_cowboy.21) – svick Apr 15 '16 at 13:59
  • What sort of "relevant information"? If you mean infobox and the like, please see http://stackoverflow.com/a/33862337/1333493 – Nemo Jun 10 '16 at 06:08
  • [You should also not be using regexes to parse HTML](http://stackoverflow.com/a/1732454/1306662) – Krenair Nov 06 '16 at 06:58
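Picking up the page-by-page route from the comments above: a rough sketch of enumerating every title through the API's allpages list with continuation, then fetching each page's HTML. This is illustrative only; it assumes the requests library and the fetch_html helper sketched earlier, and walking all of English Wikipedia this way will be extremely slow.

import requests

API = 'https://en.wikipedia.org/w/api.php'

def all_titles():
    # Walk the allpages list, following the API's continuation tokens.
    params = {'action': 'query', 'list': 'allpages',
              'aplimit': 'max', 'format': 'json'}
    while True:
        resp = requests.get(API, params=params).json()
        for page in resp['query']['allpages']:
            yield page['title']
        if 'continue' not in resp:
            break
        params.update(resp['continue'])

# e.g. save each article's HTML using the earlier fetch_html sketch:
# for title in all_titles():
#     html = fetch_html(title)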

0 Answers