
I'd like to download all Wikipedia pages as HTML.

The Wikipedia API does an excellent job of fetching a wiki article in HTML, provided the search title is supplied.

I use the snippet below to fetch a wiki article given its title:

import urllib
import wikipedia

title = 'Barack Obama'
page = wikipedia.page(title)      # resolve the title to a page object
data = urllib.urlopen(page.url)   # fetch the article's URL (Python 2)
htmlSource = data.read()          # raw HTML of the rendered page

The above snippet gives me the HTML of the wiki page on Barack Obama.
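From what I can tell, the same HTML can also be fetched in a single request through the standard MediaWiki action=parse endpoint, skipping the extra urlopen round trip. A minimal sketch, assuming the requests library is installed (the fetch_html name is just mine):

import requests

API = 'https://en.wikipedia.org/w/api.php'

def fetch_html(title):
    # Ask the API to render the article and hand back its HTML body.
    params = {'action': 'parse', 'page': title, 'prop': 'text',
              'format': 'json', 'formatversion': 2}
    resp = requests.get(API, params=params)
    resp.raise_for_status()
    return resp.json()['parse']['text']

htmlSource = fetch_html('Barack Obama')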

I need the HTML file specifically because I have written some regexes to extract relevant information from the page.
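For illustration only (this is a made-up pattern, not one of my actual regexes), a pass over the fetched page might look like:

import re

# Hypothetical example: capture the inner HTML of every <h2> section
# heading in the htmlSource fetched above.
headings = re.findall(r'<h2[^>]*>(.*?)</h2>', htmlSource, re.DOTALL)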

I'd be glad if anybody could help me accomplish this task.

  • You're using the (incredibly confusingly named) [wikipedia library](https://pypi.python.org/pypi/wikipedia/), right? – svick Apr 15 '16 at 13:11
  • Also, you'd be much better off downloading the single XML dump, instead of downloading the HTML page by page. – svick Apr 15 '16 at 13:16
  • Yes, exactly. But as I mentioned, I have some regexes that help me parse patterns in the HTML file and retrieve relevant information. I'd be glad if you could tell me how to get a complete HTML dump (a rough page-by-page sketch follows these comments) – Sam Apr 15 '16 at 13:50
  • 1
    [I believe there is no simple way to get that.](https://www.mediawiki.org/wiki/Extension:DumpHTML#Beware.2C_cowboy.21) – svick Apr 15 '16 at 13:59
  • What sort of "relevant information"? If you mean infobox and the like, please see http://stackoverflow.com/a/33862337/1333493 – Nemo Jun 10 '16 at 06:08
  • [You should also not be using regexes to parse HTML](http://stackoverflow.com/a/1732454/1306662) – Krenair Nov 06 '16 at 06:58
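Picking up the page-by-page route from the comments above: a rough sketch of enumerating every title through the API's allpages list with continuation, then fetching each page's HTML. This is illustrative only; it assumes the requests library and the fetch_html helper sketched earlier, and walking all of English Wikipedia this way will be extremely slow.

import requests

API = 'https://en.wikipedia.org/w/api.php'

def all_titles():
    # Walk the allpages list, following the API's continuation tokens.
    params = {'action': 'query', 'list': 'allpages',
              'aplimit': 'max', 'format': 'json'}
    while True:
        resp = requests.get(API, params=params).json()
        for page in resp['query']['allpages']:
            yield page['title']
        if 'continue' not in resp:
            break
        params.update(resp['continue'])

# e.g. save each article's HTML using the earlier fetch_html sketch:
# for title in all_titles():
#     html = fetch_html(title)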

0 Answers