
What would be the best programmatic way to grab all the HTML tables of Wikipedia main article pages where the pages' titles match certain keywords? Then I would like to take the column names and table data and put them into a database.

I would also grab the URL and page name for attribution.

I don't need specifics, just some recommended methods or perhaps links to some tutorials.

svick
txchou
    Rather than scraping, wouldn't you be better off using the api (http://www.mediawiki.org/wiki/API:Main_page). See also... http://stackoverflow.com/questions/627594/is-there-a-wikipedia-api – Verma Jul 31 '13 at 04:25
  • Yup. Sorry, I was using scraping as a general catch-all word. I have looked into the API. – txchou Aug 01 '13 at 04:18
  • Any particular programming language you will be using? – Verma Aug 01 '13 at 04:36
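
For the API route suggested in the comments, a rough sketch in Python (the search keyword, result limit, and the pandas-based table extraction are placeholders to show the shape of the approach, not a fixed recipe):

    import requests
    import pandas as pd

    API = "https://en.wikipedia.org/w/api.php"

    # 1. Find article titles matching a keyword (placeholder term and limit).
    search = requests.get(API, params={
        "action": "query", "list": "search",
        "srsearch": "programming languages", "srlimit": 5, "format": "json",
    }).json()

    for hit in search["query"]["search"]:
        title = hit["title"]
        # 2. Ask the API for the rendered HTML of the article.
        parsed = requests.get(API, params={
            "action": "parse", "page": title, "prop": "text", "format": "json",
        }).json()
        html = parsed["parse"]["text"]["*"]
        # 3. read_html() returns one DataFrame per <table> in the HTML,
        #    which you could then write to your database.
        try:
            tables = pd.read_html(html)
        except ValueError:  # page has no tables
            continue
        print(title, "->", len(tables), "tables")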

1 Answer


The easy approach is not to scrape the Wikipedia website at all. All of the data, metadata, and associated media that make up Wikipedia are available in structured formats, which removes any need to scrape its web pages.

To get the data from Wikipedia into your database (which you can then search, slice, and dice to your heart's content), the broad steps are these (see the sketch after the list):

  1. Download the data files (the dumps are published at dumps.wikimedia.org).
  2. Run the SQLize tool of your choice.
  3. Run mysqlimport.
  4. Drink a coffee.
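
If you would rather do steps 1–3 in one language instead of a separate SQLize tool plus mysqlimport, here is a minimal sketch in Python, assuming an uncompressed pages-articles XML dump and using SQLite in place of MySQL purely to keep the example self-contained:

    import sqlite3
    import xml.etree.ElementTree as ET

    def local(tag):
        # Strip the XML namespace so this works across dump schema versions.
        return tag.rsplit("}", 1)[-1]

    conn = sqlite3.connect("wiki.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, wikitext TEXT)")

    title, text = None, None
    # The file name is a placeholder for whichever dump file you downloaded.
    for event, elem in ET.iterparse("enwiki-latest-pages-articles.xml"):
        tag = local(elem.tag)
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text
        elif tag == "page":
            conn.execute("INSERT INTO pages VALUES (?, ?)", (title, text))
            elem.clear()  # free memory; the full dump is tens of gigabytes

    conn.commit()

Once the rows are in a database, matching page titles against your keywords is just a SQL query.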

The URL of the original article can be reconstructed from the page title easily enough.
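
For example, a minimal sketch (assuming the English Wikipedia; swap the host for other languages or projects):

    from urllib.parse import quote

    def article_url(title):
        # MediaWiki replaces spaces with underscores in article URLs.
        return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))

    print(article_url("Comparison of programming languages"))
    # https://en.wikipedia.org/wiki/Comparison_of_programming_languages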

Richard