
What would be the best programmatic way to grab all the HTML tables of Wikipedia main article pages where the pages' titles match certain keywords? Then I would like to take the column names and table data and put them into a database.

I would also grab the URL and page name for attribution.

I don't need specifics, just some recommended methods or perhaps links to some tutorials.

svick
txchou
    Rather than scraping, wouldn't you be better off using the api (http://www.mediawiki.org/wiki/API:Main_page). See also... http://stackoverflow.com/questions/627594/is-there-a-wikipedia-api – Verma Jul 31 '13 at 04:25
  • Yup. Sorry, I was using scraping as a general catch-all word. I have looked into the API. – txchou Aug 01 '13 at 04:18
  • Any particular programming language you will be using? – Verma Aug 01 '13 at 04:36
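
For the API route suggested in the comments, a rough sketch in Python (the search keyword, result limit, and the pandas-based table extraction are placeholders to show the shape of the approach, not a fixed recipe):

    import requests
    import pandas as pd

    API = "https://en.wikipedia.org/w/api.php"

    # 1. Find article titles matching a keyword (placeholder term and limit).
    search = requests.get(API, params={
        "action": "query", "list": "search",
        "srsearch": "programming languages", "srlimit": 5, "format": "json",
    }).json()

    for hit in search["query"]["search"]:
        title = hit["title"]
        # 2. Ask the API for the rendered HTML of the article.
        parsed = requests.get(API, params={
            "action": "parse", "page": title, "prop": "text", "format": "json",
        }).json()
        html = parsed["parse"]["text"]["*"]
        # 3. read_html() returns one DataFrame per <table> in the HTML,
        #    which you could then write to your database.
        try:
            tables = pd.read_html(html)
        except ValueError:  # page has no tables
            continue
        print(title, "->", len(tables), "tables")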

1 Answer


The easy approach is not to scrape the Wikipedia website at all. All of the data, metadata, and associated media that make up Wikipedia are available in structured formats, which removes any need to scrape its web pages.

To get the data from Wikipedia into your database (which you can then search, slice, and dice to your heart's content), the broad steps are these (see the sketch after the list):

  1. Download the data files (the dumps are published at dumps.wikimedia.org).
  2. Run the SQLize tool of your choice.
  3. Run mysqlimport.
  4. Drink a coffee.
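
If you would rather do steps 1–3 in one language instead of a separate SQLize tool plus mysqlimport, here is a minimal sketch in Python, assuming an uncompressed pages-articles XML dump and using SQLite in place of MySQL purely to keep the example self-contained:

    import sqlite3
    import xml.etree.ElementTree as ET

    def local(tag):
        # Strip the XML namespace so this works across dump schema versions.
        return tag.rsplit("}", 1)[-1]

    conn = sqlite3.connect("wiki.db")
    conn.execute("CREATE TABLE IF NOT EXISTS pages (title TEXT, wikitext TEXT)")

    title, text = None, None
    # The file name is a placeholder for whichever dump file you downloaded.
    for event, elem in ET.iterparse("enwiki-latest-pages-articles.xml"):
        tag = local(elem.tag)
        if tag == "title":
            title = elem.text
        elif tag == "text":
            text = elem.text
        elif tag == "page":
            conn.execute("INSERT INTO pages VALUES (?, ?)", (title, text))
            elem.clear()  # free memory; the full dump is tens of gigabytes

    conn.commit()

Once the rows are in a database, matching page titles against your keywords is just a SQL query.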

The URL of the original article can be reconstructed from the page title easily enough.
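
For example, a minimal sketch (assuming the English Wikipedia; swap the host for other languages or projects):

    from urllib.parse import quote

    def article_url(title):
        # MediaWiki replaces spaces with underscores in article URLs.
        return "https://en.wikipedia.org/wiki/" + quote(title.replace(" ", "_"))

    print(article_url("Comparison of programming languages"))
    # https://en.wikipedia.org/wiki/Comparison_of_programming_languages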

Richard