I'd like to tokenise out wikipedia pages of interest with a python library or libraries. I'm most interested in tables and listings. I want to be able to then import this data into Postgres or Neo4j.
For example, here are three data sets that I'd be interested in:
- How many points each country awarded one another in the Eurovision Song contest of 2008: http://en.wikipedia.org/wiki/Eurovision_Song_Contest_2008#Final
- List of currencies and the countries in which they circulate (a many-to-many relationship): http://en.wikipedia.org/wiki/List_of_circulating_currencies
- Lists of solar plants around the world: http://en.wikipedia.org/wiki/List_of_solar_thermal_power_stations
The source of each of these is written with wikipedia's brand of markup which is used to render them out. There are many wikipedia-specific tags and syntax used in the raw data form. The HTML might almost be the easier solution as I can just use BeautifulSoup.
Anyone know of a better way of tokenizeing? I feel that I'd reinvent the wheel if I took the final HTML and parsing it with BeautifulSoup. Also, if I could find a way to output these pages in XML, the table data might not be tokenized enough and it would require further processing.