
I'd like to tokenize Wikipedia pages of interest with a Python library (or libraries). I'm most interested in tables and listings. I then want to be able to import this data into Postgres or Neo4j.

For example, here are three data sets that I'd be interested in:

The source of each of these is written in Wikipedia's own brand of markup, which is used to render the pages. The raw form uses many Wikipedia-specific tags and bits of syntax. Parsing the rendered HTML might almost be the easier solution, as I could just use BeautifulSoup.

Does anyone know of a better way of tokenizing? I feel I'd be reinventing the wheel if I took the final HTML and parsed it with BeautifulSoup. Also, if I could find a way to output these pages as XML, the table data might not be tokenized enough and would require further processing.
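To show what I mean by the BeautifulSoup route, here's a rough sketch against one of the pages I have in mind (the URL and the `wikitable` CSS class are just what I'd expect the rendered tables to use, not something I've settled on):

```python
# Sketch of the BeautifulSoup route: fetch the rendered HTML and pull out
# the rows of every "wikitable"-classed table on the page.
import requests
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/Eurovision_Song_Contest_2008"
html = requests.get(url, headers={"User-Agent": "table-scraper-demo"}).text
soup = BeautifulSoup(html, "html.parser")

tables = []
for table in soup.find_all("table", class_="wikitable"):
    rows = []
    for tr in table.find_all("tr"):
        cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
        if cells:
            rows.append(cells)
    tables.append(rows)

# Each entry in `tables` is a list of rows that could then be written to
# Postgres or turned into Neo4j nodes and relationships.
print(len(tables), "tables extracted")
```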

Mark L
  • [Here](http://www.mediawiki.org/wiki/Alternative_parsers) are some parsers for the wiki syntax. There are some Python solutions, but you should choose one that generates an intermediate representation you can process further. [mediawiki-parser](https://github.com/peter17/mediawiki-parser) looks promising, for example. – schlamar May 24 '12 at 12:08
  • [Here's an example that uses mediawiki api to get data as XML.](http://stackoverflow.com/a/8045486/4279) Note: it doesn't tokenize the markup (for a few specific cases it might be simpler to process the raw text rather than a tokenized output of some mediawiki-markup parser). – jfs May 24 '12 at 12:42

2 Answers


Since Wikipedia is built on MediaWiki, there is an API you can exploit. There is also Special:Export that you can use.

Once you have the raw data, then you can run it through mwlib to parse it.
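As a rough sketch, the pipeline might look like the following. The raw wikitext is fetched through the API's `action=query`/`prop=revisions` call; the mwlib part assumes its `uparser.parseString` entry point, so check the mwlib docs for the exact API in your version:

```python
# Sketch: pull the raw wikitext through the MediaWiki API, then hand it to
# mwlib for parsing into a node tree.
import requests
from mwlib.uparser import parseString  # assumption: mwlib exposes this helper

title = "Eurovision Song Contest 2008"
resp = requests.get(
    "https://en.wikipedia.org/w/api.php",
    params={
        "action": "query",
        "prop": "revisions",
        "rvprop": "content",
        "titles": title,
        "format": "json",
    },
    headers={"User-Agent": "wikitext-fetch-demo"},
)
pages = resp.json()["query"]["pages"]
wikitext = next(iter(pages.values()))["revisions"][0]["*"]

# parseString returns a tree of nodes (sections, tables, lists, ...) that you
# can walk and flatten into rows for Postgres or Neo4j.
tree = parseString(title=title, raw=wikitext)
print(tree)
```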

Burhan Khalid
  • Thanks but I was specifically after a Python library which already tokenizes the pages. If I were to start from this API, I'd be writing a ton of code and if I were to be using something like python-wikitools there would still be a lot of parsing to do. – Mark L May 24 '12 at 12:08
  • Note: mwlib is a pain in the royal butt to use if you're not looking to host the code (and you're instead looking to strip out the unneeded markup). Been there. Tried that. Still wake up in cold sweats thinking about it. – Chris Pfohl May 24 '12 at 12:12
  • Thanks for the warning, Chris! :D – Mark L May 24 '12 at 12:17

This goes more in the semantic web direction, but DBpedia allows querying parts of Wikipedia's data (a community conversion effort) with SPARQL. This makes it theoretically straightforward to extract the data you need, although dealing with RDF triples might be cumbersome.

Furthermore, I don't know whether DBpedia yet contains any data that is of interest to you.
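If you want to try it, a query against the public endpoint is only a few lines with SPARQLWrapper. This is just an illustration: the resource URI and the "dump everything" query are placeholders, and whether the table data you actually need has been extracted is a separate question.

```python
# Sketch of querying DBpedia's public SPARQL endpoint with SPARQLWrapper.
# The query simply lists predicate/object pairs for one example resource.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?p ?o WHERE {
        <http://dbpedia.org/resource/Eurovision_Song_Contest_2008> ?p ?o .
    } LIMIT 50
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["p"]["value"], row["o"]["value"])
```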

Stan James
jhonkola
  • Thanks, I had a look and it doesn't contain any table data. It seems more interested in the article's structure than its content. http://dbpedia.org/page/Eurovision_Song_Contest_2008 doesn't have the points-awarded table you'd find at http://en.wikipedia.org/wiki/Eurovision_Song_Contest_2008#Final – Mark L May 24 '12 at 12:16
  • @MarkL On closer look, you are correct. I would say that the project participants are probably more interested in classifying things and making relations between things explicit (this being related to the semantic web) than in sets of data. – jhonkola May 24 '12 at 12:28