I stumbled upon the wikidump python library, which I think suits me just fine.

I could probably get by just by reading the source code, but I'm new to Python and I don't want to write sloppy code, since the project I need it for is quite important to me.

I have the 'wiki-SPECIFICDATE-pages-articles.xml.bz2' file and need to use it as my source for fetching single articles. Can anyone give me some pointers on how to do this properly or, even better, point me to some documentation? I couldn't find any!

(P.S. If you know of a better, properly documented library, please tell me.)

Riccardo
    Have you looked at their command-line client at https://github.com/saffsd/wikidump/blob/master/src/wikidump/__init__.py that can be used as an example? – MaxSem Apr 25 '14 at 01:06
    I use http://medialab.di.unipi.it/wiki/Wikipedia_Extractor to convert Wikipedia to plain text. It can be modified easily to fetch any article. Just debug one article's processing and you will see where to insert a regex match for fetching. – Vadim Oct 06 '14 at 20:45

1 Answer

I'm not sure I understand the question, but if you have the Wikipedia dump and need to parse the wikicode, I would suggest the mwparserfromhell library.
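
For example, here is a minimal sketch of how you could pull a single article out of that dump and hand it to mwparserfromhell, using only bz2 and xml.etree.ElementTree from the standard library. The get_article helper is my own, and the article title is a placeholder for the one you want:

    import bz2
    import xml.etree.ElementTree as ET

    import mwparserfromhell

    def get_article(dump_path, title):
        """Stream the dump and return the raw wikicode of one article, or None."""
        current_title = None
        with bz2.open(dump_path) as f:
            for _, elem in ET.iterparse(f):
                tag = elem.tag.rsplit('}', 1)[-1]  # drop the export XML namespace
                if tag == 'title':
                    current_title = elem.text
                elif tag == 'text' and current_title == title:
                    return elem.text
                elif tag == 'page':
                    elem.clear()  # free finished pages; the dump is huge
        return None

    raw = get_article('wiki-SPECIFICDATE-pages-articles.xml.bz2', 'Some article')
    if raw is not None:
        wikicode = mwparserfromhell.parse(raw)
        print(wikicode.strip_code()[:500])  # plain-text preview of the article

Streaming with iterparse matters here: a pages-articles dump does not fit in memory, so the code reads one element at a time and discards each page once it has been seen.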

Another powerful framework is Pywikibot, the historic framework for Wikipedia bot operators (as such, it has many scripts dedicated to writing pages rather than reading and parsing articles). It has a lot of documentation (though some of it is outdated) and it uses the MediaWiki API.
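
To give an idea, fetching one article live through the API looks roughly like this (Pywikibot normally expects a user-config.py to be set up first, and the title here is just an example):

    import pywikibot

    site = pywikibot.Site('en', 'wikipedia')  # English Wikipedia
    page = pywikibot.Page(site, 'Rome')       # example title
    print(page.text[:500])                    # raw wikicode via the MediaWiki API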

You can use them both, of course: PWB for fetching articles and mwparserfromhell for parsing.
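
A rough sketch of that combination, again with a placeholder title:

    import mwparserfromhell
    import pywikibot

    site = pywikibot.Site('en', 'wikipedia')
    text = pywikibot.Page(site, 'Rome').text  # fetch with PWB
    code = mwparserfromhell.parse(text)       # parse with mwparserfromhell
    for template in code.filter_templates():  # e.g. list the templates used
        print(template.name)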

Aubrey