
I know the question might be simpler than it seems, but after reading tons of material, I'm really confused.

So, I have downloaded a Wikipedia dump (this one, to be precise: enwiktionary-20151002-pages-articles-multistream.xml.bz2, which supposedly contains all articles from the English Wiktionary). What I want is to get the content of a specific article by title (the same way you would search for it on Wikipedia itself).

Note: I don't want the HTML (as generated by Wikipedia). I want the "real" content, i.e. the wikitext you see when editing any article on Wikipedia.

In a few words:

  • Search for the article with the title, e.g. "book"
  • Get the content

How should I go about that?


P.S. I'm not looking for a language-specific solution. I just need some ideas as to how this can be approached.

Dr.Kameleon
  • Not familiar with the wikipedia dump syntax; can you post a sample, or a link to the docs? – Steve Oct 14 '15 at 13:21
  • With almost 12k rep you should know these questions fall into the **"primarily opinion-based"** category – Pedro Lobito Oct 14 '15 at 13:24
  • I would start by uncompressing the `bz2` file as there does not appear to be a way to process it in compressed format – RiggsFolly Oct 14 '15 at 13:24
  • @PedroLobito Well, how is this "opinion-based"? I know there might be 10 different possible approaches (as there are in almost anything programming-related). I just need *one*. – Dr.Kameleon Oct 14 '15 at 13:25
  • @RiggsFolly Lol. I guess I've already gone past this part... :) – Dr.Kameleon Oct 14 '15 at 13:25
  • **" how is this "opinion-based"? I know there might be 10 different possible approaches"** You answered yourself. – Pedro Lobito Oct 14 '15 at 13:26
  • LOL - _Well that was not obvious from your question_. OK, if it is extracted to an XML file then you **could** use PHP's [XMLReader](http://php.net/manual/en/book.xmlreader.php). I would suggest sticking to an XML pull parser rather than `SimpleXML`, as I assume the XML file is quite large (see the sketch after this comment thread). Of course a database would probably be easier **and quicker** in the long run – RiggsFolly Oct 14 '15 at 16:05
  • @RiggsFolly: The compression stream wrapper is a way in PHP to process it compressed. In any case this question is a dupe of how to parse and process XML with PHP. Also asking for "just some ideas" doesn't work well with Stackoverflow. – hakre Oct 14 '15 at 16:12
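
To make the XMLReader suggestion from the comments concrete, here is a minimal PHP sketch. It assumes the standard MediaWiki export layout (each `<page>` element contains a `<title>` and a `<revision><text>` holding the raw wikitext) and that the dump has been uncompressed first; the file and title names are only examples.

```php
<?php
// Minimal sketch: stream the (uncompressed) dump with XMLReader and print the
// wikitext of one page. File name and title are examples only.
// Note: the compress.bzip2:// wrapper mentioned in the comments can open the
// .bz2 directly, but with the multistream variant it may only read the first
// stream, so decompressing first is the safer option.
$dump   = 'enwiktionary-20151002-pages-articles-multistream.xml';
$wanted = 'book';

$reader = new XMLReader();
$reader->open($dump);

$title = null;
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue;
    }
    if ($reader->name === 'title') {
        // Remember the most recent <title>; the matching <text> follows
        // inside the same <page>.
        $title = $reader->readString();
    } elseif ($reader->name === 'text' && $title === $wanted) {
        echo $reader->readString();   // the raw wikitext, as in the edit view
        break;
    }
}
$reader->close();
```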

1 Answer


If you're only after a short bit of information, you could use Wikipedia's JSON API: https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&exintro=&explaintext=&titles=book

If you want the full article, then I believe you can use this: https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts&explaintext=&titles=book

The difference between the two requests is that the first one also sets 'exintro', which limits the extract to the article's introduction, while the second sets only 'explaintext' and returns the whole article as plain text. In that output, section headings are delimited with markers like "\n\n\n===" and "===\n", so you can pick out where one section ends and the next begins, along with the section name.

For more info, check out https://en.wikipedia.org/w/api.php?action=help&modules=query%2Bextracts
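
As a rough illustration of the above (the parameters are the ones in the links; this assumes PHP with `allow_url_fopen` enabled and skips error handling), fetching and splitting the plain-text extract could look like this:

```php
<?php
// Fetch the plain-text extract for a title and split it on section headings.
$title = 'book';
$url = 'https://en.wikipedia.org/w/api.php?' . http_build_query([
    'format'      => 'json',
    'action'      => 'query',
    'prop'        => 'extracts',
    'explaintext' => '',          // add 'exintro' => '' to get only the intro
    'titles'      => $title,
]);

$data    = json_decode(file_get_contents($url), true);
$page    = reset($data['query']['pages']);   // pages are keyed by page id
$extract = $page['extract'];

// Headings appear as e.g. "\n\n\n== Noun ==\n"; keep the heading names too.
$parts = preg_split('/\n{2,}={2,}\s*(.*?)\s*={2,}\n/', $extract, -1, PREG_SPLIT_DELIM_CAPTURE);
// $parts now alternates: intro text, heading, section text, heading, ...
print_r($parts);
```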

Sorry this isn't in XML.

IsThisJavascript
  • The problem with this one is that it makes use of the online Wikipedia, or it would need a quasi-complete offline MediaWiki installation. The reason for having downloaded the dumps and wanting to do it offline is speed (I need to perform some massive processing). – Dr.Kameleon Oct 14 '15 at 13:28
  • @Dr.Kameleon Well, it appears they offer SQL dumps as well; I would suggest using those and then querying with regular SQL. It's going to be a lot more efficient than parsing XML with such a huge data set, unless you have a lot of RAM and can read the whole thing into memory (a sketch of the database approach follows below). – Steve Oct 14 '15 at 13:49
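
Building on the database suggestion from the comments, here is a rough sketch of a one-time import in PHP: stream the uncompressed XML dump once, store title and wikitext in SQLite, and afterwards each lookup is a plain indexed SQL query. The database file and table name are made up for the example.

```php
<?php
// One-off import: uncompressed XML dump -> SQLite table keyed by title.
// File, database and table names are examples only.
$pdo = new PDO('sqlite:wiktionary.db');
$pdo->exec('CREATE TABLE IF NOT EXISTS articles (title TEXT PRIMARY KEY, wikitext TEXT)');
$insert = $pdo->prepare('INSERT OR REPLACE INTO articles (title, wikitext) VALUES (?, ?)');

$reader = new XMLReader();
$reader->open('enwiktionary-20151002-pages-articles-multistream.xml');

$title = null;
$pdo->beginTransaction();   // a single transaction keeps the bulk insert fast
while ($reader->read()) {
    if ($reader->nodeType !== XMLReader::ELEMENT) {
        continue;
    }
    if ($reader->name === 'title') {
        $title = $reader->readString();
    } elseif ($reader->name === 'text' && $title !== null) {
        $insert->execute([$title, $reader->readString()]);
    }
}
$pdo->commit();
$reader->close();

// Later, looking an article up is plain SQL:
$stmt = $pdo->prepare('SELECT wikitext FROM articles WHERE title = ?');
$stmt->execute(['book']);
echo $stmt->fetchColumn();
```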