2

It is easy to download dumps of Wikipedia in XML format. However, the content of the articles is written in wikitext, which has a template system. To extract clean full texts from these dumps, it is necessary to expand these templates. Wikipedia provides an API to do so, but it is not suitable for expanding an entire dump. Several scripts can be found that deal with wikitext, such as this one written in Python, but they all seem outdated or simply don't deal with templates. Another way of tackling the problem would be to run MediaWiki locally and use API:Expandtemplates, but that seems to be a rather cumbersome solution. Finally, HTML dumps also exist, but I prefer to work with expanded wikitext, since it makes it easier to deal with wikilinks, tables, sections, etc.
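For reference, a single call to that API looks roughly like this (a minimal sketch against en.wikipedia.org using requests); it works fine for a handful of pages, which is exactly why it does not scale to a full dump:

```python
import requests

API_URL = "https://en.wikipedia.org/w/api.php"

def expand_wikitext(wikitext, title=None):
    """Ask MediaWiki to expand all templates in a snippet of wikitext."""
    params = {
        "action": "expandtemplates",
        "text": wikitext,
        "prop": "wikitext",
        "format": "json",
    }
    if title:
        params["title"] = title  # page context, affects magic words like {{PAGENAME}}
    resp = requests.get(API_URL, params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["expandtemplates"]["wikitext"]

print(expand_wikitext("{{convert|1|km|mi}}"))  # one HTTP round trip per call
```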

My goal here is to extract clean texts while keeping the wikilinks and discarding complicated templates such as infoboxes. Do you have any idea how to tackle this template expansion problem?
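To make that goal concrete, here is the kind of thing I have in mind, as a rough sketch with mwparserfromhell (the cleaning rules are just illustrative assumptions):

```python
import mwparserfromhell

def clean_text(wikitext):
    """Drop all templates, keep wikilinks, return plain text plus link annotations."""
    code = mwparserfromhell.parse(wikitext)
    # Remove top-level template nodes (infoboxes, citation templates, ...).
    for tpl in code.filter_templates(recursive=False):
        code.remove(tpl)
    # Collect (target, label) pairs before stripping the markup.
    links = [(str(l.title), str(l.text or l.title)) for l in code.filter_wikilinks()]
    # strip_code() renders each wikilink as its label and drops the remaining markup.
    return code.strip_code(), links

text, links = clean_text(
    "[[Paris]] is the capital of [[France|the French Republic]].{{Infobox country|capital=Paris}}"
)
```

The obvious limitation is that everything a template would have generated (converted units, dates, whole tables) simply disappears, which is why I am asking about actual expansion.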

Robin
  • 1,531
  • 1
  • 15
  • 35
  • 1
    This is pretty much hopeless, templates are their own programming language, and not a well-documented one. If simply discarding templates is not sufficient (mwparserfromhell could do that), your best bet is probably to make or find a dump of the HTML output by the [new parser](https://www.mediawiki.org/wiki/Specs/HTML/2.2.0), and rely on the semantic information encoded in it to convert it back to plaintext with link annotations. – Tgr Mar 11 '21 at 09:29
  • That could be a good option, thanks for the link to the specs. Do you have any idea how I could find an HTML dump? [The latest dumps date back to 2008](https://dumps.wikimedia.org/other/static_html_dumps/)... – Robin Mar 11 '21 at 11:21
  • Also, do you know of any good documentation on the "programming language" used for the templates? – Robin Mar 11 '21 at 11:29
  • 1
    Theoretically, HTML dumps will be available [some day soon](https://www.mediawiki.org/wiki/Okapi#Alpha_-_%22Okapi_HTML_Dumps%22). For now you'd have to crawl the site I think. Templates are documented [here](https://www.mediawiki.org/wiki/Help:Templates) but that's just the framework, they can contain just about anything from special commands pulling content from another wiki to embedded Lua code. As I said, it's pretty much hopeless to reimplement. – Tgr Mar 11 '21 at 16:57
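Following up on these comments: the Parsoid HTML marks wikilinks with rel="mw:WikiLink" (see the spec linked above), so turning a page of it into plain text plus link annotations is fairly mechanical. A rough sketch, assuming per-page HTML fetched from the REST API; for a whole corpus you would have to crawl, as noted above, until proper HTML dumps exist:

```python
import requests
from urllib.parse import quote
from bs4 import BeautifulSoup

def parsoid_text_and_links(title):
    """Fetch Parsoid HTML for one page and return (plain_text, wikilinks)."""
    url = f"https://en.wikipedia.org/api/rest_v1/page/html/{quote(title, safe='')}"
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    # Wikilinks carry rel="mw:WikiLink" in Parsoid output.
    links = [
        (a.get("href", ""), a.get_text())
        for a in soup.find_all("a")
        if "mw:WikiLink" in (a.get("rel") or [])
    ]
    # Drop tables (infoboxes and the like) and styling before extracting text.
    for tag in soup.find_all(["table", "style"]):
        tag.decompose()
    return soup.get_text(" ", strip=True), links

text, links = parsoid_text_and_links("France")
```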

2 Answers

2

I made a solution that uses Kiwix to get clean texts from Wikipedia. The HTML produced by Kiwix seems easy to parse for my purpose. I no longer make the code available (I didn't have time to make something shareable), but it turned out to be effective and fast to implement.
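For anyone following the same route: once the article HTML is extracted from the Kiwix ZIM file (e.g. with the zimdump tool from zim-tools), getting text and wikilinks out of it is a standard HTML-parsing job. A rough sketch with BeautifulSoup; the file path and the relative-href test for internal links are assumptions about how the exported pages look:

```python
from bs4 import BeautifulSoup

def kiwix_article_to_text(html_path):
    """Parse one article page extracted from a ZIM file into text + internal links."""
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    # Drop navigation chrome, infobox-style tables, and scripts/styles.
    for tag in soup.find_all(["table", "script", "style", "nav"]):
        tag.decompose()
    # Internal links are relative hrefs pointing at other article pages (assumption).
    links = [
        (a["href"], a.get_text())
        for a in soup.find_all("a", href=True)
        if not a["href"].startswith(("http://", "https://", "#"))
    ]
    return soup.get_text(" ", strip=True), links

text, links = kiwix_article_to_text("A/France.html")  # hypothetical path inside the extracted dump
```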

Robin
  • 1,531
  • 1
  • 15
  • 35
1

I believe that https://github.com/tatuylonen/wikitextprocessor/ does what you want:

This is a Python package for processing WikiMedia dump files for Wiktionary, Wikipedia, etc., for data extraction, error checking, offline conversion into HTML or other formats, and other uses. Key features include:

  • Parsing dump files, including built-in support for processing pages in parallel
  • Wikitext syntax parser that converts the whole page into a parse tree
  • Extracting template definitions and Scribunto Lua module definitions from dump files
  • Expanding selected templates or all templates, and heuristically identifying templates that need to be expanded before parsing is reasonably possible (e.g., templates that emit table start and end tags)
  • Processing and expanding wikitext parser functions
  • Processing, executing, and expanding Scribunto Lua modules (they are very widely used in, e.g., Wiktionary, for generating IPA strings for many languages)
  • Controlled expansion of parts of pages for applications that parse overall page structure before parsing but then expand templates on certain sections of the page
  • Capturing information from template arguments while expanding them, as template arguments often contain useful information not available in the expanded content.
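To give a rough feel for how the package is driven, here is a sketch assembled from memory of its README; the names (Wtp, process, start_page, expand) and their signatures are assumptions that may not match the current version, so treat it as pseudocode and check the repository's documentation:

```python
# Sketch only: names/signatures follow an older wikitextprocessor README and may
# have changed; consult https://github.com/tatuylonen/wikitextprocessor/ before use.
from wikitextprocessor import Wtp

ctx = Wtp()

def page_handler(model, title, text):
    # Called for every page in the dump; skip non-article content.
    if model != "wikitext" or ":" in title:
        return None
    ctx.start_page(title)
    expanded = ctx.expand(text)  # expand templates and Lua modules for this page
    return title, expanded

# process() reads the dump (in parallel) and applies the handler to each page.
for result in ctx.process("enwiki-latest-pages-articles.xml.bz2", page_handler):
    if result:
        title, expanded = result
        # post-process the expanded wikitext here, e.g. with mwparserfromhell
```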
Palmik
  • 2,675
  • 16
  • 13