
On an Ubuntu platform, I installed the nice little perl script

libtext-mediawikiformat-perl - Convert Mediawiki markup into other text formats

which is available on cpan. I'm not familiar with perl and have no idea how to go about using this library to write a perl script that would convert a mediawiki file to an html file. e.g. I'd like to just have a script I can run such as

./my_convert_script input.wiki > output.html

(perhaps also specifying the base url, etc), but have no idea where to start. Any suggestions?

Nemo
cboettig

2 Answers


I believe @amon is correct that the Perl library I referenced in the question is not the right tool for the task I proposed.

I ended up using the MediaWiki API with action=parse to convert to HTML using the MediaWiki engine itself, which turned out to be much more reliable than any of the alternative parsers proposed on the list that I tried. (I then used pandoc to convert my HTML to Markdown.) The MediaWiki API handles extraction of categories and other metadata too; I just had to append the base URL to internal image and page links.
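The link fix-up step can be sketched in Python (a hypothetical helper, not the code I actually used; it assumes internal links are root-relative like href="/wiki/..." or src="/images/...", which is the MediaWiki default but varies by site configuration):

```python
import re

def absolutize_links(html, baseurl):
    """Prefix site-relative href/src attributes with the wiki's base URL.

    Assumes internal page and image links start with "/", as MediaWiki
    emits by default; already-absolute URLs are left untouched.
    """
    # Rewrite href="/..." and src="/..." to absolute URLs
    return re.sub(r'(href|src)="/', r'\1="' + baseurl.rstrip("/") + '/', html)

html = '<a href="/wiki/Main_Page">home</a> <img src="/images/logo.png">'
print(absolutize_links(html, "https://example.org"))
# → <a href="https://example.org/wiki/Main_Page">home</a> <img src="https://example.org/images/logo.png">
```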

Given the page title and base url, I ended up writing this as an R function.

wiki_parse <- function(page, baseurl, format = "json", ...){
  require(httr)
  action <- "parse"
  # Build <baseurl>/api.php?format=json&action=parse&page=<page>
  addr <- paste(baseurl, "/api.php?format=", format, "&action=", action,
                "&page=", page, sep = "")
  # Identify the client; any extra arguments are passed through as httr config
  config <- c(add_headers("User-Agent" = "rwiki"), ...)
  out <- GET(addr, config = config)
  parsed_content(out)  # in current httr this is content(out, "parsed")
}
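For comparison, here is a minimal stdlib-only Python sketch of the same request (the helper names are my own; the response shape shown is what action=parse returns, with the rendered HTML under parse → text → *):

```python
from urllib.parse import urlencode

def wiki_parse_url(page, baseurl, fmt="json"):
    """Build the same api.php query the R function assembles by hand."""
    query = urlencode({"format": fmt, "action": "parse", "page": page})
    return baseurl.rstrip("/") + "/api.php?" + query

def extract_html(response_json):
    """Pull the rendered HTML out of an action=parse JSON response."""
    return response_json["parse"]["text"]["*"]

url = wiki_parse_url("Main_Page", "https://example.org/w")
print(url)
# → https://example.org/w/api.php?format=json&action=parse&page=Main_Page
# A real call would fetch this URL (with a User-Agent header, as the R
# code does) and json-decode the body; here a canned response stands in:
sample = {"parse": {"title": "Main_Page", "text": {"*": "<p>rendered</p>"}}}
print(extract_html(sample))
# → <p>rendered</p>
```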
cboettig

The Perl library Text::MediawikiFormat isn't really intended for stand-alone use but rather as a formatting engine inside a larger application.

The documentation on CPAN does show a way to use this library, and notes that other modules might provide better support for one-off conversions.

You could try this (untested) one-liner:

perl -MText::MediawikiFormat -e'$/=undef; print Text::MediawikiFormat::format(<>)' input.wiki >output.html

although that defies the whole point (and customization abilities) of this module.

I am sure that someone has already come up with a better way to convert single MediaWiki files, so here is a list of alternative MediaWiki processors on the MediaWiki site. This SO question could also be of help.

Other markup languages, such as Markdown, provide better support for single-file conversions. Markdown is especially well suited for technical documents and mirrors email conventions. (Also, it is used on this site.)


The libfoo-bar-perl packages in the Ubuntu repositories are pre-packaged CPAN modules; without them, these libraries would be installed via cpan or cpanm. While some of these modules do include scripts, most don't, and they aren't meant as stand-alone applications.

amon
  • Thanks for the answer. I started with that very SO question and a list of alternative MediaWiki processors, but after trying a handful that didn't work or supported only some subset of mediawiki syntax I figured I better look somewhere else. (Other answers from that SO aren't useful either since pandoc only goes one way for mediawiki). – cboettig Sep 28 '12 at 00:25
  • I agree entirely that markdown would be better -- my goal here is to pull all my old content off of my old mediawiki pages and convert it all to markdown for this very reason. It is surprising that it is not easier to find such mediawiki tools given that it is much older than markdown. – cboettig Sep 28 '12 at 00:26
  • 1
  • @cboettig one possible reason is that there is no real way to convert wikitext to markdown. At most one could convert a subset of wikitext to markdown, and that would still be difficult because wikitext is not a defined markup language. Things might be easier now with [Parsoid](https://www.mediawiki.org/wiki/Parsoid), which made wikitext parsing more "scientific". – Nemo Nov 08 '15 at 09:06