8

I'm interested in selectively parsing Mediawiki XML markup to generate a customized HTML page that's some subset of the HTML produced by the actual PHP Mediawiki render engine.

I want it for BzReader, an offline Mediawiki compressed dump reader written in C#. So a C# parser would be ideal, but any good code would help.

Of course, if no one has done it before, I guess it's time to start a project maintaining a free and separate Mediawiki parser, based on Mediawiki's own parser, but less tightly integrated with Mediawiki itself.

So, does anyone know of any base I could begin with, that would be better than hacking from the Mediawiki PHP code?

Asaf Bartov

3 Answers

7

There is a list of parsers at http://www.mediawiki.org/wiki/Alternative_parsers, but a C# parser is not included there.

wimh
  • For .NET integration, he could use IronPython though. – Dana the Sane Nov 28 '08 at 02:44
  • I gave up after a few hours of trying to use IronPython with those Python libraries. Too complicated... – jjxtra Aug 28 '11 at 20:41
  • This list is old and not updated. – ALOToverflow Jan 06 '12 at 02:42
  • @Francis, looking at the history it seems updated less than a month ago (PHP5 WP was added). But you are probably right that it is not complete, and projects which don't exist any more are still in the list. As with all Wikimedia projects, anybody can edit that page to improve it. That does not mean someone feels responsible to keep it fully up-to-date. – wimh Jan 06 '12 at 08:03
7

Update
Bear in mind that Screwturn doesn't stick to the Mediawiki syntax; it uses its own variation, which differs in places.

The Mediawiki syntax doesn't lend itself to an LALR parser (or even LL(*)), because its definition has a lot of ambiguities and it also allows HTML. There's a discussion of that in this question. You're essentially stuck with writing your own parser and tokenizer rather than simply writing a BNF file for it and then using ANTLR/Gold/Irony.
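
To give a rough idea of what "writing your own tokenizer" might look like in C#, here is a minimal sketch. The names WikiTokenizer, WikiToken and WikiTokenKind are made up for this example and don't come from any existing library, and it only recognises apostrophe runs and [[ ]] link brackets. It deliberately leaves a run of apostrophes unresolved: whether ''''' means bold, italic or both can only be decided by looking at the rest of the line, which is the kind of ambiguity a plain grammar file struggles with.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical token types for this sketch; not from any existing library.
public enum WikiTokenKind { Text, QuoteRun, LinkOpen, LinkClose }

public sealed class WikiToken
{
    public WikiToken(WikiTokenKind kind, string value) { Kind = kind; Value = value; }
    public WikiTokenKind Kind { get; }
    public string Value { get; }
}

public static class WikiTokenizer
{
    public static List<WikiToken> Tokenize(string input)
    {
        var tokens = new List<WikiToken>();
        var text = new StringBuilder();

        // Emit any buffered plain text as a single Text token.
        void Flush()
        {
            if (text.Length == 0) return;
            tokens.Add(new WikiToken(WikiTokenKind.Text, text.ToString()));
            text.Clear();
        }

        int i = 0;
        while (i < input.Length)
        {
            if (input[i] == '\'')
            {
                // Collect the whole run; its meaning is decided by a later pass.
                int start = i;
                while (i < input.Length && input[i] == '\'') i++;
                int run = i - start;
                if (run >= 2)
                {
                    Flush();
                    tokens.Add(new WikiToken(WikiTokenKind.QuoteRun, new string('\'', run)));
                }
                else
                {
                    text.Append('\''); // a lone apostrophe is ordinary text
                }
            }
            else if (string.CompareOrdinal(input, i, "[[", 0, 2) == 0)
            {
                Flush();
                tokens.Add(new WikiToken(WikiTokenKind.LinkOpen, "[["));
                i += 2;
            }
            else if (string.CompareOrdinal(input, i, "]]", 0, 2) == 0)
            {
                Flush();
                tokens.Add(new WikiToken(WikiTokenKind.LinkClose, "]]"));
                i += 2;
            }
            else
            {
                text.Append(input[i]);
                i++;
            }
        }

        Flush();
        return tokens;
    }
}
```

A real tokenizer would also need tokens for templates, tables, headings and embedded HTML; this only shows the general shape.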

Roadkill Wiki uses a Creole parser for its Mediawiki parsing, but with limited support.


Screwturn is released under the GPL license, and has a C# parser:

The class you are after is Core.Formatter, which uses lots of regexes to do its work:

public static class Formatter {

}

It's not the nicest looking code "but it works".
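
For anyone wondering what the regex approach looks like in practice, here is a small sketch of the general idea. This is not Screwturn's actual code: MiniWikiFormatter is a hypothetical class written for illustration. It handles only headings, bold, italic and [[links]], and it makes two assumptions: inline HTML is escaped rather than passed through, and the link target is used directly as a relative href.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; this is NOT ScrewTurn's Core.Formatter.
public static class MiniWikiFormatter
{
    // == Title == style headings; the closing run must match the opening run.
    private static readonly Regex Heading =
        new Regex(@"^(={1,6})[ \t]*(.+?)[ \t]*\1[ \t]*$", RegexOptions.Multiline);
    private static readonly Regex Bold = new Regex(@"'''(.+?)'''");
    private static readonly Regex Italic = new Regex(@"''(.+?)''");
    // [[Target]] or [[Target|label]]
    private static readonly Regex Link = new Regex(@"\[\[([^\]|]+)(?:\|([^\]]+))?\]\]");

    public static string Format(string wikiText)
    {
        // Assumption: inline HTML is escaped rather than passed through.
        string html = wikiText.Replace("&", "&amp;").Replace("<", "&lt;").Replace(">", "&gt;");

        html = Heading.Replace(html, m =>
        {
            int level = m.Groups[1].Length;
            return "<h" + level + ">" + m.Groups[2].Value + "</h" + level + ">";
        });

        // Order matters: bold (''') must be replaced before italic ('').
        html = Bold.Replace(html, "<b>$1</b>");
        html = Italic.Replace(html, "<i>$1</i>");

        html = Link.Replace(html, m =>
        {
            string target = m.Groups[1].Value.Trim();
            string label = m.Groups[2].Success ? m.Groups[2].Value : target;
            // Assumption: the page title doubles as a relative URL.
            string href = Uri.EscapeDataString(target.Replace(' ', '_'));
            return "<a href=\"" + href + "\">" + label + "</a>";
        });

        return html;
    }
}
```

Calling MiniWikiFormatter.Format("'''bold''' and ''italic''") returns "<b>bold</b> and <i>italic</i>". The weakness is easy to see: each regex runs independently of the others, so nested or overlapping markup, templates and tables quickly get out of hand.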

Community
Chris S
4

I had some words to say about Mediawiki templates here. Interesting that there's a list of alternative parsers now, I'll have to investigate that.

Greg Hewgill