3

I'm trying to convert wikitext into plain text using Python regular expression substitution. There are two formatting rules for wiki links:

  • [[Name of page]]
  • [[Name of page | Text to display]]

    (http://en.wikipedia.org/wiki/Wikipedia:Cheatsheet)

Here is some text that gives me a headache.

The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally.

The text above should be changed into:

The CD is composed almost entirely of cover versions of The Beatles songs which George Martin produced originally.

My main problem is the conflict between the [[ ]] and [[ | ]] forms. I don't need a single complex regular expression; applying several (maybe two) regular expression substitutions in sequence is fine.

Please enlighten me on this problem.

dda
redism
  • There is a parser for this: http://wiki.sheep.art.pl/Wiki%20Creole%20Parser%20in%20Python – Jochen Ritzel Feb 08 '11 at 03:42
  • Having one regex which can parse both the [[ ]] and the [[ | ]] grammars is not really more complicated than just the one for the [[ | ]] grammar, so you might as well just have the one. – mgiuca Feb 08 '11 at 03:42

4 Answers

7
import re
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')
def strip_wikilinks(the_string):  # wrapper name is illustrative
    return wikilink_rx.sub(r'\1', the_string)

Example: http://ideone.com/7oxuz
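
As a quick check, here is a minimal sketch that runs the same pattern over the sentence from the question. The variable names `sample` and `result` are only illustrative. The optional `(?:...\|)` prefix consumes the page name and the pipe without capturing them, so group 1 is always the text that should stay.

```python
import re

# Same pattern as above: the optional "page name|" prefix is matched but
# not captured, so group 1 is always the text that should remain.
wikilink_rx = re.compile(r'\[\[(?:[^|\]]*\|)?([^\]]+)\]\]')

sample = ("The CD is composed almost entirely of [[cover version]]s of "
          "[[The Beatles]] songs which George Martin "
          "[[record producer|produced]] originally.")

# Replace every wiki link with its captured display text.
result = wikilink_rx.sub(r'\1', sample)
print(result)
```

Note that whitespace around the pipe is not trimmed, so a link written as `[[Name of page | Text to display]]` would keep the leading space of the display text.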

Note: you may also find some MediaWiki parsers in http://www.mediawiki.org/wiki/Alternative_parsers.

kennytm
  • Doesn't support links with a "]" character inside. Don't know if that is part of the MediaWiki syntax, but my answer does allow these (at the expense of being quite a bit harder to read!) – mgiuca Feb 08 '11 at 03:32
2

You're going down the wrong path. Wiki markup is notoriously hard to parse, and there are so many exceptions, edge cases and just plain busted markup that building your own regexps to do it is near-impossible. Since you're using Python, I'd suggest mwlib, which will do the hard work for you:

http://code.pediapress.com/wiki/wiki/mwlib

lambshaanxy
0

I came up with a regex which should do the trick. Let me know if there's anything wrong with it:

r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"

(Ick, I will never get over how ugly these things are!)

Group 1 should give you the wiki link. Group 4 should give you the link text, or None if there is no pipe.

An explanation:

  • (([^\]|]|\](?=[^\]]))*) finds all sequences of characters which are not "|" or "]]". It does this by finding all sequences of characters which are not "|" or "]" OR which are a "]" followed by a character which is not a "]".
  • (\|(([^\]]|\](?=[^\]]))*))? optionally matches a "|" followed by the same regex as above, to get the link text part. The regex is slightly changed in that it allows "|" characters.
  • The whole thing is surrounded by \[\[ ... \]\].
  • The (?=...) notation matches a regex but doesn't consume its characters, so they can be matched subsequently. I use it so as not to consume a "|" character which may appear immediately after a "]".

Edit: I fixed the regex to allow a "]" immediately before the "|", as in [[abcd]|efgh]].
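
To turn this into a substitution, one option is to pass a function to `re.sub` that picks group 4 when it is present and falls back to group 1 otherwise. This is only a sketch; the names `link_rx`, `link_text`, `sample` and `converted` are made up for the illustration.

```python
import re

# The regex from this answer: group 1 is the page name,
# group 4 is the link text (None when there is no pipe).
link_rx = re.compile(
    r"\[\[(([^\]|]|\](?=[^\]]))*)(\|(([^\]]|\](?=[^\]]))*))?\]\]"
)

def link_text(match):
    """Return the display text if there is one, else the page name."""
    text = match.group(4)
    return text if text is not None else match.group(1)

sample = ("The CD is composed almost entirely of [[cover version]]s of "
          "[[The Beatles]] songs which George Martin "
          "[[record producer|produced]] originally.")

converted = link_rx.sub(link_text, sample)
print(converted)

# A single "]" inside the link is tolerated, as described in the edit note.
print(link_rx.sub(link_text, "[[abcd]|efgh]]"))
```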

mgiuca
  • One difference between mine and @KennyTM is that while mine provides both the page name and the link text, Kenny's only provides the link text -- *however*, if there is no "|", Kenny's will give you the page name *as* the link text, which is probably what you want. Note my comment on his though. – mgiuca Feb 08 '11 at 03:34
0

This should work:

import re
text = "The CD is composed almost entirely of [[cover version]]s of [[The Beatles]] songs which George Martin [[record producer|produced]] originally."
newText = re.sub(r'\[\[([^\|\]]+\|)?([^\]]+)\]\]',r'\2',text)
Cassie Dee