
I'm trying to parse raw Wikipedia article content, e.g. the article on Sweden, using re.sub(). However, I'm running into problems substituting blocks of {{some text}}, because they can themselves contain further blocks of {{some text}}.

Here's an abbreviated example from the above article:

{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}
Some text I do not want parsed.
{{Link GA|eo}}

This braces-within-braces nesting could theoretically go arbitrarily deep.

If I match greedily with {{.+}}, everything from {{Infobox through eo}} is matched, including the text I do not want matched.

If I match non-greedily with {{.+?}}, the part from {{Infobox through icon=no}} is matched, as is {{Link GA|eo}}. But then I'm left with the stray string | common_name [...] not want parsed.
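
For reference, here is a quick sketch of both behaviours against the abbreviated example above, using the standard re module:

import re

s = """{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}
Some text I do not want parsed.
{{Link GA|eo}}"""

# Greedy: runs from the first {{ to the very last }}, swallowing everything in between.
print(re.findall(r'\{\{.+\}\}', s, re.DOTALL))

# Non-greedy: stops at the first }}, so the Infobox is cut off after icon=no}}
# and its real closing braces are left behind as plain text.
print(re.findall(r'\{\{.+?\}\}', s, re.DOTALL))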

I also tried variants of \{\{.+(\{\{.+\}\})*.+\}\} and \{\{[^\{]+(\{\{[^\{]+\}\})*[^\{]+\}\}, in the hopes of matching only sub-blocks within the larger block, but to no avail.

I'd list everything I've tried, but I honestly can't remember half of it, and I doubt it'd be of much use anyway. It always comes back to the same problem: for the closing braces }} to match, there need to have been the same number of {{ occurrences beforehand.
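
Just to show what I mean by counting, here's the kind of hand-rolled scanner I'm trying to avoid writing. A rough sketch, assuming the braces are always balanced (strip_templates is just a name I made up for illustration):

def strip_templates(text):
    """Drop balanced {{...}} blocks by tracking brace depth.
    Assumes every {{ has a matching }}."""
    out = []
    depth = 0
    i = 0
    while i < len(text):
        if text.startswith('{{', i):
            depth += 1
            i += 2
        elif text.startswith('}}', i) and depth:
            depth -= 1
            i += 2
        else:
            if depth == 0:
                out.append(text[i])
            i += 1
    return ''.join(out)

# s is the sample wikitext from above
print(strip_templates(s).strip())  # Some text I do not want parsed.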

Is this even solvable using regular expressions, or do I need another solution?

Joel Hinz
  • When there are nested patterns, grammars are the way to go. – thefourtheye Nov 14 '13 at 07:32
  • Of possible interest: [Python MediaWiki parser](https://github.com/pediapress/mwlib), which apparently wikipedia itself uses (for pdf export) – DanielB Nov 14 '13 at 07:39
  • "Is this even solvable using regular expressions" -- [not with standard regexes, no](http://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns). Paired delimiters and recursion are exactly where they fall down. However, most regex libraries can actually handle this because of lookahead and lookbehind extensions. – jscs Nov 14 '13 at 07:40
  • Thanks, all! I suspected as much, but wanted to be sure. Grammars are a bit overkill for me, but I'll look into lookaround extensions. @DanielB: I'll take a look at the source code, but I'm mostly using wikipedia as an arbitrary source to learn some more regex, so the end results are less important to me than the concepts. Thanks nonetheless for the link. – Joel Hinz Nov 14 '13 at 07:47
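
For anyone landing here later: the extension that actually copes with nesting is pattern recursion rather than lookaround, and the third-party regex module (pip install regex) supports it. A minimal sketch, assuming the braces in the input are balanced:

import regex

s = """{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}
Some text I do not want parsed.
{{Link GA|eo}}"""

# (?R) recurses into the whole pattern, so a nested {{...}} block is consumed
# as part of its parent instead of terminating the match early.
balanced = regex.compile(r'\{\{(?:[^{}]|(?R))*\}\}')
print(balanced.sub('', s).strip())  # Some text I do not want parsed.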

1 Answer

Have you considered mwparserfromhell?

import mwparserfromhell

s = """{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}
Some text I do not want parsed.
{{Link GA|eo}}"""

wikicode = mwparserfromhell.parse(s)
# filter_templates() returns every template in the text; the first one
# is the whole Infobox, with the nested template still inside it.
print(wikicode.filter_templates()[0])

Prints:

{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}
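
And if what you actually want is the text with the templates removed rather than the templates themselves, strip_code() should do it:

# strip_code() renders the parsed wikicode back to plain text, dropping
# templates and other non-text markup along the way.
print(wikicode.strip_code().strip())  # Some text I do not want parsed.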
TerryA
  • Upvote for what sounds like a very good idea for MW parsing (plus the module's got a great name), but I'm interested in learning the process rather than in the actual contents of the articles. Thanks for the reply! – Joel Hinz Nov 14 '13 at 07:48