I'm trying to parse raw Wikipedia article content, e.g. the article on Sweden, using re.sub(). However, I am running into problems trying to substitute blocks of {{some text}}, because they can contain further blocks of {{some text}}.
Here's an abbreviated example from the above article:
{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}
Some text I do not want parsed.
{{Link GA|eo}}
In theory, these curly-brace blocks can be nested inside one another to any depth.
If I match the greedy block of {{.+}}, everything is matched from {{Infobox to eo}}, including the text I do not want matched.
If I match the non-greedy block of {{.+?}}, the part from {{Infobox to icon=no}} is matched, as is {{Link GA|eo}}. But then I'm left with the string | common_name [...] not want parsed.
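The greedy and non-greedy behaviours described above can be reproduced with a short script. This is only a minimal sketch: it assumes re.DOTALL is set so that . can cross the line breaks inside the infobox, and it assumes the non-greedy pattern is {{.+?}}.

```python
import re

# Abbreviated sample from the question.
text = """{{Infobox country
| conventional_long_name = Kingdom of Sweden
| native_name = {{native name|sv|Konungariket Sverige|icon=no}}
| common_name = Sweden
}}
Some text I do not want parsed.
{{Link GA|eo}}"""

# Assumption: DOTALL is needed because the block spans several lines.
greedy = re.sub(r"\{\{.+\}\}", "", text, flags=re.DOTALL)
lazy = re.sub(r"\{\{.+?\}\}", "", text, flags=re.DOTALL)

# Greedy runs from the first "{{" to the last "}}", so nothing survives.
print(repr(greedy))

# Non-greedy stops at the first "}}", leaving the tail of the infobox behind.
print(repr(lazy))
```

Neither variant tracks how many {{ are still open, which is why both give the wrong result on nested input.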
I also tried variants of \{\{.+(\{\{.+\}\})*.+\}\} and \{\{[^\{]+(\{\{[^\{]+\}\})*[^\{]+\}\}, in the hopes of matching only sub-blocks within the larger block, but to no avail.
I'd list everything I've tried, but I honestly can't remember half of it, and I doubt it would be of much use anyway. It always comes back to the same problem: for the closing double braces }} to match, the same number of {{ must have occurred beforehand.
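The balancing requirement can be made concrete with a plain character scan that keeps a depth counter. This is a rough sketch rather than a regular expression: the name strip_templates is hypothetical, and it assumes braces only ever open and close in pairs (it makes no attempt to handle triple braces or other wiki markup).

```python
def strip_templates(text: str) -> str:
    """Drop every balanced {{...}} block, however deeply nested (hypothetical helper)."""
    kept = []
    depth = 0  # number of "{{" currently open
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "{{":
            depth += 1
            i += 2
        elif pair == "}}" and depth > 0:
            depth -= 1
            i += 2
        else:
            # Keep characters only while outside every block.
            if depth == 0:
                kept.append(text[i])
            i += 1
    return "".join(kept)


sample = "a {{x|{{y}}|z}} b {{w}} c"
print(repr(strip_templates(sample)))
```

A closing }} is only consumed while depth is above zero, which is exactly the "same number of {{ beforehand" condition.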
Is this even solvable using regular expressions, or do I need another solution?