I'm trying to convert some documents (Wikipedia articles) that contain links written in a specific markup convention. I want to render them in a reader-friendly form, without the link syntax. The convention is:
- Names in double brackets of the pattern [[Article Name|Display Name]] should be captured ignoring the pipe and the preceding text, as well as the enclosing brackets: Display Name.
- Names in double brackets of the pattern [[Article Name]] should be captured without the brackets: Article Name.
Nested approach (produces desired result)
I know I can handle both cases with a nested re.sub() expression. For example, this does what I want:
import re

s = 'including the [[Royal Danish Academy of Sciences and Letters|Danish Academy of Sciences]], [[Norwegian Academy of Science and Letters|Norwegian Academy of Sciences]], [[Russian Academy of Sciences]], and [[National Academy of Sciences|US National Academy of Sciences]].'
re.sub(r'\[\[(.*?\|)(.*?)\]\]', r'\2',       # case 1: [[Article Name|Display Name]]
       re.sub(r'\[\[([^|]+)\]\]', r'\1', s)  # case 2: [[Article Name]]
)
# result is correct:
# 'including the Danish Academy of Sciences, Norwegian Academy of Sciences, Russian Academy of Sciences, and US National Academy of Sciences.'
Single-pass approach (looking for solution here)
For efficiency and my own improvement, I would like to know whether there is a single-pass approach.
What I have tried: In an optional group 1, I want to greedy-capture everything between [[ and a | (if it exists). Then in group 2, I want to capture everything else up to the ]]. Then I want to return only group 2.
My problem is in making the greedy capture optional:
re.sub(r'\[\[([^|]*\|)?(.*?)\]\]', r'\2', s)
# does NOT return the desired result:
# 'including the Danish Academy of Sciences, Norwegian Academy of Sciences, US National Academy of Sciences.'
# is missing: 'Russian Academy of Sciences, and '
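A likely explanation for the missing text: in the single-pass attempt, [^|]* excludes only the pipe, so starting at [[Russian it can run past the closing ]] and into the next link until it finds that link's pipe. The whole span then collapses to the next link's display name. The following is a minimal sketch of one possible single-pass variant, not a definitive solution. It assumes that neither article names nor display names ever contain ] or |, and the names WIKILINK and strip_wikilinks are hypothetical, introduced only for illustration.

```python
import re

# Assumption: link text never contains ']' or '|'.
# (?:[^\]|]*\|)?  optional, non-capturing "Article Name|" prefix that cannot cross ']]'
# ([^\]|]*)       the text to keep (display name, or article name if there is no pipe)
WIKILINK = re.compile(r'\[\[(?:[^\]|]*\|)?([^\]|]*)\]\]')


def strip_wikilinks(text: str) -> str:
    """Replace [[Article|Display]] with Display and [[Article]] with Article."""
    return WIKILINK.sub(r'\1', text)


s = ('including the [[Royal Danish Academy of Sciences and Letters|Danish Academy of Sciences]], '
     '[[Norwegian Academy of Science and Letters|Norwegian Academy of Sciences]], '
     '[[Russian Academy of Sciences]], '
     'and [[National Academy of Sciences|US National Academy of Sciences]].')

print(strip_wikilinks(s))
```

Excluding ] from both character classes keeps each match inside a single pair of double brackets, and making the prefix group non-capturing means the kept text is always group 1.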