
I'm trying to convert some documents (Wikipedia articles) which contain links with a specific markdown convention. I want to render these to be reader-friendly without links. The convention is:

  1. Names in double brackets of the pattern [[Article Name|Display Name]] should be rendered as the display name only, dropping the article name, the pipe, and the enclosing brackets: Display Name.
  2. Names in double-brackets of the pattern [[Article Name]] should be captured without the brackets: Article Name.

Nested approach (produces desired result)

I know I can handle #1 and #2 with a nested re.sub() expression. For example, this does what I want:

import re

s = 'including the [[Royal Danish Academy of Sciences and Letters|Danish Academy of Sciences]], [[Norwegian Academy of Science and Letters|Norwegian Academy of Sciences]], [[Russian Academy of Sciences]], and [[National Academy of Sciences|US National Academy of Sciences]].'

# raw strings keep the regex backslashes from being read as string escapes
re.sub(r'\[\[(.*?\|)(.*?)\]\]', r'\2',        # case 1
       re.sub(r'\[\[([^|]+)\]\]', r'\1', s)   # case 2
)
# result is correct:
'including the Danish Academy of Sciences, Norwegian Academy of Sciences, Russian Academy of Sciences, and US National Academy of Sciences.'

Single-pass approach (looking for solution here)

For efficiency and my own improvement, I would like to know whether there is a single-pass approach.

What I have tried: In an optional group 1, I want to greedy-capture everything between [[ and a | (if it exists). Then in group 2, I want to capture everything else up to the ]]. Then I want to return only group 2.

My problem is in making the greedy capture optional:

re.sub(r'\[\[([^|]*\|)?(.*?)\]\]', r'\2', s)
# does NOT return the desired result:
'including the Danish Academy of Sciences, Norwegian Academy of Sciences, US National Academy of Sciences.'
# is missing: 'Russian Academy of Sciences, and '
C8H10N4O2
  • What about [`\[{2}(?:(?:(?!]{2})[^|])+\|)*((?:(?!]{2})[^|])+)]{2}`](https://regex101.com/r/9H34T9/2)? – ctwheels Mar 28 '18 at 15:44
  • I guess that works, if you want to answer go ahead. I'm interested in whether it can be done without look-aheads or look-backs. Seems like the nested sub would be faster, but would need to test. – C8H10N4O2 Mar 28 '18 at 15:49
  • Have you thought about using a Python library that transforms wiki markup (Creole) into HTML markup? https://pypi.python.org/pypi/python-creole/ – Lupanoide Mar 28 '18 at 15:53
  • I added an answer. I also added a second regex in the answer that doesn't use lookarounds if you'd prefer to use that, but it's less specific. By that, what I mean is that if a string includes `]` in it, it will stop there (instead of the location of `]]`) – ctwheels Mar 28 '18 at 15:54
  • @Lupanoide I didn't know that was a thing -- thanks. – C8H10N4O2 Mar 28 '18 at 15:59

1 Answer


See regex in use here

\[{2}(?:(?:(?!]{2})[^|])+\|)*((?:(?!]{2})[^|])+)]{2}
  • \[{2} Match [[
  • (?:(?:(?!]{2})[^|])+\|)* Matches the following any number of times
    • (?:(?!]{2})[^|])+ Tempered greedy token matching any character one or more times except | or location that matches ]]
    • \| Matches | literally
  • ((?:(?!]{2})[^|])+) Capture the following into capture group 1
    • (?:(?!]{2})[^|])+ Tempered greedy token matching any character one or more times except | or location that matches ]]
  • ]{2} Match ]]

Replacement \1

Result:

including the Danish Academy of Sciences, Norwegian Academy of Sciences, Russian Academy of Sciences, and US National Academy of Sciences.
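In Python, the pattern and replacement above could be applied in a single re.sub() pass roughly as follows. This is a minimal sketch: the name LINK_TEMPERED is only illustrative, and the sample string is the one from the question.

```python
import re

# Pattern copied from the answer above; the name LINK_TEMPERED is hypothetical.
LINK_TEMPERED = re.compile(r"\[{2}(?:(?:(?!]{2})[^|])+\|)*((?:(?!]{2})[^|])+)]{2}")

# Sample text from the question, split across lines for readability.
s = ("including the [[Royal Danish Academy of Sciences and Letters|Danish Academy of Sciences]], "
     "[[Norwegian Academy of Science and Letters|Norwegian Academy of Sciences]], "
     "[[Russian Academy of Sciences]], and "
     "[[National Academy of Sciences|US National Academy of Sciences]].")

# Group 1 holds the text after the last pipe, or the whole name if there is no pipe.
cleaned = LINK_TEMPERED.sub(r"\1", s)
print(cleaned)
```

Compiling the pattern once is a small convenience when many articles are processed; calling re.sub() directly with the same pattern string should behave the same way.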

Another alternative that may work for you is the following. It's less specific than the regex above but doesn't include any lookarounds.

\[{2}(?:[^]|]+\|)*([^]|]+)]{2}
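To illustrate what "less specific" can mean in practice, here is a small Python sketch comparing the two patterns. The names LINK_SIMPLE and LINK_TEMPERED and the second sample string are made up for this example; in that sample the display text contains a single ], so the simpler pattern finds no match and leaves the link as it is.

```python
import re

# Both patterns are copied from the answer; the constant names are hypothetical.
LINK_SIMPLE = re.compile(r"\[{2}(?:[^]|]+\|)*([^]|]+)]{2}")
LINK_TEMPERED = re.compile(r"\[{2}(?:(?:(?!]{2})[^|])+\|)*((?:(?!]{2})[^|])+)]{2}")

# Ordinary links: both patterns agree.
plain = "the [[Russian Academy of Sciences]] and [[National Academy of Sciences|US National Academy of Sciences]]"
plain_simple = LINK_SIMPLE.sub(r"\1", plain)
plain_tempered = LINK_TEMPERED.sub(r"\1", plain)

# Invented edge case: a lone ']' inside the display text.
tricky = "see [[Array|a[0] element]]"
tricky_simple = LINK_SIMPLE.sub(r"\1", tricky)      # no match, text unchanged
tricky_tempered = LINK_TEMPERED.sub(r"\1", tricky)  # link replaced by display text

print(plain_simple)
print(tricky_simple)
print(tricky_tempered)
```

If link text never contains a stray ], the simpler pattern is probably the easier one to read and maintain.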
ctwheels
  • Yes, the alternative is what I was looking for. A reason that it works (and my attempt didn't) is that it is also excluding the close brackets as well as the pipe from the optional prefix. Thanks. – C8H10N4O2 Mar 28 '18 at 15:56
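The explanation in the comment above can be checked by inspecting the capture groups directly. The sketch below is only illustrative (the variable names are invented): it runs the question's single-pass attempt and the answer's lookaround-free alternative on the tail of the sample sentence.

```python
import re

# The question's attempt and the answer's alternative; variable names are hypothetical.
attempt = re.compile(r"\[\[([^|]*\|)?(.*?)\]\]")
alternative = re.compile(r"\[{2}(?:[^]|]+\|)*([^]|]+)]{2}")

text = ("[[Russian Academy of Sciences]], and "
        "[[National Academy of Sciences|US National Academy of Sciences]].")

# In the attempt, [^|]* may cross "]]" and "[[", so the optional prefix
# runs from the first link all the way to the pipe in the second link.
m = attempt.search(text)
swallowed_prefix = m.group(1)
kept_text = m.group(2)

# The alternative also excludes ']' from the prefix, so it stays inside one link.
first_link = alternative.search(text).group(1)

print(repr(swallowed_prefix))
print(repr(kept_text))
print(repr(first_link))
```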