Extract complex links with Python

Question

I have a regex that should find all the alphanumeric text between double square brackets, like the example in the link above. That text is in fact a link with a description (for example, [[Toto|there's a link here]] is a link to the page Toto).

The problem is that there can be other text between square brackets inside a link, so, as you can see in the link, the regex doesn't recognize the closing brackets (]]) at the end.

There's also another important pattern in those links: pipes (|) separate the text into two or three parts. When there are two parts, I only want the text on the left; when there are three parts, I want the text on the right.

Example:

[[File:Euclid flowchart 1.png|vignette|[[Flowchart]] of an algorithm ([[Euclid's algorithm]]).]]

I only want the [[Flowchart]] of an algorithm ([[Euclid's algorithm]]). part (this is a PNG with a description below it, and the description contains other links).

[[Babylone|Babyloniens]]

I want Babylone

In the first example there are other links inside, but I can easily extract them with my first regex or with recursion.

You can see an example of my code here

In general it's not a good idea to parse nested brackets with regex; see https://stackoverflow.com/questions/133601/can-regular-expressions-be-used-to-match-nested-patterns. I suggest using a library such as https://wikipedia.readthedocs.io/en/latest/ to parse wiki text correctly. — Alex Hall, Mar 07 '19 at 10:36
It is possible to solve this with a regex; however, it will rely on lots of assumptions and the pattern will be long and cumbersome. A parser is recommended here. — Wiktor Stribiżew, Mar 07 '19 at 10:38
Thanks for your answers. But I would like to do it without any external libraries. The goal here is to do it myself. Do you have an example of how to parse this? — Jack, Mar 07 '19 at 10:42

score 1 · Answer 1 · answered Mar 07 '19 at 11:06

You could try this pattern: \[\[(.+?)\|(.+?)(\|(.+))?\]\]

The pattern captures the strings between the pipes (|) into groups. I used the non-greedy quantifier .+? for the first two parts; a greedy one would capture everything up to the last pipe, whereas the non-greedy one stops at the first occurrence of a pipe. The last quantifier, .+, is greedy because there we want to capture everything up to the last ]], which is the opposite of what we wanted before.

Also, (\|(.+))? means that the third part (including the additional pipe character) is optional (it can occur at most once).
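To see the difference between the two quantifiers, here is a minimal sketch using Python's re module. The sample string [[a|b|c]] is made up purely for illustration.

```python
import re

# Hypothetical sample with three pipe-separated parts.
text = "[[a|b|c]]"

# Non-greedy: .+? stops at the first pipe it can.
lazy = re.match(r"\[\[(.+?)\|", text).group(1)

# Greedy: .+ grabs as much as possible, so it stops at the last pipe.
greedy = re.match(r"\[\[(.+)\|", text).group(1)

print(lazy)    # a
print(greedy)  # a|b
```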

This also needs extra logic: first check whether the fourth capturing group is present. If it is, the string was split by the pipes into three parts, and the fourth group is the rightmost part. If it isn't, the string was split into only two parts, in which case you want the first capturing group.
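Put together, the check could look like the sketch below. The function name extract_target is my own, and it assumes each input string is exactly one link (hence fullmatch); in longer text the greedy last group would run on to the last ]] on the line, so this is not a general wiki-text parser.

```python
import re

# The pattern from this answer, compiled once.
LINK_PATTERN = re.compile(r"\[\[(.+?)\|(.+?)(\|(.+))?\]\]")


def extract_target(link):
    """Return the wanted part of a single [[...]] link, or None.

    Hypothetical helper: assumes `link` is one whole link and
    contains at least one pipe.
    """
    match = LINK_PATTERN.fullmatch(link)
    if match is None:
        return None
    # Group 4 is only set when there are three parts: take the rightmost.
    if match.group(4) is not None:
        return match.group(4)
    # Two parts: take the text on the left.
    return match.group(1)


print(extract_target("[[Babylone|Babyloniens]]"))
print(extract_target(
    "[[File:Euclid flowchart 1.png|vignette|"
    "[[Flowchart]] of an algorithm ([[Euclid's algorithm]]).]]"
))
```

The first call prints Babylone and the second prints the description with its inner links intact, which you can then feed to your first regex or process recursively.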

