Python Regex stop at '|' character

Question

I am trying to find a URL in a DokuWiki page using a Python regex. DokuWiki formats external links like this:

[['insert URL'|Name of External Link]]

I need to design a python regex that captures the URL but stops at '|'

I could try and type out every non-alphanumeric character besides '|' (something like this: (https?://[\w|\.|\-|\?|\/|\=|\+|\!|\@|\#|\$|\%|^|&]*) )

However that sounds really tedious and I might miss one.

Thoughts?

`[^\|]+` reads as: anything but "|". – PepperoniPizza, Jun 12 '14 at 23:33
In addition to that, links don't contain whitespace. – hjpotter92, Jun 12 '14 at 23:38

score 0 · Answer 1 · answered Jun 12 '14 at 23:41

You can use a negated character set, written [^things to not match].

In this case, you want to not match |, so you would have [^|].

import re

bool(re.match("[^|]", "a"))
#>>> True

bool(re.match("[^|]", "|"))
#>>> False
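
Applied to the question, the same negated set can replace the hand-written list of allowed characters. The following is a minimal sketch rather than a definitive pattern: the helper name find_urls and the sample text are made up for illustration, and it also excludes whitespace and ] on the assumption that a URL inside a DokuWiki link contains neither.

```python
import re

# Assumption: a URL runs until the first '|', ']' or whitespace character.
URL_PATTERN = re.compile(r"https?://[^|\s\]]+")


def find_urls(text):
    """Return every http(s) URL found in text (hypothetical helper)."""
    return URL_PATTERN.findall(text)


sample = "See [[http://bla.org/path/to/page|Name of External Link]] for details."
print(find_urls(sample))
#>>> ['http://bla.org/path/to/page']
```

Because the set is negated, characters such as ?, = or & need no special listing; only the characters that end a URL are named.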

Veedrac


score 0 · Answer 2 · answered Jun 12 '14 at 23:43

You want one or more characters that are not |, followed by a |, followed by one or more characters that are not ], with everything enclosed in double square brackets. This translates to:

import re

# Group 1: everything up to the '|'; group 2: everything up to the closing ']]'.
pattern = re.compile(r'\[\[([^|]+)\|([^]]+)\]\]')
print(pattern.match("[[http://bla.org/path/to/page|Name of External Link]]").groups())

This would print:

('http://bla.org/path/to/page', 'Name of External Link')

If you don't need the name of the link, you can just remove the parentheses around the second group. The Python documentation for the re module covers regular expressions in more detail.
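
To illustrate the single-group variant on a page with several links, here is a small sketch. The page text is invented for the example, and using findall instead of match is an assumption made here so that links anywhere in the text are found; with one capturing group, findall returns just the captured URLs.

```python
import re

# Only the URL is captured; the link name is matched but not kept.
link_pattern = re.compile(r"\[\[([^|]+)\|[^\]]+\]\]")

# Hypothetical page content for illustration.
page = (
    "Docs: [[http://bla.org/path/to/page|Name of External Link]] and "
    "[[https://example.com/wiki?id=start|Start page]]."
)

urls = link_pattern.findall(page)
print(urls)
#>>> ['http://bla.org/path/to/page', 'https://example.com/wiki?id=start']
```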

Close, but not working. The proper regex is: `re.compile('\[\[([^|]+)\|([^]]+)\]\]')`. You've got extra chars like a backslash and a slash in your regex that make it fail. – Eric, Aug 14 '21 at 11:05
