I am trying to find a URL in a DokuWiki page using a Python regex. DokuWiki formats external links like this:

[['insert URL'|Name of External Link]]

I need to design a Python regex that captures the URL but stops at '|'.

I could try to type out every non-alphanumeric character besides '|', something like this: (https?://[\w|\.|\-|\?|\/|\=|\+|\!|\@|\#|\$|\%|^|&]*)

However, that sounds really tedious, and I might miss one.

Thoughts?

Padraic Cunningham
user3735930

2 Answers

You can use a negated character class, written [^characters to exclude].

In this case, you want to not match |, so you would have [^|].

import re

bool(re.match("[^|]", "a"))
#>>> True

bool(re.match("[^|]", "|"))
#>>> False
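
Applied to the question, the same negated class can replace the hand-written list of URL characters. This is a minimal sketch, assuming the URL starts with http:// or https:// and ends at the first | or ]; the names URL_PATTERN and sample are made up for illustration.

```python
import re

# Assumption: a URL runs from "http://" or "https://" up to the first "|" or "]".
URL_PATTERN = re.compile(r"https?://[^|\]]+")

# Sample link text in the format described in the question.
sample = "[[http://bla.org/path/to/page|Name of External Link]]"

# search() scans the string, so the leading "[[" does not need to be matched.
match = URL_PATTERN.search(sample)
url = match.group(0) if match else None
print(url)
```

Using search rather than match matters here, because match only succeeds at the very start of the string.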
Veedrac

You want one or more characters that are not |, followed by a |, followed by one or more characters that are not ], with the whole thing enclosed in double square brackets. This translates to:

import re
# Group 1: the URL (anything but "|"); group 2: the link name (anything but "]").
pattern = re.compile(r'\[\[([^|]+)\|([^\]]+)\]\]')
print(pattern.match("[[http://bla.org/path/to/page|Name of External Link]]").groups())

This would print:

('http://bla.org/path/to/page', 'Name of External Link')

If you don't need the name of the link, you can just remove the parentheses around the second group. More on regular expressions in Python can be found in the documentation for the re module.
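
To pull every link out of a whole page rather than a single string, the same idea should work with re.finditer. This is only a sketch: the helper name extract_links and the sample page text are hypothetical, and it assumes the link name is optional, so links written as [[URL]] are caught too.

```python
import re

# Group "url": everything up to the first "|" or "]".
# Group "name": optional text between "|" and the closing "]]".
LINK_PATTERN = re.compile(r"\[\[(?P<url>[^|\]]+)(?:\|(?P<name>[^\]]*))?\]\]")


def extract_links(text):
    """Return a list of (url, name) tuples; name is None when absent."""
    return [(m.group("url"), m.group("name")) for m in LINK_PATTERN.finditer(text)]


# Hypothetical page text containing one named and one unnamed link.
page = (
    "See [[http://bla.org/path/to/page|Name of External Link]] "
    "and [[https://example.com/docs]]."
)

for url, name in extract_links(page):
    print(url, name)
```

Note that this pattern accepts anything inside double square brackets, not only URLs. To restrict it to external links, the url group could start with https?:// instead.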

Ion Scerbatiuc
  • Close but not working. The proper regex is: `re.compile('\[\[([^|]+)\|([^]]+)\]\]')`. You've got extra chars like a backslash and a slash in your regex that make it fail. – Eric Aug 14 '21 at 11:05