
I am trying to match URLs or relative paths that do not contain a second colon (after the one in the protocol, e.g., http(s)://).

I want to reject URLs of the form

https://en.wikipedia.org/wiki/Special:BookSources/0-8018-1841-9

or paths of the form

/wiki/Special:BookSources/0-8018-1841-9

with one exception. I want to keep the ones with a second colon if it is followed by an underscore:

https://en.wikipedia.org/wiki/The_Post_Card:_From_Socrates_to_Freud_and_Beyond

or

/wiki/The_Post_Card:_From_Socrates_to_Freud_and_Beyond

The regex I have now (based on this question and this one) is ^[^:]*[:]*.*(/wiki/)[^:]+$, which solves the first part of my requirement, but not the second.
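To make the gap concrete, here is a small sketch that runs the current regex over the examples above. The helper name `matches_current` and the sample `/wiki/Derrida` are hypothetical, added only for illustration.

```python
import re

# The regex from the question, unchanged.
CURRENT_PATTERN = re.compile(r"^[^:]*[:]*.*(/wiki/)[^:]+$")

def matches_current(link: str) -> bool:
    """Return True if the question's current regex accepts the link."""
    return CURRENT_PATTERN.match(link) is not None

samples = [
    "https://en.wikipedia.org/wiki/Special:BookSources/0-8018-1841-9",
    "/wiki/Special:BookSources/0-8018-1841-9",
    "/wiki/The_Post_Card:_From_Socrates_to_Freud_and_Beyond",
    "/wiki/Derrida",  # hypothetical colon-free path
]
for link in samples:
    print(matches_current(link), link)
```

The two `Special:` links are rejected as intended, but so is the `The_Post_Card:_From...` path, because `[^:]+$` forbids every colon after `/wiki/`.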

How would I account for the special case of a colon followed by an underscore?

tat

2 Answers

1

A negative lookahead might make the most sense here:

^https?://(?!.*:[^_]).*wiki.*

Note that, strictly speaking, /wiki/Special:BookSources/0-8018-1841-9 is not a URL because it has no protocol; it is a path. The regex above requires the protocol, so you may need to modify it slightly to cover paths as well, but the negative lookahead is an easy solution to your problem.
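One possible modification, as a sketch rather than a definitive pattern, is to make the protocol optional so that bare paths are checked by the same lookahead. The names `WIKI_LINK` and `is_wanted` are hypothetical, and anchoring on `/wiki/` rather than `wiki` is an assumption about the input.

```python
import re

# Assumption: the protocol is optional, so both full URLs and bare paths match.
# The negative lookahead rejects any colon (after the optional protocol)
# that is followed by a character other than an underscore.
WIKI_LINK = re.compile(r"^(?:https?://)?(?!.*:[^_]).*/wiki/.*")

def is_wanted(link: str) -> bool:
    """Return True for wiki links whose only extra colons are followed by '_'."""
    return WIKI_LINK.match(link) is not None

for link in (
    "https://en.wikipedia.org/wiki/Special:BookSources/0-8018-1841-9",
    "/wiki/Special:BookSources/0-8018-1841-9",
    "https://en.wikipedia.org/wiki/The_Post_Card:_From_Socrates_to_Freud_and_Beyond",
    "/wiki/The_Post_Card:_From_Socrates_to_Freud_and_Beyond",
):
    print(is_wanted(link), link)
```

If the regex engine skips the optional protocol group on backtracking, the lookahead then sees the colon in `https://` and fails, so a full URL cannot slip through that way. A URL with an explicit port (for example `host:8080`) would also be rejected by this pattern.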

NightOwl888
Tim Biegeleisen
  • It won't accept "URLs with a second colon if it is followed by an underscore" – Naveed Feb 25 '18 at 03:20
  • Updated to clarify URL vs. path. The regex provided satisfies both conditions: see [pythex demo](https://pythex.org/?regex=%5E.*(%3F!.*%3A%5B%5E_%5D).*wiki.*&test_string=https%3A%2F%2Fen.wikipedia.org%2Fwiki%2FThe_Post_Card%3A_From_Socrates_to_Freud_and_Beyond&ignorecase=0&multiline=0&dotall=0&verbose=0) – tat Feb 25 '18 at 03:29
1

When working with url paths that come in a variety of forms, different schemes, or without domain anchors, I like to use urlpath.

Installation:

pip install urlpath

You could use the urlpath library to check each part of the URL after the domain for a colon that is not followed by an underscore. This approach is useful if you want to avoid regex.

Example:

>>> from urlpath import URL
>>> url = URL('https://en.wikipedia.org/wiki/Special:BookSources/0-8018-1841-9')
>>> any(':' in i and ':_' not in i for i in url.parts[1:])
True
>>> url2 = URL('https://en.wikipedia.org/wiki/The_Post_Card:_From_Socrates_to_Freud_and_Beyond')
>>> any(':' in i and ':_' not in i for i in url2.parts[1:])
False

In this example, the any() expressions return True for the URLs you want to ignore. The substring test can miss a part that contains both ':_' and a bare colon, so if you want something a little more robust, you can also filter each part with a regex.

>>> import re
>>> any(re.search(':[^_]', i) for i in url.parts[1:])
True
>>> any(re.search(':[^_]', i) for i in url2.parts[1:])
False
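If installing a third-party package is not an option, the same per-segment check can be sketched with the standard library's urllib.parse instead of urlpath. This is a swapped-in alternative, not the urlpath API, and the names `BARE_COLON` and `has_bare_colon` are hypothetical.

```python
import re
from urllib.parse import urlsplit

# A colon followed by any character other than an underscore.
BARE_COLON = re.compile(r":[^_]")

def has_bare_colon(link: str) -> bool:
    """Return True if any path segment has a colon not followed by '_'."""
    # urlsplit separates scheme and host from the path, so the colon in
    # 'https://' (or in a port number) is never inspected.
    path = urlsplit(link).path
    return any(BARE_COLON.search(segment) for segment in path.split("/"))

for link in (
    "https://en.wikipedia.org/wiki/Special:BookSources/0-8018-1841-9",
    "/wiki/Special:BookSources/0-8018-1841-9",
    "https://en.wikipedia.org/wiki/The_Post_Card:_From_Socrates_to_Freud_and_Beyond",
    "/wiki/The_Post_Card:_From_Socrates_to_Freud_and_Beyond",
):
    print(has_bare_colon(link), link)
```

As with the urlpath version, True marks a link to ignore. Because only the path component is examined, this works for both full URLs and paths that start with a slash.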

If you are making any requests with these URLs, I'd recommend giving the urlpath library a go. It combines the flexibility of pathlib with the functionality of urllib.parse, and has requests built in.

>>> url.get()
<Response [200]>
Mike Peder