Regex check if given string is relative URL

Question

I have read this question about how to check whether a string is an absolute or relative URL. My problem is that I need a regex to check whether a given string is a relative URL, i.e. a regex that checks that the string does not start with a protocol or a double slash (//).

Actually, I am doing web scraping with Beautiful Soup and I want to retrieve all relative links. Beautiful Soup uses this syntax:

soup.findAll(href=re.compile(REGEX_TO_MATCH_RELATIVE_URL))

So, that's why I need this.

Test cases are

about.html
tutorial1/
tutorial1/2.html
/
/experts/   
../ 
../experts/ 
../../../   
./  
./about.html

Thank you so much.

You're aware that all your test cases are relative paths? Maybe blend in some absolute paths too for some reasonable testing... — adrianus, Jul 15 '15 at 12:19
I am not very familiar with python but maybe this package can help you: https://docs.python.org/2/library/os.path.html — Verena Haunschmid, Jul 15 '15 at 12:20
And how is this question different from the link you provided? — adrianus, Jul 15 '15 at 12:22
maybe [urlparse](https://docs.python.org/2/library/urlparse.html) can help you. Check if some parts (url attributes, e.g. scheme, netloc) are empty — Pynchia, Jul 15 '15 at 12:23
My answer would be controversial probably, but I would avoid regex whenever possible if it's production code because 9 times out of 10, the next coder won't know regex or will need to stare at it too long to determine what is going on, especially if there are so many possibilities. Start with a comprehensive list of protocols and have one function check for recognized ones. Have another function check for UNC paths, Unix paths, and drive letters, and then have a list of test strings to shove through. That will be more readable than a regex, but regex is fine for use on the fly. — Palu Macil, Jul 15 '15 at 12:24
Palu Macil is right, have a look at a [sample regex](https://regex101.com/r/fW9eM4/1) - are you sure you want that? — Wiktor Stribiżew, Jul 15 '15 at 12:38
@stribizhev yup exactly i want this please do this as a answer :) — maq, Jul 15 '15 at 13:18
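
The urlparse suggestion from the comments above can be sketched roughly as follows. This is only a minimal sketch: it assumes Python 3, where the function lives in urllib.parse (the linked page documents the Python 2 urlparse module), and the helper name is_relative_url and the sample URLs are illustrative, not taken from the thread.

```python
from urllib.parse import urlparse


def is_relative_url(url: str) -> bool:
    """Hypothetical helper: True when the URL has no scheme and no network location."""
    parts = urlparse(url)
    return not parts.scheme and not parts.netloc


# Illustrative samples (assumed, not from the thread)
samples = [
    "about.html",
    "/experts/",
    "../../../",
    "./about.html",
    "http://example.com/a",
    "//cdn.example.com/lib.js",
    "mailto:someone@example.com",
]
for s in samples:
    print(f"{s!r}: {is_relative_url(s)}")
```

One difference from a prefix-based regex: a bare www.example.com has neither a scheme nor a leading //, so this check treats it as relative.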

Wiktor Stribiżew · Accepted Answer · 2015-07-15T16:00:24.620

Since you find it helpful, I am posting my suggestion.

The regular expression can be:

^(?!www\.|(?:http|ftp)s?://|[A-Za-z]:\\|//).*

See demo

Note that the pattern becomes more and more unreadable as you add exclusions or alternatives. So consider VERBOSE mode (enabled with re.X):

import re

p = re.compile(r"""^                    # At the start of the string, ...
                   (?!                  # check that the next characters are not...
                      www\.             # URLs starting with www.
                     |
                      (?:http|ftp)s?:// # URLs starting with http, https, ftp, ftps
                     |
                      [A-Za-z]:\\       # Local full paths starting with a drive letter, colon and backslash
                     |
                      //                # Protocol-relative URLs or UNC-style locations starting with //
                   )                    # End of the look-ahead check
                   .*                   # Match up to the end of the string""", re.X)
print(p.search("./about.html"))          # => There is a match
print(p.search("//dub-server1/mynode"))  # => No match (prints None)

See IDEONE demo
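
As a quick check, the one-line pattern can be run against the test cases from the question. This is a sketch only: the absolute samples, the constant name NOT_ABSOLUTE and the helper looks_relative are assumptions added for illustration.

```python
import re

# One-line pattern quoted from the answer above
NOT_ABSOLUTE = re.compile(r"^(?!www\.|(?:http|ftp)s?://|[A-Za-z]:\\|//).*")

# Test cases from the question
relative_cases = [
    "about.html", "tutorial1/", "tutorial1/2.html", "/", "/experts/",
    "../", "../experts/", "../../../", "./", "./about.html",
]
# Assumed absolute samples, added for contrast
absolute_cases = [
    "http://example.com/", "https://example.com/a.html", "ftp://example.com/file",
    "www.example.com", "//example.com/lib.js", "C:\\site\\index.html",
]


def looks_relative(href: str) -> bool:
    """Hypothetical helper: True when href starts with none of the excluded prefixes."""
    return NOT_ABSOLUTE.match(href) is not None


for href in relative_cases + absolute_cases:
    print(f"{href!r}: {looks_relative(href)}")
```

Because the pattern only excludes the listed prefixes, other schemes such as mailto: and upper-case variants such as HTTP:// still pass as relative unless they are added to the look-ahead. Beautiful Soup should accept the same compiled pattern as the href filter in the findAll call from the question, though that part is not exercised here.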

Washington Guedes's regexes from the other answer

^([a-z0-9]*:|.{0})\/\/.*$ - matches absolute URLs:
- ^ - beginning of the string
- ([a-z0-9]*:|.{0}) - 2 alternatives:
  - [a-z0-9]*: - 0 or more letters or digits followed by :
  - .{0} - an empty string
- \/\/.* - // and 0 or more characters other than a newline (note you do not need to escape / in Python)
- $ - end of string

So, you can rewrite it as ^(?:[a-z0-9]*:)?//.*$. The i flag should be used with this regex.
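
The claimed equivalence can be spot-checked with a few strings. This is a sketch under assumptions: the sample strings and the names ORIGINAL and SIMPLIFIED are illustrative, and only re.I is applied because each string is tested on its own.

```python
import re

# Both patterns are quoted from the text above; re.I stands in for the i flag
ORIGINAL = re.compile(r"^([a-z0-9]*:|.{0})\/\/.*$", re.I)
SIMPLIFIED = re.compile(r"^(?:[a-z0-9]*:)?//.*$", re.I)

# Assumed sample strings
samples = [
    "http://example.com", "HTTPS://EXAMPLE.COM/a", "//example.com/x", "://odd",
    "about.html", "/experts/", "../", "www.example.com", "ftp:/broken",
]

for s in samples:
    original_hit = bool(ORIGINAL.match(s))
    simplified_hit = bool(SIMPLIFIED.match(s))
    print(f"{s!r}: original={original_hit} simplified={simplified_hit}")
```

A handful of samples is not a proof, but it covers both alternatives of the original group (the scheme prefix and the empty string).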

^[^\/]+\/[^\/].*$|^\/[^\/].*$ - is not optimal and has 2 alternatives

Alternative 1:

^ - start of string
[^\/]+ - 1 or more characters other than /
\/ - Literal /
[^\/].*$ - a character other than / followed by any 0 or more characters other than a newline

Alternative 2:

^ - start of string
\/ - Literal /
[^\/].*$ - a symbol other than / followed by any 0 or more characters other than a newline up to the end of string.

It is clear that the whole regex can be shortened to ^[^/]*/[^/].*$. The i option can safely be removed from the regex flags.
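
To see what the shortened pattern accepts, it can be run next to the original one on the question's test cases. This is a minimal sketch: the two absolute samples at the end and the variable names are assumptions.

```python
import re

# Both patterns are quoted from the text above
ORIGINAL = re.compile(r"^[^\/]+\/[^\/].*$|^\/[^\/].*$")
SHORTENED = re.compile(r"^[^/]*/[^/].*$")

# Test cases from the question, plus two assumed absolute samples at the end
cases = [
    "about.html", "tutorial1/", "tutorial1/2.html", "/", "/experts/",
    "../", "../experts/", "../../../", "./", "./about.html",
    "http://example.com/", "//example.com/",
]

# Strings accepted by the shortened pattern
matched = [c for c in cases if SHORTENED.match(c)]
print(matched)

# Both patterns should agree on every case
agree = all(bool(ORIGINAL.match(c)) == bool(SHORTENED.match(c)) for c in cases)
print(agree)
```

Both patterns require a / followed by a non-slash character, so strings such as about.html, tutorial1/, /, ../ and ./ are not matched even though they appear in the question's test cases.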

Please check, and let me know if you need more assistance with it. — Wiktor Stribiżew, Jul 15 '15 at 13:46
thank you so much this is what i need but can u compare ur and @washington solution?? — maq, Jul 15 '15 at 15:27
Done, please look. I'd say his second regex is very generic. My approach is to check what is not allowed; his is to allow anything that resembles a relative URL, where just one `/` is required. You can actually merge these 2 approaches, BTW: just add my lookahead to the 2nd improved version: `^(?!www\.|(?:http|ftp)s?://|[A-Za-z]:\\|//)[^/]*/[^/].*$`. — Wiktor Stribiżew, Jul 15 '15 at 16:04
ohh thank u bt there is so many regex now :D which one i choose ? — maq, Jul 15 '15 at 16:57
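
One way to compare the candidates is to run the shortened pattern and the merged pattern from the comment above on the same strings. This is a sketch only: the sample strings and the names SHORTENED and MERGED are illustrative assumptions.

```python
import re

# Both patterns are quoted from this thread; the sample strings are made up
SHORTENED = re.compile(r"^[^/]*/[^/].*$")
MERGED = re.compile(r"^(?!www\.|(?:http|ftp)s?://|[A-Za-z]:\\|//)[^/]*/[^/].*$")

samples = ["docs/page.html", "../experts/", "www.example.com/page", "about.html"]
for s in samples:
    print(f"{s!r:26} shortened={bool(SHORTENED.match(s))} merged={bool(MERGED.match(s))}")
```

The look-ahead is what rejects www.example.com/page in the merged version. Neither pattern accepts a bare file name like about.html, so the exclusion-only pattern from the accepted answer may be the better fit when such links must be kept.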

score 2 · Answer 2 · 2015-07-15T13:55:28.963


To match absolutes:

/^([a-z0-9]*:|.{0})\/\/.*$/gmi

Live testing here.

And to match relatives:

/^[^\/]+\/[^\/].*$|^\/[^\/].*$/gmi

Live testing here.

edited Jul 15 '15 at 13:55

answered Jul 15 '15 at 13:34

brother i want to match relative urls ur is matching absolute urls – maq Jul 15 '15 at 13:35
thank you so much your `regex` is also working out of the box – maq Jul 15 '15 at 15:26

score 1 · Answer 3 · answered May 13 '20 at 02:27


I prefer this one, it captures more edge cases:

Source: https://www.regextester.com/94254


gijswijs
