Regex: validate a URL path with no query params

Question

I'm not a regex expert and I'm racking my brain over one that seems very simple and has to work in Python 2.7: validate the path of a URL (no hostname) without the query string. In other words, a string that starts with /, allows alphanumeric characters, and doesn't allow any special characters except these: /, ., -

I found this post that is very similar to what I need, but it isn't working for me at all: if I test it with, for example, aaa, it returns true even though the string doesn't start with /.

The current regex that I have kinda working is this one:

[^/+a-zA-Z0-9.-]

but it doesn't reject paths that don't start with /. For example:

/aaa -> true, this is ok
/aaa/bbb -> true, this is ok
/aaa?q=x -> false, this is ok
aaa -> true, this is NOT ok

score 6 · Accepted Answer · answered Oct 17 '12 at 07:08

The regex you've defined is a single character class, so it matches only one character. Instead, try:

^\/[/.a-zA-Z0-9-]+$
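
As a minimal sketch of how this pattern might be used from Python, the snippet below runs it against the four test strings from the question. The helper name `is_valid_path` is hypothetical and not part of the answer.

```python
import re

# Pattern from the answer: a leading "/", then one or more of
# "/", ".", "-", ASCII letters, or digits, up to the end of the string.
PATH_RE = re.compile(r'^\/[/.a-zA-Z0-9-]+$')


def is_valid_path(path):
    """Return True if `path` matches the pattern (hypothetical helper name)."""
    return PATH_RE.match(path) is not None


for candidate in ['/aaa', '/aaa/bbb', '/aaa?q=x', 'aaa']:
    print('%s -> %s' % (candidate, is_valid_path(candidate)))
```

Because of the `+`, at least one character is required after the leading slash, so a bare `/` is rejected. Changing `+` to `*` would accept the root path as well, if that is wanted.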

Andrew Cheong

The difference here is that you require the string to begin with `/` (`^` anchors the match at the start of the string) and that you've added the `+` quantifier and the end anchor `$`, so the character class has to match one or more times, all the way to the end of the string – Morten Jensen Oct 17 '12 at 07:10
`^\/[\/\.a-zA-Z0-9\-]+$`, the `/`, `.`, and `-` should be escaped. – Jon Oct 17 '12 at 08:15
@Jon - You are right about escaping `/` in many implementations, but in Python, it need not be escaped. `.` need not be escaped either, as long as it's within a character class, at least in any implementation I know of. And `-` need not be escaped if it is specified first or last in a character class. Of course, not that it hurts to escape "too much," but just wanted to clarify these details. – Andrew Cheong Oct 17 '12 at 08:22
You might want to use `\A` and `\Z` (Python's spelling of the end-of-string anchor that some other flavors write as `\z`) to match the beginning and end of a potentially multi-line string – localhostdotdev Apr 27 '19 at 00:27
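
To illustrate the anchor point from the last comment: in Python's `re` module, `$` also matches just before a single trailing newline, while `\Z` matches only at the very end of the string. A small sketch (the names `DOLLAR_RE` and `STRICT_RE` are made up for this example):

```python
import re

# Same character class, two different ways of anchoring the ends.
DOLLAR_RE = re.compile(r'^/[/.a-zA-Z0-9-]+$')
STRICT_RE = re.compile(r'\A/[/.a-zA-Z0-9-]+\Z')

with_newline = '/aaa\n'

# "$" tolerates one trailing newline, so this still counts as a match.
print(bool(DOLLAR_RE.match(with_newline)))
# "\Z" only matches at the true end of the string, so this does not.
print(bool(STRICT_RE.match(with_newline)))
```

If the path comes from user input and a trailing newline must be rejected, the `\A...\Z` form is the stricter choice.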

score 3 · Answer 2 · answered Oct 17 '12 at 07:26

In other words, a string that starts with /, allows alphanumeric values and doesn't allow any other special chars except these: /, ., -

You are missing some characters that are valid in URLs, for example ~:

import string
import urllib
import urlparse

# Allowed path characters: ASCII letters, digits, and / . - ~
# (string.ascii_letters is not locale-dependent, unlike string.letters)
valid_chars = set(string.ascii_letters + string.digits + '/.-~')
valid_paths = []

urls = ['http://www.my.uni.edu/info/matriculation/enroling.html',
        'http://info.my.org/AboutUs/Phonebook',
        'http://www.library.my.town.va.us/Catalogue/76523471236%2Fwen44--4.98',
        'http://www.my.org/462F4F2D4241522A314159265358979323846',
        'http://www.myu.edu/org/admin/people#andy',
        'http://www.w3.org/RDB/EMP?*%20where%20name%%3Ddobbins']

for url in urls:
    # Keep only the path (drops query string and fragment), then decode %XX escapes
    path = urllib.unquote(urlparse.urlparse(url).path)
    # startswith() is safe for an empty path, where path[0] would raise IndexError
    if path.startswith('/') and all(c in valid_chars for c in path):
        valid_paths.append(path)
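
The code above targets Python 2.7, as the question asks. For readers on Python 3, where `urlparse` and `unquote` live in `urllib.parse`, the same idea could be sketched roughly as follows. The function name `extract_valid_path` is an assumption made for this example, not something from the answer.

```python
import string
from urllib.parse import unquote, urlparse

# Same whitelist as in the answer: ASCII letters, digits, and / . - ~
VALID_CHARS = set(string.ascii_letters + string.digits + '/.-~')


def extract_valid_path(url):
    """Return the decoded path of `url` if it passes the check, else None."""
    # urlparse() separates out the query string and fragment;
    # unquote() then decodes %XX escapes in the path.
    path = unquote(urlparse(url).path)
    if path.startswith('/') and all(c in VALID_CHARS for c in path):
        return path
    return None


print(extract_valid_path('http://www.myu.edu/org/admin/people#andy'))
```

Note that decoding happens before validation, so an encoded slash (`%2F`) becomes a plain `/` and passes the check. Whether that is acceptable depends on the application.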

score 0 · Answer 3 · answered Oct 17 '12 at 07:07

Try this:

^(?:/[a-zA-Z0-9.-&&[^/]]*)+$
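
One caveat for Python users: the `&&[^/]` part is character-class intersection, which is Java regex syntax and is not supported by Python's `re` module. Assuming the intent is "one or more segments, each introduced by a slash", a Python-compatible sketch could simply leave `/` out of the class. The names `SEGMENTS_RE` and `is_segmented_path` are hypothetical.

```python
import re

# Hypothetical Python rewrite of the pattern above: each group is a "/"
# followed by zero or more letters, digits, dots, or dashes.
SEGMENTS_RE = re.compile(r'^(?:/[a-zA-Z0-9.-]*)+$')


def is_segmented_path(path):
    """Return True if `path` is made up only of slash-introduced segments."""
    return SEGMENTS_RE.match(path) is not None


for candidate in ['/aaa', '/aaa/bbb', '/', 'aaa', '/aaa?q=x']:
    print('%s -> %s' % (candidate, is_segmented_path(candidate)))
```

Unlike the accepted answer's pattern, this version also accepts a bare `/` and empty segments such as `//`, because each segment may be empty.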

Seems to work.

Gábor Lipták

score 0 · Answer 4 · answered Oct 17 '12 at 07:08

Try posting some more code. I can't figure out from your question how you're using your regex. What's confusing me is that your regular expression [^/+a-zA-Z0-9.-] basically says:

Match a single character if it is:

not a /, a +, a letter (a-z, upper or lower case), a digit (0-9), a dot, or a dash

It doesn't quite make sense to me without knowing how you use it, as it only matches a single character and not a whole URL string.
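
The snippet below sketches one way the expression might have been used, which would explain the results in the question. It assumes the asker searches for a disallowed character and treats "nothing found" as valid; that is a guess, since the question does not show the calling code, and the helper name `first_invalid_char` is made up.

```python
import re

# The asker's expression: matches ONE character that is not
# "/", "+", an ASCII letter, a digit, ".", or "-".
INVALID_CHAR_RE = re.compile(r'[^/+a-zA-Z0-9.-]')


def first_invalid_char(text):
    """Return the first disallowed character in `text`, or None if there is none."""
    match = INVALID_CHAR_RE.search(text)
    return match.group() if match else None


for candidate in ['/aaa', '/aaa?q=x', 'aaa']:
    print('%s -> %r' % (candidate, first_invalid_char(candidate)))
```

Under that assumption, `aaa` passes because it contains no disallowed character; the expression never checks where the string starts.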

I'm not sure I understand why you cannot start with a /.

Morten Jensen

Because it's a URL path. Right after the host, there must be a slash. Let's say you have var ajax_path='http://example.com'+unvalidated_path . If the path doesn't start with /, a malicious user could insert a leading ".", such as .externaldomain.com, and perform an XSS attack. – Maciej Krawczyk May 27 '16 at 12:11
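
The concatenation issue described in this comment can be shown with a short sketch. The attacker input `.externaldomain.com/steal` is a hypothetical value chosen for illustration.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7

base = 'http://example.com'
unvalidated_path = '.externaldomain.com/steal'  # hypothetical attacker input

# Without a leading "/", the "path" is absorbed into the hostname.
print(urlparse(base + unvalidated_path).netloc)
# With a leading "/", the host stays what the developer intended.
print(urlparse(base + '/safe/path').netloc)
```

Requiring the leading slash keeps whatever follows the host inside the path component.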
