8

I'm not a regex expert and I'm breaking my head trying to do one that seems very simple and works in python 2.7: validate the path of an URL (no hostname) without the query string. In other words, a string that starts with /, allows alphanumeric values and doesn't allow any other special chars except these: /, ., -

I found this post that is very similar to what I need but for me isn't working at all, I can test with for example aaa and it will return true even if it doesn't start with /.

The current regex that I have kinda working is this one:

[^/+a-zA-Z0-9.-]

but it doesn't work with paths that don't start with /. For example:

  • /aaa -> true, this is ok
  • /aaa/bbb -> true, this is ok
  • /aaa?q=x -> false, this is ok
  • aaa -> true, this is NOT ok
Community
  • 1
  • 1
Juancho
  • 629
  • 7
  • 17

4 Answers4

6

The regex you've defined is a character class. Instead, try:

^\/[/.a-zA-Z0-9-]+$
Andrew Cheong
  • 29,362
  • 15
  • 90
  • 145
  • The difference here being, that you require the string to begin with `/` (`^` means begin-of-line) and that you've added the `+` and end-of-line anchor `$`. Which makes the whole character-class capture optional – Morten Jensen Oct 17 '12 at 07:10
  • `^\/[\/\.a-zA-Z0-9\-]+$`, the `/`, `.`, and `-` should be escaped. – Jon Oct 17 '12 at 08:15
  • 1
    @Jon - You are right about escaping ``/`` in many implementations, but in Python, it need not be escaped. ``.`` need not be escaped either, as long as it's within a character class, at least in any implementation I know of. And ``-`` need not be escaped if it is specified first or last in a character class. Of course, not that it hurts to escape "too much," but just wanted to clarify these details. – Andrew Cheong Oct 17 '12 at 08:22
  • you might want to use `\A` and `\z` to match the beginning and end of a potentially multi-line string – localhostdotdev Apr 27 '19 at 00:27
3

In other words, a string that starts with /, allows alphanumeric values and doesn't allow any other special chars except these: /, ., -

You are missing some characters that are valid in URLs

import string
import urllib
import urlparse

valid_chars = string.letters + string.digits + '/.-~'
valid_paths = []

urls = ['http://www.my.uni.edu/info/matriculation/enroling.html',
    'http://info.my.org/AboutUs/Phonebook',
    'http://www.library.my.town.va.us/Catalogue/76523471236%2Fwen44--4.98',
    'http://www.my.org/462F4F2D4241522A314159265358979323846',
        'http://www.myu.edu/org/admin/people#andy',
        'http://www.w3.org/RDB/EMP?*%20where%20name%%3Ddobbins']

for i in urls:
   path = urllib.unquote(urlparse.urlparse(i).path)
   if path[0] == '/' and len([i for i in path if i in valid_chars]) == len(path):
        valid_paths.append(path)
Burhan Khalid
  • 169,990
  • 18
  • 245
  • 284
0

Try this:

^(?:/[a-zA-Z0-9.-&&[^/]]*)+$

Seems to work. See the picture: enter image description here

Gábor Lipták
  • 9,646
  • 2
  • 59
  • 113
0

Try posting some more code. I can't figure out how you're using your regex from your question. What's confusing me is, your re expression [^/+a-zA-Z0-9.-] basically says:

Match a single character if it is:

not a / or a-z (caps and lower both) or 0-9 or a dot or a dash

It doesn't quite make sense to me without knowing how you use it, as it only matches a single charactre and not a whole URL string.

I'm not sure I understand why you cannot start with a /.

Morten Jensen
  • 5,818
  • 3
  • 43
  • 55
  • Because it's an url path. Right after host, there must be a slash. Let's say you have var ajax_path='http://example.com'+unvalidated_path . If the path doesn't start with /, malicious user could insert ".", like .externaldomain.com and perform XSS attack. – Maciej Krawczyk May 27 '16 at 12:11