
I wrote the following regex:

(https?:\/\/)?([da-z\.-]+)\.([a-z]{2,6})(\/(\w|-)*)*\/?

Its behaviour can be seen here: http://gskinner.com/RegExr/?34b8m

I wrote the following JavaScript code:

var urlexp = new RegExp(
    '^(https?:\/\/)?([da-z\.-]+)\.([a-z]{2,6})(\/(\w|-)*)*\/?$', 'gi'
);
document.write(urlexp.test("blaaa"))

And it returns true, even though the regex was not supposed to accept a single word as valid.

What am I doing wrong?

Martin Atkins
Marin
  • This is why I hate using the `new RegExp` Construct for Regular Expression initialization in JS. Every backslash has to be doubled. Try the exact same code but with `var urlexp = /^(https?:\/\/)?([da-z\.-]+)\.([a-z]{2,6})(\/(\w|-)*)*\/?$/gi` – FrankieTheKneeMan Mar 30 '13 at 08:53
  • Also, when using `new RegExp`, you don't have to escape your forward slashes - that's exclusively for when you're using `/regex/mod` notation (like you don't have to escape your single quotes in a double quoted string and vice versa), so `var urlexp = new RegExp('^(https?://)?([da-z.-]+)\\.([a-z]{2,6})(/(\\w|-)*)*/?$', 'gi');` will work as well. – FrankieTheKneeMan Mar 30 '13 at 08:56
  • possible duplicate of [Why this javascript regex doesn't work?](http://stackoverflow.com/questions/7427731/why-this-javascript-regex-doesnt-work) – Jerry Mar 31 '14 at 07:16

1 Answer


Your problem is that JavaScript is viewing all your escape sequences as escapes for the string. So your regex goes to memory looking like this:

^(https?://)?([da-z.-]+).([a-z]{2,6})(/(w|-)*)*/?$

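Notice the problem in the middle: what you thought was a literal period has turned into the regular expression wildcard, and \w has become a plain w. That is why "blaaa" passes: [da-z.-]+ matches "bl", the wildcard matches "a", and [a-z]{2,6} matches "aa". You can solve this in a couple of ways. One is to use the forward slash regular expression syntax JavaScript provides: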

var urlexp = /^(https?:\/\/)?([da-z\.-]+)\.([a-z]{2,6})(\/(\w|-)*)*\/?$/gi

Or by escaping your backslashes (and not your forward slashes, as you had been doing - that's exclusively for when you're using /regex/mod notation, just like you don't have to escape your single quotes in a double quoted string and vice versa):

var urlexp = new RegExp('^(https?://)?([da-z.-]+)\\.([a-z]{2,6})(/(\\w|-)*)*/?$', 'gi')

Please note the double backslash before the w - also necessary for matching word characters.
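
To see the difference for yourself, here is a minimal sketch that runs the three forms against the same input. The variable names are mine, and I've used only the i flag here, since with g the test method keeps state in lastIndex between calls:

```javascript
// Broken: inside a string literal, "\." collapses to ".", "\w" to "w" and "\/" to "/".
var brokenSource = '^(https?:\/\/)?([da-z\.-]+)\.([a-z]{2,6})(\/(\w|-)*)*\/?$';
// Fixed: every backslash meant for the regex engine is doubled.
var fixedSource = '^(https?://)?([da-z.-]+)\\.([a-z]{2,6})(/(\\w|-)*)*/?$';

var broken = new RegExp(brokenSource, 'i');
var fixed = new RegExp(fixedSource, 'i');
var literal = /^(https?:\/\/)?([da-z\.-]+)\.([a-z]{2,6})(\/(\w|-)*)*\/?$/i;

// The pattern text the regex engine actually receives in the broken case.
console.log(brokenSource);

// Same input against all three forms.
console.log('broken :', broken.test('blaaa'));
console.log('fixed  :', fixed.test('blaaa'));
console.log('literal:', literal.test('blaaa'));

// A URL with a real dot is still accepted by the fixed form.
console.log('fixed on URL:', fixed.test('http://example.com/some-page/'));
```

The first line printed is the string with the backslashes already gone, which is exactly what the constructor gets to work with.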

A couple notes on your regular expression itself:

[da-z.-]

d is contained in the a-z range, so it adds nothing there. Unless you meant \d? In that case, the backslash is important.
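
A quick sketch of what that backslash changes, assuming digits were the intent. The variable names and sample strings are made up for illustration:

```javascript
// "d" on its own is redundant here: it is already covered by a-z.
var withoutDigits = /^[da-z.-]+$/i;
// "\d" adds the digits 0-9 to the class.
var withDigits = /^[\da-z.-]+$/i;
// The same class built from a string needs the backslash doubled.
var withDigitsFromString = new RegExp('^[\\da-z.-]+$', 'i');

console.log(withoutDigits.test('dad'));         // letters only
console.log(withoutDigits.test('web2'));        // contains a digit
console.log(withDigits.test('web2'));
console.log(withDigitsFromString.test('web2'));
```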

(/(\w|-)*)*/?

My own misgivings about the nested Kleene stars aside, you can whittle that alternation down into a character class, and drop the terminating /? entirely, as a trailing slash will already be matched by the group as you've given it. I'd rewrite as:

(/[\w-]*)*

Though, maybe you'd just like to catch non space characters?

(/[^/\s]*)*
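
If you want to convince yourself that the character class version behaves like the original, here is a rough comparison over a few hand-picked paths. The sample list is arbitrary and is a spot check, not a proof. Each pattern is anchored with ^ and $ only so that it has to account for the whole sample:

```javascript
var original = /^(\/(\w|-)*)*\/?$/;   // the path part as written in the question
var charClass = /^(\/[\w-]*)*$/;      // alternation folded into a character class
var nonSpace = /^(\/[^\/\s]*)*$/;     // anything except "/" and whitespace

var samples = ['', '/', '/a', '/a/', '/a-b/c_d/', '/a.b', '/a b', 'a'];

var rows = samples.map(function (s) {
    return {
        sample: s,
        original: original.test(s),
        charClass: charClass.test(s),
        nonSpace: nonSpace.test(s)
    };
});

rows.forEach(function (r) {
    console.log(JSON.stringify(r));
});
```

On these samples the first two columns agree everywhere, while the non-space variant additionally lets through segments such as "/a.b".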

Anyway, modified this way your regular expression winds up looking more like:

^(https?://)?([\da-z.-]+)\.([a-z]{2,6})(/[\w-]*)*$
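
Written out in both notations, that final pattern could look like the sketch below. The test inputs are my own examples, and I've left off the g flag so that repeated test calls don't carry lastIndex over from one input to the next:

```javascript
// Native notation: forward slashes escaped, backslashes written once.
var urlLiteral = /^(https?:\/\/)?([\da-z.-]+)\.([a-z]{2,6})(\/[\w-]*)*$/i;

// String notation: forward slashes left alone, every backslash doubled.
var urlFromString = new RegExp(
    '^(https?://)?([\\da-z.-]+)\\.([a-z]{2,6})(/[\\w-]*)*$', 'i'
);

var inputs = [
    'blaaa',
    'example.com',
    'web2.example.org',
    'https://sub.example.co.uk/path/to-page/',
    'http://example.com/a b'
];

inputs.forEach(function (input) {
    console.log(input, urlLiteral.test(input), urlFromString.test(input));
});
```

Both columns should agree for every input, since the two notations describe the same pattern.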

Remember, if you're going to use string notation: Double EVERY backslash. If you're going to use native /regex/mod notation (which I highly recommend), escape your forward slashes.

FrankieTheKneeMan
  • Very detailed, thanks for both the clarification and additional suggestions! – Marin Mar 30 '13 at 09:20
  • @Marin You could also use this Regex escape method here to escape all the backslashes in a string: http://stackoverflow.com/a/2593661/1726343 – Asad Saeeduddin Mar 30 '13 at 09:37
  • @asad - that will turn any string into a regular expression that matches the literal string passed in - not exactly called for in this situation. – FrankieTheKneeMan Mar 30 '13 at 09:40
  • @FrankieTheKneeMan No. It will add backslashes to all regex metacharacters in the string, then return the modified string. Take a closer look. – Asad Saeeduddin Mar 30 '13 at 09:43
  • @asad Yes, I'm aware. But making a regular expression out of that string using new RegExp will cause the string to match the exact string passed into the function. So `new RegExp(RegExp.quote('[a-z]'))` will create the regular expression `/\[a-z\]/`, which does not match `'a'` or `'b'`, only `'[a-z]'`. That would be a pretty big problem in this situation. – FrankieTheKneeMan Mar 30 '13 at 09:47
  • @FrankieTheKneeMan Ah, I see what you mean. I misunderstood what the OP required. I guess that could still work if the regex in the function was reduced to `/([\\])/` – Asad Saeeduddin Mar 30 '13 at 09:52
  • @asad Not really. The problem was that backslashes weren't surviving into the in memory string representation, and as such the meaning was being changed. That's the whole problem with backslashes and string representations. So `new RegExp(RegExp.quote('[a\-z]'))` would (with your new regex) generate the regular expression `/[a-z]/`, because the string that the function saw would look like `[a-z]`, not containing a backslash at all. `new RegExp(RegExp.quote('[a\\-z]'))` would send a string that looked like `[a\-z]`, but generate the regular expression `/[a\\-z]/`, which is dangerously wrong. – FrankieTheKneeMan Mar 30 '13 at 09:58
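
For anyone following the comment thread, here is a small sketch of the point being argued. The quote function is a hypothetical stand-in for the RegExp.quote helper from the linked answer (RegExp.quote is not built into JavaScript), written here on the assumption that it backslash-escapes every regex metacharacter:

```javascript
// Hypothetical stand-in for the RegExp.quote helper discussed above:
// it puts a backslash in front of every regex metacharacter.
function quote(str) {
    return String(str).replace(/[.?*+^$[\]\\(){}|-]/g, '\\$&');
}

// Quoting turns the pattern into one that matches only the literal text.
var quoted = new RegExp(quote('[a-z]'));
console.log(quoted.test('a'));       // the character class meaning is gone
console.log(quoted.test('[a-z]'));   // only the literal text matches

// The backslash is lost before any helper can see it:
console.log('[a\-z]' === '[a-z]');   // same five-character string
console.log('[a\\-z]'.length);       // six characters, one of them a backslash
```

So a helper that runs on the string can't bring back a backslash that the string literal already dropped.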