
My regex successfully validates many URLs except http://www.google

Here's my URL validator in JSFiddle: http://jsfiddle.net/z23nZ/2/

It correctly validates the following URLs:

http://www.google.com gives True

www.google.com gives True

http://www.rootsweb.ancestry.com/~mopoc/links.htm gives True

http:// www. gives False

...but not this one:

http://www.google gives True

It's not correct to return true in this case. How can I validate that case?

Sreenath Plakkat
  • 6
    The fact of the matter is that `http://www.google` is a valid format for a URL (in fact, with the new custom TLD's coming out, it may even become a valid address, though more likely it'll just be `http://google`). If you want to check that it's a valid address, the only way of doing that is to try to access it and see if a server responds (or at least look it up in your DNS). – Vala Jul 10 '12 at 09:58
  • I'm not really keen to try to analyse such a long and complicated regex, but I will say it is more complicated than it needs to be with constructs like `[a-z]|\d|-|\.|_|~` instead of `[a-z\d._~-]` repeated throughout. (If you allowed upper-case you could simplify further.) – nnnnnn Jul 10 '12 at 10:02
  • Have a look at this SO post; it might help you: http://stackoverflow.com/questions/161738/what-is-the-best-regular-expression-to-check-if-a-string-is-a-valid-url – patel.milanb Jul 10 '12 at 10:02
  • @Thor84no Is it correct to use `http://www.hyundai` instead of `http://www.hyundai.com/in/en/main/`, my friend? – Sreenath Plakkat Jul 10 '12 at 10:10
  • 1
    Try this http://jsfiddle.net/z23nZ/4/ may help – Raghav Jul 10 '12 at 10:22
  • Hi @iNan, your fiddle works great, but it returns false for `http://google.com`, which is a valid format. I need it to return true for that; please help. – Sreenath Plakkat Jul 10 '12 at 10:25
  • @SreenathPlakkat I'm not saying it's an appropriate replacement, it won't work of course, but I'm saying it's a valid URL. A regex could determine whether or not you're following valid syntax, but it can't determine whether it's the right address or not. It's almost not worth determining whether or not it's a valid URL format however because the URL syntax is so loose even `a://a` would be a valid URL - you'd just be expected to implement a protocol called `a`. – Vala Jul 10 '12 at 10:30

1 Answer


I think you need to simplify this a lot. There are plenty of URL validation regexes out there, but as an exercise, I'll go through my thought process for constructing one.

  1. First, you need to match a protocol if there is one: `/((http|ftp)s?:\/\/)?`
  2. Then match any series of non-whitespace characters: `\S+`
  3. If you're trying to pick out URLs from text, you'll want to look for signs that it is a URL. Look for dots or slashes, then more non-whitespace: `[\.\/]\S*/`

Now put it all together:

/(((http|ftp)s?:\/\/)|(\S+[\.\/]))\S*[^\s\.]*/
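As a quick check, here is a minimal sketch that runs the combined regex against some of the inputs from the question. The helper name `looksLikeUrl` and the sample list are assumptions for illustration; the regex itself is unchanged. Note that the pattern is unanchored, so it reports whether the text contains something URL-like, not whether the whole string is a URL.

```javascript
// The regex from the answer, unchanged.
// It has no `g` flag, so test() keeps no state between calls.
const urlLike = /(((http|ftp)s?:\/\/)|(\S+[\.\/]))\S*[^\s\.]*/;

// Hypothetical helper: true if `text` contains something URL-like.
function looksLikeUrl(text) {
  return urlLike.test(text);
}

// Sample inputs taken from the question and the comments, plus a bare word.
const samples = [
  "http://www.google.com",
  "www.google.com",
  "http://www.google",
  "google",
  "a.",
];

for (const sample of samples) {
  console.log(`${sample} -> ${looksLikeUrl(sample)}`);
}
```

Only the bare word `google` is rejected here. `a.` is accepted, which is the loophole raised in the comments.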

I'm guessing you're trying to match www.google because of the new TLDs. The fact is, such URLs might just look like google, so any word could be a URL. A catch-all regex that matches valid URLs and nothing else isn't possible, so you're best off going with something simple like the above.
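To illustrate how loose URL syntax is, here is a small sketch that swaps the regex for a different technique: the built-in `URL` constructor (the WHATWG URL parser available in modern browsers and Node.js), used as a syntax-only check. The function name `isParsableUrl` is an assumption for illustration, and a successful parse says nothing about whether a server exists at that address.

```javascript
// Hypothetical helper: true if the WHATWG URL parser accepts `text`
// as an absolute URL. This checks syntax only, not reachability.
function isParsableUrl(text) {
  try {
    new URL(text); // throws a TypeError when the input cannot be parsed
    return true;
  } catch (err) {
    return false;
  }
}

console.log(isParsableUrl("http://www.google")); // parses: syntactically fine
console.log(isParsableUrl("a://a"));             // parses: unknown schemes are allowed
console.log(isParsableUrl("www.google.com"));    // rejected: no scheme, no base URL
```

Unlike the regex, this rejects scheme-less input such as `www.google.com`, so it suits cases where the protocol is required. Confirming that an address actually exists still needs a DNS lookup or a request.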

Edit: I've put a | between the protocol part and the non-whitespace-then-dot-or-slash part so that http://google matches, in case people choose to write new URLs like that.

Edit 2: See the comments for the next improvement. It makes sure google.com matches, http://google matches, and even google/ matches, but not `a.` (a word followed by a trailing full stop).

Nathan MacInnes
  • That's absolutely correct, my friend; we can't validate URLs that perfectly. – Sreenath Plakkat Jul 10 '12 at 11:43
  • 1
    I'm not sure I understand the thinking behind the `|`. The bar won't help matching `http://google`, at least not in the way you seem to be intending. The main result of that as far as I can see is you've got a short-circuit that will accept almost anything (according to that regex `a.` is valid for example). – Vala Jul 10 '12 at 11:52
  • 1
    @Thor84no, The idea of the `|` is that it'd accept `http://google`, or `google.com`, and the point of the `*` at the end (instead of a `+`) is that URLs can end in slashes. `a.` does slip through the net though. I actually meant to cater for that because I hate it when emails get parsed with a full-stop included as part of the hyperlink. I'll fix it. – Nathan MacInnes Jul 10 '12 at 13:54