
Possible Duplicate:
Regular expression for browser Url

Is this regex perfect for any URL?

```php
preg_match_all(
    '/([www]+(\.|dot))?[a-zA-Z0-9_\.-]+(\.|dot){1,}[com|net|org|info\.]+((\.|dot){0,}[a-zA-Z]){0,}+/i',
    $url, $regp);
```
ITGuru
    `[www]` is not what you think it is. Read about [character classes](http://www.regular-expressions.info/charclass.html) – Amarghosh Jul 08 '10 at 10:43
    Did you write that by yourself? And what do you mean by any URL? – Gumbo Jul 08 '10 at 10:43
  • `museum` is a valid top-level domain name like `com`, `net`, etc. – Amarghosh Jul 08 '10 at 10:44
  • Underscore `_` is not a valid character in domain names. – Amarghosh Jul 08 '10 at 10:47
  • `[a-z0-9.-]+` matches `-a...com.` among other things – Amarghosh Jul 08 '10 at 10:49
  • This regex is not even close to perfect; it's highly flawed. Look at my post for a valid regex. – atamanroman Jul 08 '10 at 10:57
  • What's up with the `{1,}` and `{0,}` quantifiers? And why are you using upper-case ranges with the `i` flag? It's just sheer nonsense. – SilentGhost Jul 08 '10 at 10:58
  • @SilentGhost: The flag for case insensitive matching does not just affect character classes but also literal characters. In this case `dot` will also match `dOt`, `DOT`, etc. – Gumbo Jul 08 '10 at 11:02
  • @Gumbo: yeah, I don't understand why there is `dot` in there to begin with. But surely `A-Z` in the character classes is excessive. – SilentGhost Jul 08 '10 at 11:03
  • I get the feeling this is for scraping pages where people have tried to slightly obscure the web address by typing `w w w dot wobble dot comm` rather than a well formed URL – Pete Kirkham Jul 08 '10 at 11:20

6 Answers


Don't use regex for that. If you can't resist, a valid one can be found here: What is the best regular expression to check if a string is a valid URL? But that regex is ridiculous. Try to use your framework for that if you can (the Uri class in .NET, for example).
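In PHP (which the question's `preg_match_all` suggests), a minimal sketch of the "use your framework" advice using the built-in filter extension; the sample URL is made up:

```php
<?php
// Sketch: let PHP's filter extension judge the URL instead of a custom regex.
// Note: FILTER_VALIDATE_URL requires a scheme, so 'www.example.com' alone
// fails while 'http://www.example.com' passes.
$url = 'http://www.example.com/path?q=1';

if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
    echo "looks like a valid URL\n";
} else {
    echo "not a valid URL\n";
}
```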

atamanroman

No. In fact it doesn't match URLs at all. It's trying to detect hostnames written in text, like www.example.com.

Its approach is to try to detect some common known TLDs, but:

[com|net|org|info\.]+

is actually a character group, allowing any sequence of characters from the list |.comnetrgif. Probably this was meant:

((com|net|org|info)\.)+

and also `[www]` is similarly wrong, plus the business with `dot` doesn't really make any sense.
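A quick PHP demonstration of the difference (the test strings are made up; the point is which single characters the class permits):

```php
<?php
// [com|net|org|info\.] is a character class: it matches any ONE of the
// characters c, o, m, |, n, e, t, r, g, i, f, '.' -- not the listed words.
var_dump(preg_match('/^[com|net|org|info\.]+$/', 'gift'));    // int(1): every letter is in the class
var_dump(preg_match('/^[com|net|org|info\.]+$/', 'net'));     // int(1)

// The suggested alternation behaves very differently:
var_dump(preg_match('/^((com|net|org|info)\.)+$/', 'gift.')); // int(0)
var_dump(preg_match('/^((com|net|org|info)\.)+$/', 'net.'));  // int(1)
```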

But this is in general a really bad idea. There are way more TLDs in common use than just those and the two-letter ccTLDs. Also, many/most of the ccTLDs don't have a second-level domain of com/net/org/info. This expression will fail to match those, and will match a bunch of other stuff that's not supposed to be a hostname.

In fact the task of detecting hostnames is basically impossible to do, since a single word can be a hostname, as can any dot-separated sequence of words. (And since internationalised domain names were introduced, almost anything can be a hostname, eg. 例え.テスト.)

bobince
  • An IP address is also a valid host: `http://127.0.0.1/` is a valid absolute URL. – Gumbo Jul 08 '10 at 10:59
  • ...not to mention IPv6 addresses! Trying to match hostnames/IP addresses in text is never going to be reliable. – bobince Jul 08 '10 at 11:02

'Any' URL is a tough call. In Oz you have .com.au; in the UK it is .co.uk. Each country has its own set of rules, and they can change. .xxx has just been approved. And non-ASCII characters have been approved now, but I suspect you don't need that.

I would wonder why you want validation that is this tight? Many URLs that are valid will be excluded, and it does not exclude all invalid ones: www.thisisnotavalidurl.com would still be accepted.

I would suggest A) using a looser check, just ([a-zA-Z0-9_.-].)*[a-zA-Z0-9_.-] (or something), as a sanity check, and B) using a reverse lookup to check whether the URL actually exists if you want to allow only real URLs; a sketch of both follows.
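A rough PHP sketch of both suggestions; the helper names and exact patterns here are illustrative assumptions, not a complete validator:

```php
<?php
// A) Loose sanity check: roughly "dot-separated labels", nothing stricter.
function looksLikeHostname(string $host): bool   // hypothetical helper
{
    return (bool) preg_match('/^([a-zA-Z0-9_-]+\.)*[a-zA-Z0-9_-]+$/', $host);
}

// B) DNS lookup: does the name actually resolve to an address record?
function hostExists(string $host): bool          // hypothetical helper
{
    return checkdnsrr($host, 'A') || checkdnsrr($host, 'AAAA');
}

$host = 'www.example.com';
if (looksLikeHostname($host) && hostExists($host)) {
    echo "$host passes both checks\n";
}
```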

Oh, and I find this: http://www.fileformat.info/tool/regex.htm to be a really useful tool when I am developing a regex, which I am not great at.

Jon
  • Can I have any .com URL that can bypass this regex? `preg_match_all( '/([www]+(\.|dot))?[a-zA-Z0-9_\.-]+(\.|dot){1,}[com|net|org|info\.]+((\.|dot){0,}[a-zA-Z]){0,}+/i', $url, $regp);` – ITGuru Jul 10 '10 at 07:22

[www]+ should be changed to (www)?

(\.|dot){1,} - one or more? Maybe you wanted ([a-zA-Z0-9_\.-]+(\.|dot)){1,}

Adam Lukaszczyk

A URL also has a protocol like http, which you're missing. You're also missing a lot of TLDs, as already mentioned.

Something like an escaped space (%20) would also not be recognized.

Port numbers can also appear in a URL (e.g. :80).
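For a sense of how many parts a full URL can carry beyond host and TLD, here is a quick look with PHP's parse_url (the sample URL is made up):

```php
<?php
// parse_url splits a URL into the pieces the question's regex ignores.
print_r(parse_url('http://user:pass@www.example.com:8080/path/page.php?q=a%20b#frag'));
// Array ( [scheme] => http   [host] => www.example.com  [port] => 8080
//         [user] => user     [pass] => pass             [path] => /path/page.php
//         [query] => q=a%20b [fragment] => frag )
```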

Mad Scientist
  • A URL can also be relative. Even an empty string is a valid URL. – Gumbo Jul 08 '10 at 10:50
  • Depending on how pedantic you want to be, a relative *URI* doesn't class as a *URL*. – bobince Jul 08 '10 at 10:58
  • @bobince: It all depends on what specifications your terms are derived from: RFC 1808 states *URL* to be the most common term of a resource locator while RFC 3986 uses the term *URI-reference*. – Gumbo Jul 08 '10 at 11:10

No, and you can't create a regex that will parse any URI (or URL or URN); the only way to parse them properly is to read them as per the spec, RFC 3986.
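For what it's worth, RFC 3986 itself (Appendix B) does include a regex, but only for splitting a URI reference into components, never for validating one; a PHP sketch:

```php
<?php
// The component-splitting regex straight from RFC 3986, Appendix B.
// It decomposes any string into scheme/authority/path/query/fragment,
// but it deliberately does NOT check that any of the pieces are valid.
$rfc3986 = '!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!';
preg_match($rfc3986, 'http://www.example.com/q?s=regex#frag', $m);
echo $m[2], "\n"; // scheme:    http
echo $m[4], "\n"; // authority: www.example.com
echo $m[5], "\n"; // path:      /q
echo $m[7], "\n"; // query:     s=regex
```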

nathan