
Possible Duplicate:
Regular expression for browser Url

Is this regex perfect for any URL?

```php
preg_match_all(
    '/([www]+(\.|dot))?[a-zA-Z0-9_\.-]+(\.|dot){1,}[com|net|org|info\.]+((\.|dot){0,}[a-zA-Z]){0,}+/i',
    $url, $regp);
```
ITGuru
    `[www]` is not what you think it is. Read about [character classes](http://www.regular-expressions.info/charclass.html) – Amarghosh Jul 08 '10 at 10:43
    Did you write that by yourself? And what do you mean by any URL? – Gumbo Jul 08 '10 at 10:43
  • `museum` is a valid top-level domain name like `com`, `net`, etc. – Amarghosh Jul 08 '10 at 10:44
  • Underscore `_` is not a valid character in domain names. – Amarghosh Jul 08 '10 at 10:47
  • `[a-z0-9.-]+` matches `-a...com.` among other things – Amarghosh Jul 08 '10 at 10:49
  • This regex is not even close to perfect; it's highly flawed. Look at my post for a valid regex. – atamanroman Jul 08 '10 at 10:57
  • What's up with the `{1,}` and `{0,}` quantifiers? And why are you using upper-case ranges with the `i` flag? It's just sheer nonsense. – SilentGhost Jul 08 '10 at 10:58
  • @SilentGhost: The flag for case insensitive matching does not just affect character classes but also literal characters. In this case `dot` will also match `dOt`, `DOT`, etc. – Gumbo Jul 08 '10 at 11:02
  • @Gumbo: yeah, I don't understand why there is `dot` in there to begin with. But surely `A-Z` in the character classes is excessive. – SilentGhost Jul 08 '10 at 11:03
  • I get the feeling this is for scraping pages where people have tried to slightly obscure the web address by typing `w w w dot wobble dot comm` rather than a well formed URL – Pete Kirkham Jul 08 '10 at 11:20

6 Answers


Don't use regex for that. If you can't resist, a valid one can be found here: What is the best regular expression to check if a string is a valid URL? But that regex is ridiculous. Try to use your framework for that if you can (the Uri class in .NET, for example).
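In PHP (which the question's `preg_match_all` suggests), a minimal sketch of the "use your framework" advice using the built-in filter extension; the sample URL is made up:

```php
<?php
// Sketch: let PHP's filter extension judge the URL instead of a custom regex.
// Note: FILTER_VALIDATE_URL requires a scheme, so 'www.example.com' alone
// fails while 'http://www.example.com' passes.
$url = 'http://www.example.com/path?q=1';

if (filter_var($url, FILTER_VALIDATE_URL) !== false) {
    echo "looks like a valid URL\n";
} else {
    echo "not a valid URL\n";
}
```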

atamanroman

No. In fact it doesn't match URLs at all. It's trying to detect hostnames written in text, like www.example.com.

Its approach is to try to detect some common known TLDs, but:

[com|net|org|info\.]+

is actually a character group, allowing any sequence of characters from the list |.comnetrgif. Probably this was meant:

((com|net|org|info)\.)+

and also `[www]` is similarly wrong, plus the business with `dot` doesn't really make any sense.
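A quick PHP demonstration of the difference (the test strings are made up; the point is which single characters the class permits):

```php
<?php
// [com|net|org|info\.] is a character class: it matches any ONE of the
// characters c, o, m, |, n, e, t, r, g, i, f, '.' -- not the listed words.
var_dump(preg_match('/^[com|net|org|info\.]+$/', 'gift'));    // int(1): every letter is in the class
var_dump(preg_match('/^[com|net|org|info\.]+$/', 'net'));     // int(1)

// The suggested alternation behaves very differently:
var_dump(preg_match('/^((com|net|org|info)\.)+$/', 'gift.')); // int(0)
var_dump(preg_match('/^((com|net|org|info)\.)+$/', 'net.'));  // int(1)
```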

But this is in general a really bad idea. There are way more TLDs in common use than just those and the two-letter ccTLDs. Also, many/most of the ccTLDs don't have a second-level domain of com/net/org/info. This expression will fail to match those, and will match a bunch of other stuff that's not supposed to be a hostname.

In fact the task of detecting hostnames is basically impossible to do, since a single word can be a hostname, as can any dot-separated sequence of words. (And since internationalised domain names were introduced, almost anything can be a hostname, eg. 例え.テスト.)

bobince
  • An IP address is also a valid host: `http://127.0.0.1/` is a valid absolute URL. – Gumbo Jul 08 '10 at 10:59
  • ...not to mention IPv6 addresses! Trying to match hostnames/IP addresses in text is never going to be reliable. – bobince Jul 08 '10 at 11:02

'Any' URL is a tough call. In Oz you have .com.au; in the UK it is .co.uk. Each country has its own set of rules, and they can change. .xxx has just been approved. And non-ASCII characters have been approved now, but I suspect you don't need that.

I would wonder why you want validation that is this tight? Many URLs that are valid will be excluded, and it does not exclude all invalid ones: www.thisisnotavalidurl.com would still be accepted.

I would suggest A) using a looser check, just ([a-zA-Z0-9_.-].)*[a-zA-Z0-9_.-] (or something), as a sanity check, and B) using a reverse lookup to check whether the URL actually exists if you want to allow only real URLs; a sketch of both follows.
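A rough PHP sketch of both suggestions; the helper names and exact patterns here are illustrative assumptions, not a complete validator:

```php
<?php
// A) Loose sanity check: roughly "dot-separated labels", nothing stricter.
function looksLikeHostname(string $host): bool   // hypothetical helper
{
    return (bool) preg_match('/^([a-zA-Z0-9_-]+\.)*[a-zA-Z0-9_-]+$/', $host);
}

// B) DNS lookup: does the name actually resolve to an address record?
function hostExists(string $host): bool          // hypothetical helper
{
    return checkdnsrr($host, 'A') || checkdnsrr($host, 'AAAA');
}

$host = 'www.example.com';
if (looksLikeHostname($host) && hostExists($host)) {
    echo "$host passes both checks\n";
}
```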

Oh, and I find this: http://www.fileformat.info/tool/regex.htm to be a really useful tool when I am developing a regex, which I am not great at.

Jon
  • Can I have any .com URL that can bypass this regex? `preg_match_all( '/([www]+(\.|dot))?[a-zA-Z0-9_\.-]+(\.|dot){1,}[com|net|org|info\.]+((\.|dot){0,}[a-zA-Z]){0,}+/i', $url, $regp);` – ITGuru Jul 10 '10 at 07:22

[www]+ should be changed to (www)?

(\.|dot){1,} - one or more? Maybe you wanted ([a-zA-Z0-9_\.-]+(\.|dot)){1,}

Adam Lukaszczyk

A URL also has a protocol like http, which you're missing. You're also missing a lot of TLDs, as already mentioned.

Something like an escaped space (%20) would also not be recognized.

Port numbers can also appear in a URL (e.g. :80).
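For a sense of how many parts a full URL can carry beyond host and TLD, here is a quick look with PHP's parse_url (the sample URL is made up):

```php
<?php
// parse_url splits a URL into the pieces the question's regex ignores.
print_r(parse_url('http://user:pass@www.example.com:8080/path/page.php?q=a%20b#frag'));
// Array ( [scheme] => http   [host] => www.example.com  [port] => 8080
//         [user] => user     [pass] => pass             [path] => /path/page.php
//         [query] => q=a%20b [fragment] => frag )
```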

Mad Scientist
  • A URL can also be relative. Even an empty string is a valid URL. – Gumbo Jul 08 '10 at 10:50
  • Depending on how pedantic you want to be, a relative *URI* doesn't class as a *URL*. – bobince Jul 08 '10 at 10:58
  • @bobince: It all depends on what specifications your terms are derived from: RFC 1808 states *URL* to be the most common term of a resource locator while RFC 3986 uses the term *URI-reference*. – Gumbo Jul 08 '10 at 11:10

No, and you can't create a regex that will parse any URI (or URL or URN); the only way to parse them properly is to read them as per the spec, RFC 3986.
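For what it's worth, RFC 3986 itself (Appendix B) does include a regex, but only for splitting a URI reference into components, never for validating one; a PHP sketch:

```php
<?php
// The component-splitting regex straight from RFC 3986, Appendix B.
// It decomposes any string into scheme/authority/path/query/fragment,
// but it deliberately does NOT check that any of the pieces are valid.
$rfc3986 = '!^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?!';
preg_match($rfc3986, 'http://www.example.com/q?s=regex#frag', $m);
echo $m[2], "\n"; // scheme:    http
echo $m[4], "\n"; // authority: www.example.com
echo $m[5], "\n"; // path:      /q
echo $m[7], "\n"; // query:     s=regex
```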

nathan