
All these years, I've used this regex in JavaScript as well as PHP to check for a valid domain name.

Original Version

/^((http|https):\/{2})([w]{3})([\.]{1})([a-zA-Z0-9-]{2,63})([\.]{1})((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|(c[acdfghiklmnorsuvxyz]|cat|co.in|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|(m[acdghklmnopqrstuvwxyz]|mil|mobi|museum)|(n[acefgilopruz]|name|net)|(om|org)|(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw])$/i

Changed broken version

I added the last part so it could accept and validate what comes after the .com. But I found out that it somehow breaks the whole thing and anything gets in. How do I get this correct?

/^((http|https):\/{2})([w]{3})([\.]{1})([a-zA-Z0-9-]{2,63})([\.]{1})((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|(c[acdfghiklmnorsuvxyz]|cat|co.in|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|(m[acdghklmnopqrstuvwxyz]|mil|mobi|museum)|(n[acefgilopruz]|name|net)|(om|org)|(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw])([-A-Za-z0-9+&@#\/%=~_|:.]{0,51})$/i

The RegEx works fine. It's only the last part I added that seems to be causing problems ([-A-Za-z0-9+&@#\/%=~_|:.]{0,51})

What I'm trying to do here is validate the part after the .com. For example, the part after the .com for this question is questions/20217720/regex-to-check-for-validity-of-whats-after-the-com. That's the part I'm trying to validate. But now the TLDs no longer validate.

Example: http://www.example.com should validate to true

http://www.example.com/ should also validate to true

http://www.example.com/mail should validate to true

http://www.example.comxx should validate to false

http://www.example.comxx/mail should validate to false

Norman
  • Do you need the regex for javascript or php? – Thew Nov 26 '13 at 12:59
  • See http://stackoverflow.com/questions/106179/regular-expression-to-match-hostname-or-ip-address – Denys Séguret Nov 26 '13 at 13:00
  • @Thew I use it in both. What I've posted here works in javascript (inside a plugin) as well as php. – Norman Nov 26 '13 at 13:01
  • What could be after .com which is a TLD? – user1021726 Nov 26 '13 at 13:03
  • After the .com? See this pages URL. `.com/questions/20217720/regex-to-check-for-validity-of-whats-after-the-com` – Norman Nov 26 '13 at 13:04
  • point taken. I was focused on it being a part of the domain. But since it is not a part of the domain you should be able to just get the path of the file. In JavaScript it would be window.location.pathname. – user1021726 Nov 26 '13 at 13:08
  • @dystroy, that does not answer this question at all. Norman, IMO to validate anything after the `.com` all you can really do is check that each of the characters is in a legal character set. If a website owner chooses to, he can have some hideously invalid looking URLs that will still direct you to a page. – OGHaza Nov 26 '13 at 13:09
  • @OGHaza Exactly. There can only be an arbitrary validation which Norman has to decide and post, as the pathname could be whatever the host wants it to be (as long as it's a valid filename on the host server) – user1021726 Nov 26 '13 at 13:10
  • It doesn't even need to be a valid filename though, it could be part of a `Rewrite`. – OGHaza Nov 26 '13 at 13:11
  • @Norman What's the intended use? What's a "valid" path/filename according to you? – user1021726 Nov 26 '13 at 13:12
  • See my edit. After I added the part I mentioned there, the tlds will not validate. Anything typed is accepted. – Norman Nov 26 '13 at 13:13
  • @user1021726 Just this much is ok `([-A-Za-z0-9+&@#\/%=~_|:.]{0,51})` A to Z, numbers, allowed special chars and a max length of 51. – Norman Nov 26 '13 at 13:13
  • @Norman but why are you trying to validate that? What's after the TLD is a path/filename. Why do you need to validate that? do you have any special rules or special cases? – user1021726 Nov 26 '13 at 13:14
  • @user1021726 Right, I don't want more than 51 characters to get through after the .com. It's simple. I almost have it correct. I just cannot understand why the TLD part will not validate after I add my part. E.g. if I type .comxxx, it'll return true. – Norman Nov 26 '13 at 13:20
  • @Norman : you forgot to add a slash `/` after the TLD and before the `([-A-Za-z0-9+&@#\/%=~_|:.]{0,51})` part – foibs Nov 26 '13 at 13:23
  • If the only rule you have is that it can't be more than 51 characters, I'd suggest just checking the .length on the pathname. There's no need to use regex for that. – user1021726 Nov 26 '13 at 13:23
  • @foibs, his character class matches `/` – OGHaza Nov 26 '13 at 13:24
  • @OGHaza: it does, but he wants a domain validation, so he needs to match exactly the TLD before the slash, and then continue with the rest of the characters. – foibs Nov 26 '13 at 13:28
  • @Norman [I don't see the problem](http://regexr.com?37cde) – OGHaza Nov 26 '13 at 13:29
  • http://www.google.com/mail -> True. http://www.google.comxx/mail -> Still validates as true when it should have been false – Norman Nov 26 '13 at 13:30
  • Foibs is right, you need to match a `/` after the TLD to stop that. – OGHaza Nov 26 '13 at 13:34
  • The `/` after the .com may or may not be there. If there's nothing after the `.com`, a user need not add the `/`. Right now, without the `/`, it'll show false, which is why I did the above in the first place :) – Norman Nov 26 '13 at 13:38
  • so just put it in the last rule parentheses and make it optional with a `?` – foibs Nov 26 '13 at 13:42

3 Answers


Does this fit your needs?

(\/[-A-Za-z0-9+&@#\/%=~_|:.]{0,50})?

The whole group is optional, but if anything appears after the TLD then it requires a / to be the first character (reduced 51 to 50 to compensate).

The full regex:

/^((http|https):\/{2})([w]{3})([\.]{1})([a-zA-Z0-9-]{2,63})([\.]{1})((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|(c[acdfghiklmnorsuvxyz]|cat|co.in|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|(m[acdghklmnopqrstuvwxyz]|mil|mobi|museum)|(n[acefgilopruz]|name|net)|(om|org)|(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw])(\/[-A-Za-z0-9+&@#\/%=~_|:.]{0,50})?$/i

RegExr Example
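To see the difference, here is a small sketch that runs the five URLs from the question through both tails. The names `head`, `broken`, `fixed` and `samples` are only for illustration; the pattern text itself is copied from above. In the question's version the TLD list can stop at the two-letter `co` (matched by `c[acdfghiklmnorsuvxyz]`), and the trailing character class then swallows `mxx`, which is why `.comxx` gets through. Requiring a `/` as the first character closes that gap.

```javascript
// Shared prefix copied from the regexes above: scheme, "www", ".", domain label, ".", TLD list.
const head = String.raw`^((http|https):\/{2})([w]{3})([\.]{1})([a-zA-Z0-9-]{2,63})([\.]{1})((a[cdefgilmnoqrstuwxz]|aero|arpa)|(b[abdefghijmnorstvwyz]|biz)|(c[acdfghiklmnorsuvxyz]|cat|co.in|com|coop)|d[ejkmoz]|(e[ceghrstu]|edu)|f[ijkmor]|(g[abdefghilmnpqrstuwy]|gov)|h[kmnrtu]|(i[delmnoqrst]|info|int)|(j[emop]|jobs)|k[eghimnprwyz]|l[abcikrstuvy]|(m[acdghklmnopqrstuvwxyz]|mil|mobi|museum)|(n[acefgilopruz]|name|net)|(om|org)|(p[aefghklmnrstwy]|pro)|qa|r[eouw]|s[abcdeghijklmnortvyz]|(t[cdfghjklmnoprtvwz]|travel)|u[agkmsyz]|v[aceginu]|w[fs]|y[etu]|z[amw])`;

// Tail from the question: any allowed character may directly follow the TLD.
const broken = new RegExp(head + String.raw`([-A-Za-z0-9+&@#\/%=~_|:.]{0,51})$`, "i");

// Tail from this answer: the group is optional, but it must start with "/".
const fixed = new RegExp(head + String.raw`(\/[-A-Za-z0-9+&@#\/%=~_|:.]{0,50})?$`, "i");

// The five URLs listed in the question.
const samples = [
  "http://www.example.com",
  "http://www.example.com/",
  "http://www.example.com/mail",
  "http://www.example.comxx",
  "http://www.example.comxx/mail",
];

const results = samples.map((url) => ({
  url,
  broken: broken.test(url),
  fixed: fixed.test(url),
}));

// broken: true for all five; fixed: true, true, true, false, false
console.table(results);
```

Note that `co.in` in the TLD list has an unescaped dot, so it matches any character in that position; `co\.in` is probably what was intended.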

OGHaza
  • After a quick test, it does. But let me just test some more, and we'll find more loop holes. – Norman Nov 26 '13 at 13:46

For PHP, you could use parse_url (documentation) as an alternative.

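<?php
    $info = parse_url($url);

    // explode() result goes into a variable first: end() takes its argument by reference
    $labels = explode('.', $info['host'] ?? '');

    // is .com domain
    if (end($labels) == "com") {
        // 'path' and 'query' are only set when the URL actually contains them
        $behinddotcom = ($info['path'] ?? '') . (isset($info['query']) ? '?' . $info['query'] : '');
    }
?>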
Thew

What comes after the TLD is a path/filename. Unless you have any special cases or rules to adhere to, there is no need to validate this.

If you just need to extract it, this is a simple matter. In JavaScript, for example, you would do:

window.location.pathname // returns "/questions/20217720/regex-to-check-for-validity-of-whats-after-the-com"
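
`window.location` only describes the page you are currently on. For a URL held in a string, one option is the standard `URL` class, available in browsers and Node.js. The sketch below assumes the only rule is the 51-character limit mentioned in the comments; the function name `restIsShortEnough` and its `maxLength` parameter are made up for illustration, and it does not check the TLD at all.

```javascript
// Hypothetical helper: checks only the length of whatever follows the host.
// The default of 51 is the limit mentioned in the comments on the question.
function restIsShortEnough(urlString, maxLength = 51) {
  let url;
  try {
    url = new URL(urlString); // throws a TypeError if the string cannot be parsed
  } catch (err) {
    return false;
  }
  // Path, query string and fragment together; pathname is "/" when no path was given.
  const rest = url.pathname + url.search + url.hash;
  return rest.length <= maxLength;
}

console.log(restIsShortEnough("http://www.example.com/mail")); // true
console.log(restIsShortEnough("http://www.example.com/" + "a".repeat(60))); // false
```

The counted string includes the leading `/`, so adjust the limit if that character should not count.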
user1021726