2
  • www.example.com
  • foo.example.com
  • foo.example.co.uk
  • foo.bar.example.com
  • foo.bar.example.co.uk

I have these URLs and always want to end up with two variables:

$domainName = "example"
$domainNameSuffix = ".com" OR ".co.uk"

If someone could get me from $url being one of the URLs all the way down to $newUrl being something like "example.co.uk", it would be a blessing.

Note that the URLs are going to be completely "random"; we might end up with "foo.bar.example2.com.au" too, so ... you know... ugh. (Asking for the impossible?)

Cheers,

Dirk v B
  • The title is a bit misleading here. You are parsing domain names, not URLs from what it looks like. Basically, this comes down to looking for a database of TLDs and their associated secondary levels for country codes like uk and au. There's no way to solve this problem without such information. – Matthew Mar 15 '11 at 23:51
  • So here is a duplicate: http://stackoverflow.com/questions/4963202/domain-regex-split - you may want to look at RobertPitt's solution as an alternative. As said, it can only be done on a best-guess basis. You can't even get reliable results with TLD probing à la `dig +all co.uk` – mario Mar 15 '11 at 23:54

5 Answers

3

We had a few questions like this before, but I can't find a good one right now either. The crux is, this cannot be done reliably. You would need a long list of special TLDs (like .uk and .au) which have their own .com/.net level.

But as a general approach and simple solution you could use:

// Captures "<name>.<suffix>"; a trailing .au or .uk makes the suffix two labels long.
if (preg_match('#([\w-]+)\.(\w+(\.(au|uk))?)\.?$#i', $domain, $m)) {
    list(, $domain, $suffix) = $m;
}
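
To see how the pattern above behaves, here is a minimal sketch that wraps it in a function and runs it against a few hostnames from the question. The name `splitDomain` is made up for this illustration, and returning `null` on a failed match is an assumption about how you might want to handle that case.

```php
<?php
// Hypothetical helper wrapping the regex from this answer.
// Returns [name, suffix], or null when the host has no dot-separated name.
function splitDomain(string $host): ?array
{
    if (!preg_match('#([\w-]+)\.(\w+(\.(au|uk))?)\.?$#i', $host, $m)) {
        return null;
    }
    return [$m[1], $m[2]];
}

$hosts = ['www.example.com', 'foo.example.co.uk', 'foo.bar.example.co.uk', 'www.nic.uk'];
foreach ($hosts as $host) {
    // The first three give "example" plus "com" or "co.uk";
    // www.nic.uk gives "www" plus "nic.uk", because any label before .uk is treated as part of the suffix.
    var_dump(splitDomain($host));
}
```

The last host shows the limit of the best-guess approach: without a list of real second-level suffixes, the pattern cannot tell `nic.uk` from `co.uk`.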
mario
  • Yeh, it surprised me that there wasn't much to be found about this issue - as a relative noob to php (javascript, css & html are my weapons of choice) it seemed rather elementary. .edit: thanks for the reply. Not enough credit yet for an upvote though. 'scuse me. – Dirk v B Mar 15 '11 at 23:44
  • 1
    It will mess up on something like http://www.nic.uk/. You might actually have to maintain the complete list of valid secondary level domains for something like uk. – Matthew Mar 15 '11 at 23:46
  • This is nice and easy, so +1. I'm probably missing something, but do you need that last optional `.` (the `\.?`)? – alex Mar 15 '11 at 23:52
  • @myself, I suppose one could argue that www is the domain and nic.uk is the TLD. Really depends on the context on how correct it is. – Matthew Mar 15 '11 at 23:53
  • @konforce I would even ignore that as special case, or blacklist it (?!nic), but an explicit list `(\w+|co.uk|net.uk|com.au|org.au)` would indeed be most reliable. – mario Mar 15 '11 at 23:55
  • @alex: No, that's not really needed. The trailing root `.` is valid, but an even more unlikely edge case. – mario Mar 15 '11 at 23:56
  • @mario Didn't know that, thanks for teaching me something. Also, if you have a moment, tell me why [my answer](http://stackoverflow.com/questions/5319296/php-url-parsing-disecting/5319313#5319313) is wrong :) – alex Mar 15 '11 at 23:58
  • @alex: Can't say, looks ok. (But: Daily vote limit reached. Wait 1 minute...) – mario Mar 16 '11 at 00:00
  • @mario, I'd probably stick with your original expression if I was looking for a quick way that worked most of the time. Regarding this particular case, `nic.uk` isn't even a valid host, but `www.nic.uk` is. So it seems a bit weird, but there may be nothing wrong with actually considering `www` to be the base domain in this case. – Matthew Mar 16 '11 at 00:02
2

I believe you will need to maintain a list of extensions for the most accurate results.

$possibleExtensions = array(
    '.com',
    '.co.uk',
    '.com.au'
);

// parse_url() needs a protocol.
$str = 'http://' . $str;

// Use parse_url() to take into account any paths
// or fragments that may end up being there.
$host = parse_url($str, PHP_URL_HOST);

foreach($possibleExtensions as $ext) {

    if (preg_match('/' . preg_quote($ext, '/') . '\Z/', $host)) {
       $domainNameSuffix = $ext;
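       // Strip the extension from the host; using $host rather than $str
       // keeps any path or fragment out of the result
       $domainName = substr($host, 0, -strlen($ext));
       // Keep only the last remaining label, e.g. "foo.bar.example" -> "example"
       $domainName = substr(strrchr('.' . $domainName, '.'), 1);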
       var_dump($domainName, $domainNameSuffix);
       break;

    }

}

If you never have any paths or extra stuff, you can of course skip the parse_url() and the http:// adding and removal.

It worked for all your tests.

alex
2

The "domainNameSuffix" is called a top-level domain (TLD for short), or more precisely a public suffix when it spans several labels like .co.uk, and there is no easy way to extract it.

Every country has its own TLD, and some countries have opted to further subdivide theirs. And since the number of subdomains (my.own.subdomain.example.com) is also variable, there is no easy "one-regexp-fits-all".

As mentioned, you need a list. Fortunately for you there are lists publicly available: http://publicsuffix.org/
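
As a rough sketch of how such a list could be used: split the host into labels and look for the longest trailing run of labels that appears in the list. The `$rules` array below is a tiny hand-picked excerpt, not the real file, and `splitByPublicSuffix` is a hypothetical name. The real list also contains wildcard and exception rules, which this sketch ignores.

```php
<?php
// Hand-picked excerpt for illustration only; a real application would
// load the full list published at publicsuffix.org.
$rules = ['com', 'uk', 'co.uk', 'au', 'com.au'];

// Hypothetical helper: returns [name, suffix], or null when no rule matches.
function splitByPublicSuffix(string $host, array $rules): ?array
{
    // Normalise case and drop a trailing root dot before splitting into labels.
    $labels = explode('.', strtolower(rtrim($host, '.')));
    $count = count($labels);
    // Start at 1 so at least one label is left over for the name;
    // earlier iterations test longer candidate suffixes, so the longest match wins.
    for ($i = 1; $i < $count; $i++) {
        $suffix = implode('.', array_slice($labels, $i));
        if (in_array($suffix, $rules, true)) {
            return [$labels[$i - 1], '.' . $suffix];
        }
    }
    return null;
}

var_dump(splitByPublicSuffix('foo.bar.example2.com.au', $rules));
```

Because the longest suffix is tried first, `com.au` is matched before `au`, so the number of subdomains in front of the name does not matter.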

Martin Tournoij
0

There isn't a builtin function for this.

A quick Google search led me to http://www.wallpaperama.com/forums/php-function-remove-domain-name-get-tld-splitter-split-t5824.html

This leads me to believe you need to maintain a list of valid TLDs to split URLs on.

vicTROLLA
  • 2
    Instead of maintaining the TLDs yourself, why not use a pre-maintained list: http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/effective_tld_names.dat – RobertPitt Mar 15 '11 at 23:52
0

Alright chaps, here's how I solved it, for now. Implementation of more domain names will be done as well, at some point in the future. Don't know what technique I'll use, yet.

# Setting options: single- and dual-part domain extensions
$v2_onePart = array(
                "com"
                );
$v2_twoPart = array(
                "co.uk",
                "com.au"
                );

$v2_url         = $_SERVER['SERVER_NAME'];      # "example.com"     OR  "example.com.au"
$v2_bits        = explode(".", $v2_url);        # "example", "com"  OR  "example", "com", "au"
$v2_bits        = array_reverse($v2_bits);      # "com", "example"  OR  "au", "com", "example"      (Reversing to eliminate foo.bar.example.com.au problems.)

# Check two-part extensions first; isset() guards against hosts with too few labels.
if (isset($v2_bits[2]) && in_array($v2_bits[1] . "." . $v2_bits[0], $v2_twoPart)) {
    $v2_class   = $v2_bits[2] . " " . $v2_bits[1] . "_" . $v2_bits[0];  # "example com_au"
} elseif (isset($v2_bits[1]) && in_array($v2_bits[0], $v2_onePart)) {
    $v2_class   = $v2_bits[1] . " " . $v2_bits[0];  # "example com"
}
Dirk v B