
I have some domains I want to split but can't figure out the regex...

I have:

  • http://www.google.com/tomato
  • http://int.google.com
  • http://google.co.uk

Given any of these, I'm trying to extract only google. Any ideas?

David19801

4 Answers


Why are you trying to use regex? There are plenty of native functions available to you, such as:

$host = parse_url($url, PHP_URL_HOST);
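
Note that on its own this gives you the whole hostname, not just the middle label. A quick check against the URLs in the question:

echo parse_url('http://www.google.com/tomato', PHP_URL_HOST); // www.google.com
echo parse_url('http://int.google.com', PHP_URL_HOST);        // int.google.com
echo parse_url('http://google.co.uk', PHP_URL_HOST);          // google.co.uk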

Update: give this a go. It may need improving, but it's better than regex, in my opinion.

function determainDomainName($url)
{
    $hostname = parse_url($url, PHP_URL_HOST);
    $parts = explode(".", $hostname);

    switch (count($parts))
    {
        case 1:
            return $parts[0]; // bare hostname, nothing to strip
        case 2:
            if ($parts[0] == "www") // The most common subdomain
            {
                return $parts[1]; // Bypass subdomain / return next segment
            }

            if ($parts[1] == "co") // Possible in_array here for multiples, but first segment of a double-barrel TLD
            {
                return $parts[0]; // Bypass double-barrel TLDs
            }

            return $parts[0]; // Plain domain.tld, e.g. google.com
        default:
            // Have a guess: order the segments by length, longest first,
            // so google will always come above com, co, uk, www, cdn, ww1, ww2 etc.
            usort($parts, "mysort");
            return $parts[0];
    }
}

function mysort($a, $b) {
    return strlen($b) - strlen($a); // longest string first
}

Add the above 2 functions to your libraries etc.

Then use like so:

$urls = array(
    'http://www.google.com/tomato',
    'http://int.google.com',
    'http://google.co.uk'
);

foreach($urls as $url)
{
    echo determainDomainName($url) . "\n";
}

They will all echo google.

see @ http://codepad.org/pA5KWckb

RobertPitt
  • Updated. To programmatically detect the domain name, you should not try to rely on regex, as it can get very dependent and messy. – RobertPitt Feb 10 '11 at 22:34

The answer here might be what you're looking for.

Getting parts of a URL (Regex)
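
As a rough sketch of that approach (purely illustrative; the linked answer has a fuller pattern), capturing the host part looks something like:

$url = 'http://www.google.com/tomato';

if (preg_match('~^(\w+)://([^/]+)(/.*)?$~', $url, $m)) {
    echo $m[2]; // www.google.com, you still have to strip the www. and TLD yourself
}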

aendra
  • That said, regex is very memory-intensive. I'm guessing parse_url() is much less resource-hungry than the regex link I posted. – aendra Feb 10 '11 at 22:04
$res = preg_replace("/^(http:\/\/)([a-z_\-]+\.)*([a-z_\-]+)\.(com|co\.uk|net)(\/.*)?$/im", "\$3", $in);

Add as many endings as you know.
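
For example, run against the URLs from the question, the pattern above should give google for each:

$urls = array(
    'http://www.google.com/tomato',
    'http://int.google.com',
    'http://google.co.uk'
);

foreach ($urls as $in) {
    echo preg_replace("/^(http:\/\/)([a-z_\-]+\.)*([a-z_\-]+)\.(com|co\.uk|net)(\/.*)?$/im", "\$3", $in) . "\n";
}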

Edit: made a mistake :-(

SergeS

You can do this on a best-bet basis. The last part of the URL is always the TLD (and an optional root dot). And you are basically looking for any preceding word that is longer than 2 letters:

$url = "http://www.google.co.uk./search?q=..";

preg_match("#http://
            (?:[^/]+\.)*       # cut off any preceeding www*
            ([\w-]{3,})        # main domain name
            (\.\w\w)?          # two-letter second level domain .co
            \.\w+\.?           # TLD
            (/|:|$)            # end regex with / or : or string end
            #x", 
      $url, $match);
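
For the example URL above, the first capture group is the name you are after:

echo $match[1]; // google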

If you expect any longer second-level domains (.com maybe?), then add another \w. But this is not very generic; you would actually need a list of TLDs where this is allowed.
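
For instance, widening that optional group so it accepts two or three letters (an illustrative tweak, not part of the pattern above) would look like:

            (\.\w{2,3})?       # second-level domain, .co or .com style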

mario