
I have some domains I want to split but can't figure out the regex...

I have:

  • http://www.google.com/tomato
  • http://int.google.com
  • http://google.co.uk

Given any of these, I'm trying to extract only google. Any ideas?

David19801

4 Answers


Why are you trying to use regex? There are plenty of native functions available to you, such as:

$host = parse_url($url, PHP_URL_HOST);
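
Note that on its own this gives you the whole hostname, not just the middle label. A quick check against the URLs in the question:

echo parse_url('http://www.google.com/tomato', PHP_URL_HOST); // www.google.com
echo parse_url('http://int.google.com', PHP_URL_HOST);        // int.google.com
echo parse_url('http://google.co.uk', PHP_URL_HOST);          // google.co.uk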

Update: give this a go. It may need improving, but it's better than regex, in my opinion.

function determainDomainName($url)
{
    $hostname = parse_url($url, PHP_URL_HOST);
    $parts = explode(".", $hostname);

    switch (count($parts))
    {
        case 1:
            return $parts[0]; // bare hostname, nothing to strip
        case 2:
            if ($parts[0] == "www") // The most common subdomain
            {
                return $parts[1]; // Bypass subdomain / return next segment
            }

            if ($parts[1] == "co") // Possible in_array here for multiples, but first segment of a double-barrel TLD
            {
                return $parts[0]; // Bypass double-barrel TLDs
            }

            return $parts[0]; // Plain domain.tld, e.g. google.com
        default:
            // Have a guess: order the segments by length, longest first,
            // so google will always come above com, co, uk, www, cdn, ww1, ww2 etc.
            usort($parts, "mysort");
            return $parts[0];
    }
}

function mysort($a, $b) {
    return strlen($b) - strlen($a); // longest string first
}

Add the above 2 functions to your libraries etc.

Then use like so:

$urls = array(
    'http://www.google.com/tomato',
    'http://int.google.com',
    'http://google.co.uk'
);

foreach($urls as $url)
{
    echo determainDomainName($url) . "\n";
}

They will all echo google.

see @ http://codepad.org/pA5KWckb

RobertPitt
  • Updated. To programmatically detect the domain name, you should not try to rely on regex, as it can get very dependent and messy. – RobertPitt Feb 10 '11 at 22:34

The answer here might be what you're looking for.

Getting parts of a URL (Regex)
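
As a rough sketch of that approach (purely illustrative; the linked answer has a fuller pattern), capturing the host part looks something like:

$url = 'http://www.google.com/tomato';

if (preg_match('~^(\w+)://([^/]+)(/.*)?$~', $url, $m)) {
    echo $m[2]; // www.google.com, you still have to strip the www. and TLD yourself
}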

aendra
  • That said, regex is very memory-intensive. I'm guessing parse_url() is much less resource-hungry than the regex link I posted. – aendra Feb 10 '11 at 22:04
$res = preg_replace("/^(http:\/\/)([a-z_\-]+\.)*([a-z_\-]+)\.(com|co\.uk|net)(\/.*)?$/im", "\$3", $in);

Add as many endings as you know.
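
For example, run against the URLs from the question, the pattern above should give google for each:

$urls = array(
    'http://www.google.com/tomato',
    'http://int.google.com',
    'http://google.co.uk'
);

foreach ($urls as $in) {
    echo preg_replace("/^(http:\/\/)([a-z_\-]+\.)*([a-z_\-]+)\.(com|co\.uk|net)(\/.*)?$/im", "\$3", $in) . "\n";
}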

Edit: made a mistake :-(

SergeS

You can do this on a best-bet basis. The last part of the URL is always the TLD (and an optional root dot). And you are basically looking for any preceding word that is longer than 2 letters:

$url = "http://www.google.co.uk./search?q=..";

preg_match("#http://
            (?:[^/]+\.)*       # cut off any preceeding www*
            ([\w-]{3,})        # main domain name
            (\.\w\w)?          # two-letter second level domain .co
            \.\w+\.?           # TLD
            (/|:|$)            # end regex with / or : or string end
            #x", 
      $url, $match);
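
For the example URL above, the first capture group is the name you are after:

echo $match[1]; // google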

If you expect any longer second-level domains (.com maybe?), then add another \w. But this is not very generic; you would actually need a list of TLDs where this is allowed.
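
For instance, widening that optional group so it accepts two or three letters (an illustrative tweak, not part of the pattern above) would look like:

            (\.\w{2,3})?       # second-level domain, .co or .com style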

mario