
I am trying to compare the domain names of URLs returned by search queries, so I need to parse the domain name out of each URL. Since this is a commonly asked question, I found a fairly comprehensive solution here; the code is reproduced below for convenience:

$url="http://www.seventymm.co.in/browse/buy-home-furnishing-bed-sheets-pack-of-3/2456/1/v2034/0/0/1/2/1/0/go";

echo get_domain($url);

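function get_domain($url)
{
  // parse_url() returns false for malformed URLs and null when there is no host part
  $host = parse_url($url, PHP_URL_HOST);
  if (!is_string($host)) {
    return false;
  }
  // A label of 2 to 64 characters, a dot, then a 2 to 6 character suffix that may itself contain dots (e.g. co.uk)
  if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $host, $regs)) {
    return $regs['domain'];
  }
  return false;
}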

For most URLs this strips the subdomains and the path, returning something like example.com or example.co.uk. The function fails for shorter domain names, which I assume it mistakes for part of the top-level domain: http://mmqb.si.com/ and https://www.si.com/ return mmqb.si.com and www.si.com respectively. For my purposes, I would expect both inputs to return si.com. Is there any way to parse these URLs to si.com while still parsing others to outputs like example.co.uk? Ideally I'd like to accomplish this without hardcoding any reference values to check against, such as a list of accepted top-level domains containing co and uk.
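To make the failure concrete, here is a minimal sketch that applies the same pattern directly to a few host names. The helper name match_domain and the variable $pattern are made up for this illustration and are not part of the linked solution. As far as I can tell, the cause is that the suffix class `[a-z\.]{2,6}` accepts dots and up to six characters, so `si.com` (exactly six characters) is consumed as if it were a suffix like `co.uk`, and the label in front of it becomes the "domain".

```php
<?php
// Same pattern as in get_domain() above, applied directly to host names.
$pattern = '/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i';

// Hypothetical helper, for illustration only: returns the matched
// "domain" group, or null when the pattern does not match.
function match_domain(string $pattern, string $host): ?string
{
    return preg_match($pattern, $host, $regs) ? $regs['domain'] : null;
}

foreach (['www.example.co.uk', 'www.si.com', 'mmqb.si.com'] as $host) {
    echo $host, ' => ', match_domain($pattern, $host), PHP_EOL;
}
// www.example.co.uk => example.co.uk  ("example.co.uk" is too long for the 2-6 character suffix)
// www.si.com => www.si.com            ("si.com" fits the suffix, so "www" is taken as the label)
// mmqb.si.com => mmqb.si.com
```

The match is anchored at the end of the string and the regex engine reports the leftmost starting position that succeeds, which is why the extra label is kept whenever the last two labels together fit into six characters.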

EDIT ADDRESSING DUP

I realize there are many answers showing how to extract a host name from a URL, and I have linked one of them. My question is specifically about how these answers handle short domain names (1 or 2 letters): they often treat them as top-level domains, which results in a bad parse. I am looking for a function that accurately parses short domains from URLs.

yanman1234
  • You know there is a native PHP function for that? [parse_url](http://php.net/manual/de/function.parse-url.php) – Jeff Jul 06 '17 at 15:23
  • That function is called within the code listed. The native function returns subdomains, which I do not want and the regex does a pretty good job of removing, except with the anomaly my question refers to. – yanman1234 Jul 06 '17 at 15:25
  • Possible duplicate of [PHP Getting Domain Name From Subdomain](https://stackoverflow.com/questions/1201194/php-getting-domain-name-from-subdomain) – OuailB Jul 06 '17 at 15:32
  • @OuailB I checked out that answer and its references, and unless I missed something, the algorithms proposed will all consider si as a top-level domain, rather than a domain as it is. – yanman1234 Jul 06 '17 at 15:40
  • "I am looking for a function to accurately parse short domains from urls" - this is the wrong site to find someone to write code for you. That regex you cited is bad in several regards. You get the host name from parse_url() - I don't know what you mean by "short domains". – symcbean Jul 06 '17 at 15:55
  • I am not looking for someone to write code for me, and I never said so. "Looking for a function..." means just that. Both Jeff and OuailB pointed me to possible solutions without writing code, and if someone noticed a change that could be made to the posted code, that could be noted. The cited regex was seen and upvoted on many questions here, so I took it as a promising solution. I have defined short domain names as "1 or 2 letter names", referring to the website title, such as "si" in "si.com". – yanman1234 Jul 06 '17 at 16:00
