I am attempting to compare domain names from URLs returned by search queries, so I need to parse the domain name out of each URL. Since this is a commonly asked question, I found a well-encompassing solution here (code supplied below for convenience):
$url = "http://www.seventymm.co.in/browse/buy-home-furnishing-bed-sheets-pack-of-3/2456/1/v2034/0/0/1/2/1/0/go";
echo get_domain($url);

function get_domain($url)
{
    $urlobj = parse_url($url);
    $domain = $urlobj['host'];
    if (preg_match('/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i', $domain, $regs)) {
        return $regs['domain'];
    }
    return false;
}
For most URLs this strips the subdomains and path from the given URL, returning something like example.com or example.co.uk. However, the function fails for shorter domain names, which I assume it mistakes for part of the top-level domain: http://mmqb.si.com/ and https://www.si.com/ return mmqb.si.com and www.si.com respectively, whereas for my purposes I would expect both inputs to return si.com. Is there any way to parse these URLs down to si.com while still allowing other URLs to be parsed to outputs like example.co.uk? Ideally I'd like to accomplish this without hardcoding any reference values to check against, such as placing co and uk on a list of accepted top-level domains.
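For what it's worth, the mismatch seems traceable to the `[a-z\.]{2,6}` tail of the regex, which is allowed to contain dots: "si.com" is exactly 6 characters, so the tail swallows it whole and the preceding label gets pulled into the capture. A minimal sketch reproducing this directly on host strings (nothing here beyond the regex from the quoted function):

```php
<?php
// The suffix part [a-z\.]{2,6} may itself contain a dot, so any
// "label.tld" string of 2-6 characters matches it in full.
$pattern = '/(?P<domain>[a-z0-9][a-z0-9\-]{1,63}\.[a-z\.]{2,6})$/i';

preg_match($pattern, 'mmqb.si.com', $m);
echo $m['domain'], "\n";   // mmqb.si.com - "si.com" (6 chars) consumed by the tail

preg_match($pattern, 'www.seventymm.co.in', $m);
echo $m['domain'], "\n";   // seventymm.co.in - "co.in" (5 chars) fits the tail
```

So the pattern cannot structurally distinguish si.com (a registrable domain) from co.uk (a public suffix); both are just short dotted strings to it.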
EDIT ADDRESSING DUPLICATE
I realize there are many answers out there showing how to extract a host name from a URL; I have linked one above. My question is specifically about how those answers handle short domain names (1 or 2 letters), which they often treat as top-level domains, resulting in a bad parse. I am looking for a function that accurately parses short domains from URLs.