1

I want to extract the subdomain and domain part for domains with arbitrary top level extensions.

Thus:

sub1.domain1.com --> Extract subdomain=sub1, domain=domain1.com

sub2.domain2.co.in --> Extract subdomain=sub2, domain=domain2.co.in

sub3.domain3.co.uk --> Extract subdomain=sub3, domain=domain3.co.uk

sub4.domain4.us --> Extract subdomain=sub4, domain=domain4.us

mydomain.com --> Extract subdomain="", domain=mydomain.com

mydomain.co.in --> Extract subdomain="", domain=mydomain.co.in

I am a bit confused about how to handle TLDs like co.in, co.uk, etc. I could brute-force this by checking whether the last 5 characters contain a dot (.), but I am wondering whether there is a regex way to do it.


NOTE 1: As TToni pointed out, there can be ambiguities. However, I will put some constraints:

1) The "domain name" part (without the extension) will be at least 4 characters.

2) The TLD extension part (.com, .co.in, .us, etc.) will either have a single dot or, if it has two dots, the penultimate part (the sub-TLD) will have at most 3 characters.

I have a feeling that these constraints make the problem unambiguous and solvable using regex.

(Also, assume "www." has been stripped out already).


NOTE 2:

Example of above constraints

sub.dom.in --> domain="sub.dom.in"

sub.dom1.in --> domain="dom1.in", subdomain="sub"

This may sound buggy, but the reason is that I want this for internal purposes: all my domains have at least 4 characters in them, and all extensions either have a single dot or a penultimate part of at most 3 characters.


NOTE 3: I have a feeling I might make mistakes by using regex for this, so I am thinking of doing it the string search way.
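For illustration, here is a minimal sketch of the string-based approach under the constraints in Notes 1 and 2. It is written in Python only as an example, and the function name `split_host` is hypothetical. It assumes the host has already been stripped of "www." and relies entirely on the "penultimate part has at most 3 characters" rule, so it is not a general-purpose solution.

```python
def split_host(host):
    """Split a host into (subdomain, domain) using the length constraints.

    Assumption: if there are at least three labels and the second-to-last
    label has at most 3 characters, that label is treated as a sub-TLD
    (as in co.in or co.uk).
    """
    labels = host.lower().strip(".").split(".")
    if len(labels) >= 3 and len(labels[-2]) <= 3:
        keep = 3  # name + sub-TLD + TLD, e.g. domain2.co.in
    else:
        keep = 2  # name + TLD, e.g. domain1.com
    return ".".join(labels[:-keep]), ".".join(labels[-keep:])


print(split_host("sub2.domain2.co.in"))  # ('sub2', 'domain2.co.in')
print(split_host("sub.dom1.in"))         # ('sub', 'dom1.in')
```

Note that `sub.dom.in` comes back as a domain with an empty subdomain, which is the behaviour described in Note 2.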

regards,

JP

  • Not quite the same, but take a look at http://stackoverflow.com/questions/3853338/remove-domain-extension/3853473#3853473 – Gumbo Nov 29 '10 at 14:14
  • I think you cannot fully solve this with a regex because you get ambiguities. Consider "b.c.eu" for example. Which one is the domain? – TToni Nov 29 '10 at 14:15
  • I agree with TToni. I will amend my question. For my purpose, assume that the domain name will be at least 4 characters. I will also add one more constraint after wording it formally. –  Nov 29 '10 at 14:18
  • So, the domain is "all non-dot characters immediately before the first dot which occurs at least three characters from the end, and all the characters which occur after them", and the subdomain is "everything that's not in the domain, without the final dot"? – Curtis Nov 29 '10 at 14:38
  • "solvable using regex" Just because you have a hammer doesn't mean your problem is a nail – The Archetypal Paul Nov 29 '10 at 14:43
  • @ Paul: lol.... I have ditched the idea of regex for this one. But thanks all for the suggestions and time. –  Nov 29 '10 at 14:50
  • @JP19, I'm a fan of "Say what you mean, simply and directly" (from Kernighan and Plauger's classic The Elements of Programming Style). Sometimes that's with regexps, but there is a tendency (quite evident on SO) to overuse that particular golden hammer – The Archetypal Paul Nov 29 '10 at 16:23

4 Answers

4

Not sure you need regexes. Split the domain name on '.', then apply some heuristics to the result depending on the rightmost part, e.g. if the last part is "com", then the domain is the last two parts and the subdomain is the rest.

Or keep a list of "top-level" domains (in quotes because the meaning differs from the usual one), iterate over the list, and match each entry against the right end of the domain name. On a match, remove the top-level part and return the rest as the subdomain. This could be put in a regex, but with a loss of clarity. The list would look something like

".edu", ".gov", ".mil", ".com", ".co.uk", ".gov.uk", ".nhs.uk", [...]

The regex would look something like

 \.(edu|gov|mil|com|co\.uk|gov\.uk|nhs\.uk|[...])$
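As a rough sketch of the list-based variant, the following Python snippet matches the right end of the host against a short suffix list. The list only repeats the examples above and is deliberately incomplete, and the name `split_by_suffix` is hypothetical. Longer suffixes are tried first so that a multi-part suffix is never shadowed by a shorter one.

```python
# Illustrative, incomplete list taken from the examples above.
SUFFIXES = ("edu", "gov", "mil", "com", "co.uk", "gov.uk", "nhs.uk")


def split_by_suffix(host, suffixes=SUFFIXES):
    """Return (subdomain, domain), or None if no listed suffix matches."""
    host = host.lower().rstrip(".")
    # Try the longest suffixes first.
    for suffix in sorted(suffixes, key=len, reverse=True):
        if host.endswith("." + suffix):
            rest = host[: -(len(suffix) + 1)]      # drop ".suffix"
            subdomain, _, name = rest.rpartition(".")
            return subdomain, name + "." + suffix
    return None


print(split_by_suffix("sub3.domain3.co.uk"))  # ('sub3', 'domain3.co.uk')
print(split_by_suffix("mydomain.com"))        # ('', 'mydomain.com')
```

Returning `None` for an unlisted suffix makes gaps in the list visible instead of silently guessing.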
Bart Kiers
The Archetypal Paul
  • `\.(edu)|(com)$` matches either `.edu` (not necessarily followed by the end-of-input) or `com` followed by the end-of-input (without the `.`!). You probably meant `\.(edu|com|mil|etc)$`. Also, putting `[..]` in a regex might be perceived as an odd (but legal) character class, whereas you meant it to be something else. – Bart Kiers Nov 29 '10 at 14:34
  • Thanks, typed too quickly. Fixed. And yes, the [...] is meant to mean "and so on" – The Archetypal Paul Nov 29 '10 at 14:39
  • Yeah, I figured that. I took the liberty to fix the un-escaped `.`'s in your example regex. – Bart Kiers Nov 29 '10 at 14:43
0

You can use a regex or any built-in function, but you'll never get correct results for complex domain zones (.co.uk, .a.bg, .fuso.aichi.jp, etc.).

For correct extraction you need a library that uses the Public Suffix List. I recommend TLDExtract.

Here is some sample code:

$extract = new LayerShifter\TLDExtract\Extract();

$result = $extract->parse('mydomain.co.in');
$result->getSubdomain(); // will return null
$result->getHostname(); // will return 'mydomain'
$result->getSuffix(); // will return 'co.in'
$result->getFullHost(); // will return 'mydomain.co.in'
$result->getRegistrableDomain(); // will return 'mydomain.co.in'
Oleksandr Fediashov
0

You can use this:

(\b\w+\b(?:\.\b\w+\b)*?){0,1}?\.?(\b\w+\b(?:\.\b\w{1,3}\b)?\.\b\w{1,3}\b)

It doesn't look very pretty, but the idea behind it is simple. It captures the subdomain in the first group and the domain in the second. It also splits things like "sub1.sub2.sub3.domain2.co.in" into "sub1.sub2.sub3" and "domain2.co.in".
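To see how the pattern behaves, here is a small Python harness around it. The wrapper name `split_with_regex` and the use of `fullmatch` (so the whole host must match) are assumptions made for this sketch and are not part of the pattern itself.

```python
import re

# The pattern from the answer, unchanged.
PATTERN = re.compile(
    r"(\b\w+\b(?:\.\b\w+\b)*?){0,1}?\.?(\b\w+\b(?:\.\b\w{1,3}\b)?\.\b\w{1,3}\b)"
)


def split_with_regex(host):
    """Return (subdomain, domain), or None if the host does not match."""
    m = PATTERN.fullmatch(host)
    if m is None:
        return None
    # Group 1 is unset when there is no subdomain; normalise that to "".
    return m.group(1) or "", m.group(2)


print(split_with_regex("sub2.domain2.co.in"))  # ('sub2', 'domain2.co.in')
print(split_with_regex("sub.dom.in"))          # ('', 'sub.dom.in')
```

One limitation worth noting: `\w` does not match hyphens, so a host such as "my-domain.com" is rejected by `fullmatch` here.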

alpha-mouse
  • The problem is that you cannot know what the actual domain is. In the case of the sample: domain2.co.in "co" might also be the domain (e.g. co.com). So you need to use a list of all toplevel domains. – morja Nov 29 '10 at 17:10
0

I collected the "top-level" domain names. It might be ugly, but it works.

$fix = array('com', 'edu', 'gov', 'int', 'mil', 'net', 'org', 'biz', 'info', 'pro', 'name', 'museum', 'coop', 'aero', 'xxx', 'idv', 'al', 'dz', 'af', 'ar', 'ae', 'aw', 'om', 'az', 'eg', 'et', 'ie', 'ee', 'ad', 'ao', 'ai', 'ag', 'at', 'au',
    'mo', 'bb', 'pg', 'bs', 'pk', 'py', 'ps', 'bh', 'pa', 'br', 'by', 'bm', 'bg', 'mp', 'bj', 'be', 'is', 'pr', 'ba', 'pl',
    'bo', 'bz', 'bw', 'bt', 'bf', 'bi', 'bv', 'kp', 'gq', 'dk', 'de', 'tl', 'tp', 'tg', 'dm', 'do', 'ru', 'ec', 'er', 'fr',
    'fo', 'pf', 'gf', 'tf', 'va', 'ph', 'fj', 'fi', 'cv', 'fk', 'gm', 'cg', 'cd', 'co', 'cr', 'gg', 'gd', 'gl', 'ge', 'cu',
    'gp', 'gu', 'gy', 'kz', 'ht', 'kr', 'nl', 'an', 'hm', 'hn', 'ki', 'dj', 'kg', 'gn', 'gw', 'ca', 'gh', 'ga', 'kh', 'cz',
    'zw', 'cm', 'qa', 'ky', 'km', 'ci', 'kw', 'cc', 'hr', 'ke', 'ck', 'lv', 'ls', 'la', 'lb', 'lt', 'lr', 'ly', 'li', 're',
    'lu', 'rw', 'ro', 'mg', 'im', 'mv', 'mt', 'mw', 'my', 'ml', 'mk', 'mh', 'mq', 'yt', 'mu', 'mr', 'us', 'um', 'as', 'vi',
    'mn', 'ms', 'bd', 'pe', 'fm', 'mm', 'md', 'ma', 'mc', 'mz', 'mx', 'nr', 'np', 'ni', 'ne', 'ng', 'nu', 'no', 'nf', 'na',
    'za', 'aq', 'gs', 'eu', 'pw', 'pn', 'pt', 'jp', 'se', 'ch', 'sv', 'ws', 'yu', 'sl', 'sn', 'cy', 'sc', 'sa', 'cx', 'st',
    'sh', 'kn', 'lc', 'sm', 'pm', 'vc', 'lk', 'sk', 'si', 'sj', 'sz', 'sd', 'sr', 'sb', 'so', 'tj', 'tw', 'th', 'tz', 'to',
    'tc', 'tt', 'tn', 'tv', 'tr', 'tm', 'tk', 'wf', 'vu', 'gt', 've', 'bn', 'ug', 'ua', 'uy', 'uz', 'es', 'eh', 'gr', 'hk',
    'sg', 'nc', 'nz', 'hu', 'sy', 'jm', 'am', 'ac', 'ye', 'iq', 'ir', 'il', 'it', 'in', 'id', 'uk', 'vg', 'io', 'jo', 'vn',
    'zm', 'je', 'td', 'gi', 'cl', 'cf', 'cn', 'ac', 'ad', 'ae', 'af', 'ag', 'ai', 'al', 'am', 'an', 'ao', 'aq', 'ar', 'as',
    'at', 'au', 'aw', 'az', 'ba', 'bb', 'bd', 'be', 'bf', 'bg', 'bh', 'bi', 'bj', 'bm', 'bn', 'bo', 'br', 'bs', 'bt', 'bv',
    'bw', 'by', 'bz', 'ca', 'cc', 'cd', 'cf', 'cg', 'ch', 'ci', 'ck', 'cl', 'cm', 'cn', 'co', 'cr', 'cu', 'cv', 'cx', 'cy',
    'cz', 'de', 'dj', 'dk', 'dm', 'do', 'dz', 'ec', 'ee', 'eg', 'eh', 'er', 'es', 'et', 'eu', 'fi', 'fj', 'fk', 'fm', 'fo',
    'fr', 'ga', 'gd', 'ge', 'gf', 'gg', 'gh', 'gi', 'gl', 'gm', 'gn', 'gp', 'gq', 'gr', 'gs', 'gt', 'gu', 'gw', 'gy', 'hk',
    'hm', 'hn', 'hr', 'ht', 'hu', 'id', 'ie', 'il', 'im', 'in', 'io', 'iq', 'ir', 'is', 'it', 'je', 'jm', 'jo', 'jp', 'ke',
    'kg', 'kh', 'ki', 'km', 'kn', 'kp', 'kr', 'kw', 'ky', 'kz', 'la', 'lb', 'lc', 'li', 'lk', 'lr', 'ls', 'lt', 'lu', 'lv',
    'ly', 'ma', 'mc', 'md', 'mg', 'mh', 'mk', 'ml', 'mm', 'mn', 'mo', 'mp', 'mq', 'mr', 'ms', 'mt', 'mu', 'mv', 'mw', 'mx',
    'my', 'mz', 'na', 'nc', 'ne', 'nf', 'ng', 'ni', 'nl', 'no', 'np', 'nr', 'nu', 'nz', 'om', 'pa', 'pe', 'pf', 'pg', 'ph',
    'pk', 'pl', 'pm', 'pn', 'pr', 'ps', 'pt', 'pw', 'py', 'qa', 're', 'ro', 'ru', 'rw', 'sa', 'sb', 'sc', 'sd', 'se', 'sg',
    'sh', 'si', 'sj', 'sk', 'sl', 'sm', 'sn', 'so', 'sr', 'st', 'sv', 'sy', 'sz', 'tc', 'td', 'tf', 'tg', 'th', 'tj', 'tk',
    'tl', 'tm', 'tn', 'to', 'tp', 'tr', 'tt', 'tv', 'tw', 'tz', 'ua', 'ug', 'uk', 'um', 'us', 'uy', 'uz', 'va', 'vc', 've',
    'vg', 'vi', 'vn', 'vu', 'wf', 'ws', 'ye', 'yt', 'yu', 'yr', 'za', 'zm', 'zw');

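function get_domain($url){
   global $fix;
   // parse_url() finds no host in a bare name such as 'example.com';
   // pass a full URL with a scheme, e.g. 'http://example.com/'.
   $host = parse_url($url, PHP_URL_HOST);
   $list = explode('.', $host);
   $res = array();
   // Walk the labels from right to left, collecting known suffix labels.
   for ($i = count($list) - 1; $i >= 0; $i--) {
      $res[] = $list[$i];
      // The first label that is not a known suffix is the domain name; stop there.
      if (!in_array($list[$i], $fix)) {
         break;
      }
   }
   return implode('.', array_reverse($res));
}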
zyanlu