10

Is there a way to get the top-level domain name from a URL?

e.g. "https://images.google.com/blah" => "google"

I found this:

var domain = new URL(pageUrl).hostname; 

but it gives me "images.google.com" instead of just "google".

The unit tests I have are:

https://images.google.com   => google
https://www.google.com/blah => google
https://www.google.co.uk/blah => google
https://www.images.google.com/blah => google
sublime
  • Possible duplicate of [Get the domain name of the subdomain Javascript](http://stackoverflow.com/questions/13367376/get-the-domain-name-of-the-subdomain-javascript) – Patrick Moore Sep 19 '14 at 21:27
  • The top level domain is actually the .com part, so I think you're maybe looking for the second-level domain. But what would you expect back from something like video.google.co.uk - the "co" (the second-level domain), "google", or "google.co"? – Bjorn Svensson Sep 19 '14 at 21:30
  • Just "google" - I have mentioned it in the question – sublime Sep 19 '14 at 21:32
  • I thought the real top-level domain was actually com in your case, google being a subdomain of it? – Stranded Kid Jan 07 '16 at 11:29

7 Answers

6

You could do this:

location.hostname.split('.').pop()

EDIT

Saw the change to your question - you would need a list of all TLDs to match against and remove from the hostname, then you could use split('.').pop():

// small example list of suffixes to strip from the end of the URL
var re = new RegExp('\\.(co\\.uk|me|com|us)$')
var secondLevelDomain = 'https://www.google.co.uk'.replace(re, '').split('.').pop() // "google"
Rob M.
  • Note that for this, your regex will be pages long; just check the describing pages for each of these: https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains. This also does not cover unofficial domain registrars who offer hosting, e.g. wordpress.com and blogspot.com, where the real site is after a / but is seen by its owner as "his domain". – edelwater Jul 01 '20 at 20:25
4

This is the simplest solution short of maintaining whitelists and blacklists of top-level domains.

  1. Match only the top-level domain if it has three or more characters: 'xxxx.yyy'

  2. Match the top-level domain and the sub-domain if both have two characters or fewer: 'xxxxx.yy.zz'

  3. Remove the match.

  4. Return everything between the last remaining period and the end of the string.


I broke it into two separate regex rules joined with an OR:

  1. (\.[^\.]*)(\.*$) - last period to end of string, if the top-level domain is >= 3 characters.
  2. (\.[^\.]{0,2})(\.[^\.]{0,2})(\.*$) - top- and sub-domain are both <= 2 characters.

var regex_var = new RegExp(/(\.[^\.]{0,2})(\.[^\.]{0,2})(\.*$)|(\.[^\.]*)(\.*$)/);
var unit_test = 'xxx.yy.zz.'.replace(regex_var, '').split('.').pop();
document.write("Returned user entered domain: " + unit_test + "\n");

var result = location.hostname.replace(regex_var, '').split('.').pop();
document.write("Current Domain: " + result);
Null
  • thank you for solving the second problem! :) Here is a bookmarklet I made to copy a URL formatted in Markdown, with domain name at the end: `javascript:if(typeof%20WxXYnC60==typeof%20alert)WxXYnC60();window.prompt("Copy%20page%20title%20and%20URL","["+document.title+"]("+location.href+")%20("+location.hostname.replace(new%20RegExp(/(\.[^\.]{2})(\.[^\.]{2})(\.*$)|(\.[^\.]*)(\.*$)/),'').split('.').pop()+")");void(0);` – ultracrepidarian Oct 02 '15 at 01:49
  • @Null Will this regex also work with internationalized ccTLDs https://en.wikipedia.org/wiki/Internationalized_country_code_top-level_domain – Vipresh Feb 23 '16 at 06:37
  • While I haven't worked with internationalized encoded domains, as long as there is the same structure of periods separating top-level domains, it will work. – Null Feb 24 '16 at 20:02
4
function getDomainName( hostname ) {
    var TLDs = new RegExp(/\.(com|net|org|biz|ltd|plc|edu|mil|asn|adm|adv|arq|art|bio|cng|cnt|ecn|eng|esp|etc|eti|fot|fst|g12|ind|inf|jor|lel|med|nom|ntr|odo|ppg|pro|psc|psi|rec|slg|tmp|tur|vet|zlg|asso|presse|k12|gov|muni|ernet|res|store|firm|arts|info|mobi|maori|iwi|travel|asia|web|tel)(\.[a-z]{2,3})?$|(\.[^\.]{2,3})(\.[^\.]{2,3})$|(\.[^\.]{2})$/);
    return hostname.replace(TLDs, '').split('.').pop();
}

/*** TEST ***/

var domains = [
    'domain.com',
    'subdomain.domain.com',
    'www.subdomain.domain.com',
    'www.subdomain.domain.info',
    'www.subdomain.domain.info.xx',
    'mail.subdomain.domain.co.uk',
    'mail.subdomain.domain.xxx.yy',
    'mail.subdomain.domain.xx.yyy',
    'mail.subdomain.domain.xx',
    'domain.xx'
];

var result = [];
for (var i = 0; i < domains.length; i++) {
    result.push( getDomainName( domains[i] ) );
}

alert ( result.join(' | ') );

// result: domain | domain | domain | domain | domain | domain | domain | domain | domain | domain
PHC
3

How about this?

location.hostname.split('.').reverse()[1]
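
A quick check (just a sketch, assuming the hostname is taken from new URL() as in the question) shows this passes the plain .com test cases but returns "co" for the .co.uk one:

[
  'https://images.google.com',
  'https://www.google.com/blah',
  'https://www.google.co.uk/blah'
].forEach(function (u) {
  var hostname = new URL(u).hostname;
  console.log(hostname, '=>', hostname.split('.').reverse()[1]);
});
// images.google.com => "google"
// www.google.com    => "google"
// www.google.co.uk  => "co"  (a suffix list is needed for this case)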

kechol
1

Here's my naive take on solving the issue.

url.split('.').reverse()[1].split('//').reverse()[0]

Supports subdomains, but won't support public suffix SLDs.
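
The second split('//') handles the no-subdomain case, where the protocol would otherwise be left attached. A quick check (sketch only):

// 'https://google.com'.split('.') gives ['https://google', 'com'], so reverse()[1]
// is 'https://google' and the split('//') strips the protocol off
console.log('https://google.com'.split('.').reverse()[1].split('//').reverse()[0]);             // "google"
console.log('https://images.google.com/blah'.split('.').reverse()[1].split('//').reverse()[0]); // "google"
console.log('https://www.google.co.uk/blah'.split('.').reverse()[1].split('//').reverse()[0]);  // "co"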

Etienne Martin
0

What you want to extract from the URL is not the top-level domain (TLD). The TLD is the rightmost part, e.g. .com.

Having said that, I don't think there's an easy way to do this, because there are URLs that have two "common" parts like ".co.uk", and I suppose you don't want to extract the ".co" in those cases. You could maybe use a list of existing two-part "TLDs" to check against so that you know when to extract which part.
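
A minimal sketch of that idea (the two-part suffix list below is just an illustrative subset, and getMainLabel is a made-up name, not an existing API):

// Sketch: strip a known two-part suffix before taking the last label.
// The list is a tiny illustrative subset; a real one would come from the Public Suffix List.
var twoPartSuffixes = ['co.uk', 'org.uk', 'com.au', 'co.jp'];

function getMainLabel(url) {
  var parts = new URL(url).hostname.split('.');
  var lastTwo = parts.slice(-2).join('.');
  // If the hostname ends in a known two-part suffix, the label we want is third from the end
  var offset = twoPartSuffixes.indexOf(lastTwo) !== -1 ? 3 : 2;
  return parts[parts.length - offset];
}

console.log(getMainLabel('https://www.google.co.uk/blah')); // "google"
console.log(getMainLabel('https://images.google.com'));     // "google"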

0

I just wanted to add something, since this question comes up at the top of Google and I was searching for it.

You can download the Wikipedia dataset of all URLs (33 MB download) and use it as the test set for your test cases. Another test source is the Alexa top 1,000,000 sites, and/or a download of some popular blogs with the URLs parsed out of them.

First of all, I'm scoping this to retrieving the unique URI for a certain "object", since every HTML page can in principle point to a different favicon that represents the object - "what is the domain for the owner". I'm also scoping it to the Alexa top 10,000,000 sites only. You can then verify against the Google Favicon service how far this matches your own algorithm for retrieving e.g. favicons, and see if they return the same icons.

  • First of all you need to know the official top-level domains. They are listed here: https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains.
  • Clicking each entry gives the official subdomains / suffixes (click the first column on the Wikipedia page). These also need to be put in your array, since whatever "registers" a domain is not necessarily the domain to take the favicon from. There are the weirdest combinations out there, and not all are that clear: e.g. the numbered ones are (e.g. https://en.wikipedia.org/wiki/.bg), but the ones officially based on job types are more vague. All of these are keys in your array, since in the first place you are looking for the first word behind this suffix - that is the thing someone owns and needs a favicon to represent it. Mozilla maintains a list of this, but you will have to append to it. The project https://github.com/lupomontero/psl might be helpful (based on https://publicsuffix.org/; see the sketch at the end of this answer), but I noticed during testing that it does not cover all cases.
  • Then there are "unofficial" domain registers: e.g. Facebook games live under /facebook.com/xxs and have their own icons, so you need to put these in the array as well in order to find the unique icons for those URIs. There are quite a few entries in the top Alexa hits that are not the main domain but a /user/john path that is the most visited (and which has another icon). Scoping to the Alexa top 10,000,000 helps to limit this to the most popular stuff only.
  • Once you have this array and you are at 80% matching with your test set, you can concentrate on the cases that are not covered by the above, e.g. all kinds of redirects and much weirder stuff like certain nginx servers that serve weird HTTP statuses and were probably custom-modded by someone, etc.
  • Another thing to take care of, if you are using this in a globalized/localized application, is having the same concept refer to both a language and the domain, e.g. wikipedia.en and wikipedia.nl. In this case the "link you click" to the same concept has to take these properties into account, as larger portals do.
  • What is still missing then is the case where e.g. abcd.com has both defgh.abcd.com and news.abcd.com, and defgh.abcd.com is something completely different, or worse, redirects to a completely different company. Here you need to add some tricks, e.g. checking the metadata or the icons, to be sure that it is still part of the main domain and not something completely different.

This is quite some work, and keeping it up to date is even more. My advice is not to start with the simplistic cases, e.g. https://en.wikipedia.org/wiki/.tj, but with the difficult ones first, e.g. https://en.wikipedia.org/wiki/.br. You will need to make it a dictionary / array, since ".uk" and ".gov.uk" are different keys.
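
As a starting point, here is a minimal sketch of what the psl package mentioned above returns (assuming Node, npm install psl, and the parse()/get() functions as described in its README):

// Sketch: using psl (https://github.com/lupomontero/psl) to split a hostname
// against the Public Suffix List; field names follow the package's README.
var psl = require('psl');

var parsed = psl.parse('www.google.co.uk');
console.log(parsed.sld);    // expected: "google" (the label the question asks for)
console.log(parsed.domain); // expected: "google.co.uk"

// For a full URL, take the hostname first:
var hostname = new URL('https://images.google.com/blah').hostname;
console.log(psl.get(hostname)); // expected: "google.com"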

edelwater