Extracting top-level and second-level domain from a URL using regex

Question

How can I extract only top-level and second-level domain from a URL using regex? I want to skip all lower level domains. Any ideas?

Vasili Syrakis · Answer 1 · 2015-02-11T23:08:53.487

19

Here's my idea,

Match anything that isn't a dot, three times, from the end of the line using the $ anchor.

The last match from the end of the string should be optional to allow for .com.au or .co.nz type of domains.

Both the last and second-to-last parts match only 2-3 characters, so that they are not mistaken for a second-level domain name.

Regex:

[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$
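
As a rough way to try this pattern, here is a minimal JavaScript sketch. The helper name `matchDomain` and the sample hostnames are illustrative assumptions, not part of the original answer. It shows cases the pattern handles as well as two of the failure cases raised in the comments.

```javascript
// The pattern from the answer, unchanged.
const pattern = /[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/;

// Hypothetical helper: returns the matched text, or null when nothing matches.
function matchDomain(host) {
  const m = host.match(pattern);
  return m ? m[0] : null;
}

console.log(matchDomain('mail.google.com'));    // 'google.com'
console.log(matchDomain('www.example.com.au')); // 'example.com.au'
console.log(matchDomain('www.rgj.com'));        // 'www.rgj.com' (3-letter name treated as part of the suffix)
console.log(matchDomain('example.agency'));     // null (TLD longer than 3 characters)
```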

Demonstration:

Regex101 Example

edited Feb 11 '15 at 23:08

answered Jan 16 '14 at 22:41

Vasili Syrakis


What about domains like "police.uk" or "parliament.uk"? More about .uk domains here: https://en.wikipedia.org/wiki/.uk – LukasMac Jun 25 '15 at 11:55
This regex only works for a bare domain and fails for a full URL. Ex: "www.google.com.bd/abc" will return "com.bd/abc" – priojeet priyom Sep 14 '19 at 06:44

This will also now fail for any new TLDs like .computer or .business. – brandonscript Oct 13 '19 at 22:27
This also doesn't work for 3 letter domain names like www.rgj.com or account.app.com... – Oskar Austegard Dec 02 '19 at 23:05
The TLD may be longer than 3 letters, e.g. ".agency" – Konstantin Bogomolov Jan 15 '20 at 09:44
In general it is a bad idea nowadays to rely on the number of letters in the TLD, SLD, domain, scheme, host or path. It didn't match complex URLs such as https://subdomain.sld.tld/folder_domain. – Maxime Culea Feb 17 '20 at 10:06
Downvoted for the reasons above. Use the Public Suffix List https://publicsuffix.org/ – ChatGPT Apr 18 '20 at 02:31

score 17 · Answer 2 · edited Jun 20 '20 at 09:12

Updated 2019

This is an old question, and the challenge here is a lot more complicated as we start adding new vanity TLDs and more ccTLD second level domains (e.g. .co.uk, .org.uk). So much so, that a regular expression is almost guaranteed to return false positives or negatives.

The only reliable way to get the primary host is to consult a data source that knows about these suffixes, like the Public Suffix List.

There are several open-source libraries out there that you can use, like psl, or you can write your own.

Usage for psl is quite intuitive. From their docs:

var psl = require('psl');

// Parse domain without subdomain
var parsed = psl.parse('google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // null

// Parse domain with subdomain
var parsed = psl.parse('www.google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // 'www'

// Parse domain with nested subdomains
var parsed = psl.parse('a.b.c.d.foo.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'foo'
console.log(parsed.domain); // 'foo.com'
console.log(parsed.subdomain); // 'a.b.c.d'
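
To illustrate the idea behind a suffix-list lookup, here is a minimal sketch. It is not how psl is implemented: the `SUFFIXES` set is a tiny hand-written sample (an assumption for illustration; the real Public Suffix List has thousands of entries plus wildcard and exception rules, which are ignored here), and `registrableDomain` is a hypothetical helper name. It also shows one way to reduce a full URL to a hostname first, using the standard `URL` class.

```javascript
// Tiny illustrative sample, NOT the real Public Suffix List.
const SUFFIXES = new Set(['com', 'uk', 'co.uk', 'org.uk', 'au', 'com.au']);

// Hypothetical helper: returns the suffix plus one label to its left, or null.
function registrableDomain(input) {
  // Accept either a full URL or a bare hostname.
  const host = input.includes('://') ? new URL(input).hostname : input;
  const labels = host.toLowerCase().split('.');

  // Try the longest candidate suffix first, then progressively shorter ones.
  for (let i = 0; i < labels.length; i++) {
    const suffix = labels.slice(i).join('.');
    if (SUFFIXES.has(suffix)) {
      // A registrable domain needs one more label in front of the suffix.
      return i > 0 ? labels.slice(i - 1).join('.') : null;
    }
  }
  return null; // suffix not in our sample set
}

console.log(registrableDomain('https://www.google.co.uk/search?q=x')); // 'google.co.uk'
console.log(registrableDomain('a.b.c.d.foo.com'));                     // 'foo.com'
```

Checking the longest suffix first is what lets `co.uk` win over `uk`, which is the case the simple regexes in this thread cannot handle.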

Old answer

You could use this:

(\w+\.\w+)$

Without more details (a sample file, the language you're using), it's hard to discern exactly whether this will work.

Example: http://regex101.com/r/wD8eP2

score 4 · Answer 3 · edited Jun 20 '20 at 09:12


Also, you can likely do that with some expression similar to,

^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$

and add as many capturing groups as you need to capture the components of a URL.

Demo

If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com, where you can also see how it matches against some sample inputs.

RegEx Circuit

jex.im visualizes regular expressions.


answered Oct 25 '19 at 23:51

Emma


Unfortunately it didn't match `https://stackoverflow.com/questions/21173734/extracting-top-level-and-second-level-domain-from-a-url-using-regex`. – USauter Feb 18 '22 at 10:10

score 3 · Answer 4 · edited Aug 17 '20 at 12:24

For anyone using JavaScript and wanting a simple way to extract the top and second level domains, I ended up doing this:

'example.aus.com'.match(/\.\w{2,3}\b/g).join('')

This matches every period followed by two or three word characters and then a word boundary.

Here are some example outputs:

'example.aus.com'       // .aus.com
'example.austin.com'    // .com (".austin" is longer than three characters, so it is skipped)
'example.aus.com/howdy' // .aus.com
'example.co.uk/howdy'   // .co.uk

Some people might need something a bit cleverer, but this was enough for me with my particular dataset.

Edit

I've realised there are actually quite a few second-level domains which are longer than 3 characters (and allowed). So, again for simplicity, I just removed the character counting element of my regex:

'example.aus.com'.match(/\.\w*\b/g).join('')

The OP asked to exclude any lower-level domains, e.g. lowerlevel.domain.co.uk using your example gives '.domain.co.uk'. It also doesn't handle URLs starting with http:// or https:// – Davos, Dec 27 '17 at 09:56

score 0 · Answer 5 · answered Aug 29 '15 at 21:40


Since TLDs now include names with more than three characters, like .wang and .travel, here's a regex that handles these new TLDs:

([^.\s]+\.[^.\s]+)$

Strategy: starting at the end of the string, look for one or more characters that aren't periods or whitespace, followed by a single period, followed by one or more characters that aren't periods or whitespace.

http://regexr.com/3bmb3
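
A minimal JavaScript sketch of this strategy follows; the helper name `lastTwoLabels` and the sample hostnames are illustrative assumptions. It keeps the last two dot-separated labels regardless of their length, which also means it cannot know when a suffix spans two labels.

```javascript
// The pattern from the answer, unchanged.
const lastTwo = /([^.\s]+\.[^.\s]+)$/;

// Hypothetical helper: returns the last two labels, or null when there is no dot.
function lastTwoLabels(host) {
  const m = lastTwo.exec(host);
  return m ? m[1] : null;
}

console.log(lastTwoLabels('www.example.travel')); // 'example.travel'
console.log(lastTwoLabels('a.b.wang'));           // 'b.wang'
console.log(lastTwoLabels('www.google.co.uk'));   // 'co.uk' (two-part suffix is not recognised)
```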

answered Aug 29 '15 at 21:40

twink_ml


Unfortunately it doesn't work on two-part TLDs such as https://www.google.co.uk/ – Garrulinae Jun 19 '23 at 02:53

score 0 · Answer 6 · answered Apr 09 '18 at 19:03

With capturing groups you can achieve some magic.

For example, consider the following JavaScript:

let hostname = 'test.something.else.be';
let domain = hostname.replace(/^.+\.([^\.]+\.[^\.]+)$/, '$1');

document.write(domain);

This will result in a string containing 'else.be'. This is because the regex itself matches the complete string and the capturing group is mapped to $1. So it replaces the complete string 'test.something.else.be' with '$1', which is 'else.be'.

The regex isn't pretty and can probably be made more dynamic with things like {3} for defining how many levels deep you want to look for subdomains, but this is just an illustration.
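
One possible way to make the number of levels configurable, as hinted above, is to build the pattern with a `{n}` quantifier. This is only a sketch: the function name `lastLabels` and its parameter `n` are assumptions introduced here, and it uses `match` rather than `replace`.

```javascript
// Hypothetical helper: keep the last n dot-separated labels of a hostname.
function lastLabels(hostname, n) {
  if (!Number.isInteger(n) || n < 1) {
    throw new RangeError('n must be a positive integer');
  }
  // (n - 1) labels each followed by a dot, then one final label at the end.
  const re = new RegExp(`(?:[^.]+\\.){${n - 1}}[^.]+$`);
  const m = hostname.match(re);
  return m ? m[0] : null; // null when the hostname has fewer than n labels
}

console.log(lastLabels('test.something.else.be', 2)); // 'else.be'
console.log(lastLabels('test.something.else.be', 3)); // 'something.else.be'
```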

score 0 · Answer 7 · answered Jul 30 '18 at 11:03

If you want to match only specific top-level domains, you can write a regular expression like this:

[RegularExpression("^(https?:\\/\\/)?(([\\w]+)?\\.?(\\w+\\.((za|zappos|zara|zero|zip|zippo|zm|zone|zuerich|zw))))\\/?$", ErrorMessage = "Is not a valid fully-qualified URL.")]

You can also add more TLDs from this list:

https://www.icann.org/resources/pages/tlds-2012-02-25-en
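
Rather than typing the alternation by hand, the pattern can be generated from a list. The sketch below is in JavaScript rather than the C# attribute shown above, and it is an assumption-laden illustration: `TLDS` is just the sample from the regex above, and `buildDomainRegex` is a hypothetical helper that only captures one label plus the TLD.

```javascript
// Sample TLDs copied from the regex above; a real list would be much longer.
const TLDS = ['za', 'zappos', 'zara', 'zero', 'zip', 'zippo', 'zm', 'zone', 'zuerich', 'zw'];

// Hypothetical helper: build a regex capturing "<label>.<tld>" at the end of a hostname.
function buildDomainRegex(tlds) {
  // Escape regex metacharacters in case an entry contains any.
  const alternation = tlds
    .map((t) => t.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('|');
  return new RegExp(`([\\w-]+\\.(?:${alternation}))$`, 'i');
}

const re = buildDomainRegex(TLDS);
const m = re.exec('shop.example.zone');
console.log(m ? m[1] : null); // 'example.zone'
```

Because the pattern is anchored with `$`, a short entry such as `zip` does not prevent a longer one such as `zippo` from matching.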

score 0 · Answer 8 · answered Sep 30 '20 at 14:48

The following regex matches a domain with root and tld extractions (named capture groups) from a url or domain string:

(?:\w+:\/{2})?(?<cs_domain>(?<cs_domain_sub>(?:[\w\-]+\.)*?)(?<cs_domain_root>[\w\-]+(?<cs_domain_tld>(?:\.\w{2})?(?:\.\w{2,3}|\.xn-+\w+|\.site|\.club))))\|

It's hard to say if it is perfect, but it works on all the test data sets that I have put it against including .club, .xn-1234, .co.uk, and other odd endings. And it does it in 5556 steps against 40k chars of logs, so the efficiency seems reasonable too.

score -3 · Answer 9 · answered Mar 16 '17 at 04:35


If you need to be more specific:

/\.(?:nl|se|no|es|milru|fr|es|uk|ca|de|jp|au|us|ch|it|io|org|com|net|int|edu|mil|arpa)/

Based on http://www.seobythesea.com/2006/01/googles-most-popular-and-least-popular-top-level-domains/

answered Mar 16 '17 at 04:35

Dorian


References a very old article (10 years old at time of writing). There are dozens more TLDs now. This could mislead readers to think this is a complete list – Digs Jul 07 '17 at 10:29
@Digs You are right, I'm still looking for the full list of TLDs – Dorian Jul 07 '17 at 14:26

That's a nearly impossible task with new generic TLDs coming out all the time. .christmas, .london, .bar, .bank? See https://newgtlds.icann.org/en/announcements-and-media/case-studies Probably best to use one of the regexes mentioned in the other answers (e.g. `\.[a-z]{2,3}(\.[a-z]{2,3})?`) – Digs Jul 07 '17 at 18:11
