19

How can I extract only top-level and second-level domain from a URL using regex? I want to skip all lower level domains. Any ideas?

mel

9 Answers

19

Here's my idea:

Match up to three dot-separated labels, counting back from the end of the string (anchored with $).

The last label is optional, which allows for two-part suffixes such as .com.au or .co.nz.

The last and second-to-last labels are limited to 2-3 characters, so that they are not confused with a second-level domain name.


Regex:

[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$


Demonstration:

Regex101 Example
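
To make the behaviour concrete, here is a small JavaScript sketch that applies the pattern above to bare hostnames. The helper name extractDomain is hypothetical, and the sample hostnames are made up for illustration. Note the last two calls: a TLD longer than three characters is not matched at all, and a short second-level name such as "abc" is mistaken for part of the suffix.

```javascript
// Pattern copied from the answer above; the helper name is an assumption.
const DOMAIN_RE = /[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/;

function extractDomain(hostname) {
  const match = hostname.match(DOMAIN_RE);
  return match ? match[0] : null;
}

console.log(extractDomain('www.example.com'));     // example.com
console.log(extractDomain('shop.example.com.au')); // example.com.au
console.log(extractDomain('blog.example.travel')); // null (TLD longer than 3 chars)
console.log(extractDomain('mail.abc.com'));        // mail.abc.com (false positive)
```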

Vasili Syrakis
17

Updated 2019

This is an old question, and the problem has become a lot more complicated as new vanity TLDs and more second-level ccTLD suffixes (e.g. .co.uk, .org.uk) are added. So much so that a regular expression is almost guaranteed to return false positives or negatives.

The only way to reliably get the primary host is to check against a data source that knows about these suffixes, like the Public Suffix List.

There are several open-source libraries out there that you can use, like psl, or you can write your own.

Usage for psl is quite intuitive. From their docs:

var psl = require('psl');

// Parse domain without subdomain
var parsed = psl.parse('google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // null

// Parse domain with subdomain
var parsed = psl.parse('www.google.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'google'
console.log(parsed.domain); // 'google.com'
console.log(parsed.subdomain); // 'www'

// Parse domain with nested subdomains
var parsed = psl.parse('a.b.c.d.foo.com');
console.log(parsed.tld); // 'com'
console.log(parsed.sld); // 'foo'
console.log(parsed.domain); // 'foo.com'
console.log(parsed.subdomain); // 'a.b.c.d'
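
If you go the "write your own" route instead, the core idea can be sketched roughly as below. This is only an illustration under loud assumptions: SAMPLE_SUFFIXES is a tiny hand-picked set rather than the real Public Suffix List, the helper name registrableDomain is hypothetical, and the wildcard and exception rules of the real list are ignored.

```javascript
// Tiny hand-picked sample, for illustration only. The real Public Suffix List
// has thousands of entries plus wildcard and exception rules (not handled here).
const SAMPLE_SUFFIXES = new Set(['com', 'org', 'uk', 'co.uk', 'org.uk', 'au', 'com.au']);

// Hypothetical helper: returns the known suffix plus one label to its left,
// or null if no suffix matches or the hostname is itself a suffix.
function registrableDomain(hostname, suffixes = SAMPLE_SUFFIXES) {
  const labels = hostname.toLowerCase().split('.');
  // Walk from the longest candidate to the shortest so 'co.uk' wins over 'uk'.
  for (let i = 0; i < labels.length; i++) {
    const candidate = labels.slice(i).join('.');
    if (suffixes.has(candidate)) {
      return i === 0 ? null : labels.slice(i - 1).join('.');
    }
  }
  return null;
}

console.log(registrableDomain('a.b.c.d.foo.com')); // foo.com
console.log(registrableDomain('www.bbc.co.uk'));   // bbc.co.uk
console.log(registrableDomain('co.uk'));           // null (a suffix on its own)
console.log(registrableDomain('example.dev'));     // null (not in the sample set)
```

In practice a maintained library is the safer choice, since the list changes regularly.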

Old answer

You could use this:

(\w+\.\w+)$

Without more details (a sample file, the language you're using), it's hard to discern exactly whether this will work.

Example: http://regex101.com/r/wD8eP2
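
For what it's worth, here is a rough JavaScript sketch of how this pattern behaves on full URLs. It assumes the standard URL class (available in browsers and Node.js) is used to isolate the hostname first; the helper name lastTwoLabels and the sample URLs are made up for illustration. The second and third calls show where it goes wrong: two-part suffixes and hyphenated names.

```javascript
// Pattern copied from the old answer; the helper name is an assumption.
const LAST_TWO_LABELS = /(\w+\.\w+)$/;

function lastTwoLabels(urlString) {
  // Strip scheme, path and query first so that $ anchors at the end of the host.
  const { hostname } = new URL(urlString);
  const match = hostname.match(LAST_TWO_LABELS);
  return match ? match[1] : null;
}

console.log(lastTwoLabels('https://www.google.com/search?q=regex')); // google.com
console.log(lastTwoLabels('http://www.bbc.co.uk/news'));             // co.uk
console.log(lastTwoLabels('https://www.my-site.com/'));              // site.com (\w excludes "-")
```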

Community
brandonscript
4

Also, you can likely do that with some expression similar to,

^(?:https?:\/\/)(?:w{3}\.)?.*?([^.\r\n\/]+\.)([^.\r\n\/]+\.[^.\r\n\/]{2,6}(?:\.[^.\r\n\/]{2,6})?).*$

and add as many capturing groups as you need to capture the components of a URL.

Demo


If you wish to simplify, modify, or explore the expression, it is explained in the top-right panel of regex101.com. The linked demo also shows how it matches against some sample inputs.


RegEx Circuit

jex.im visualizes regular expressions:

[Image: jex.im diagram of the expression]

Community
Emma
  • Unfortunately it didn't match `https://stackoverflow.com/questions/21173734/extracting-top-level-and-second-level-domain-from-a-url-using-regex`. – USauter Feb 18 '22 at 10:10
3

For anyone using JavaScript and wanting a simple way to extract the top and second level domains, I ended up doing this:

'example.aus.com'.match(/\.\w{2,3}\b/g).join('')

This matches every period that is followed by two or three word characters and then a word boundary.

Here are some example outputs:

'example.aus.com'       // .aus.com
'example.austin.com'    // .com (".austin" is too long for \w{2,3})
'example.aus.com/howdy' // .aus.com
'example.co.uk/howdy'   // .co.uk

Some people might need something a bit cleverer, but this was enough for me with my particular dataset.

Edit

I've realised there are actually quite a few second-level domains which are longer than 3 characters (and allowed). So, again for simplicity, I just removed the character counting element of my regex:

'example.aus.com'.match(/\.\w*\b/g).join('')
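
One practical caveat, sketched below: match() returns null when nothing matches, so calling .join() directly would throw on an input without any dots. The wrapper name dottedSuffix is hypothetical. The last two calls show that the pattern keeps every dotted part after the first label, and that dots in the path are picked up as well.

```javascript
// Same pattern as the edited one-liner above, wrapped with a null guard.
// The helper name is an assumption for illustration.
function dottedSuffix(input) {
  const parts = input.match(/\.\w*\b/g);
  return parts ? parts.join('') : '';
}

console.log(dottedSuffix('example.aus.com'));             // .aus.com
console.log(dottedSuffix('localhost'));                   // (empty string)
console.log(dottedSuffix('lowerlevel.domain.co.uk'));     // .domain.co.uk
console.log(dottedSuffix('example.aus.com/howdy.html'));  // .aus.com.html
```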
Hamza Anis
shennan
  • The OP asked to exclude any lower-level domains; e.g. for lowerlevel.domain.co.uk your example gives '.domain.co.uk'. It also doesn't handle URLs starting with http:// or https:// – Davos Dec 27 '17 at 09:56
0

Since TLDs now include names with more than three characters, like .wang and .travel, here's a regex that handles these newer TLDs:

([^.\s]+\.[^.\s]+)$

Strategy: starting at the end of the string, look for one or more characters that aren't periods or whitespace, followed by a single period, followed by one or more characters that aren't periods or whitespace.

http://regexr.com/3bmb3

twink_ml
0

With capturing groups you can achieve some magic.

For example, consider the following JavaScript:

let hostname = 'test.something.else.be';
let domain = hostname.replace(/^.+\.([^\.]+\.[^\.]+)$/, '$1');

document.write(domain);

This will result in a string containing 'else.be'. This is because the regex matches the complete string and the capturing group is mapped to $1. So it replaces the complete string 'test.something.else.be' with '$1', which is 'else.be'.

The regex isn't pretty and can probably be made more dynamic with things like {3} for defining how many levels deep you want to look for subdomains, but this is just an illustration.
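
As a rough sketch of that "more dynamic" idea, the number of labels to keep can be passed in and used to build the quantifier. The helper name lastLabels and its count parameter are hypothetical; like the original, it simply counts labels and knows nothing about two-part suffixes such as .co.uk.

```javascript
// Hypothetical helper: keep the last `count` dot-separated labels of a hostname.
function lastLabels(hostname, count) {
  // (count - 1) "label." groups followed by one final label, anchored at the end.
  const re = new RegExp(`(?:[^.]+\\.){${count - 1}}[^.]+$`);
  const match = hostname.match(re);
  return match ? match[0] : null;
}

console.log(lastLabels('test.something.else.be', 2)); // else.be
console.log(lastLabels('test.something.else.be', 3)); // something.else.be
console.log(lastLabels('else.be', 3));                // null (fewer than 3 labels)
```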

robbe clerckx
0

If you want to allow only specific top-level domain names, you can write a regular expression like this:

[RegularExpression("^(https?:\\/\\/)?(([\\w]+)?\\.?(\\w+\\.((za|zappos|zara|zero|zip|zippo|zm|zone|zuerich|zw))))\\/?$", ErrorMessage = "Is not a valid fully-qualified URL.")]

You can add more domain names from this list:

https://www.icann.org/resources/pages/tlds-2012-02-25-en

Sam
0

The following regex matches a domain and extracts the root and TLD (as named capture groups) from a URL or domain string:

(?:\w+:\/{2})?(?<cs_domain>(?<cs_domain_sub>(?:[\w\-]+\.)*?)(?<cs_domain_root>[\w\-]+(?<cs_domain_tld>(?:\.\w{2})?(?:\.\w{2,3}|\.xn-+\w+|\.site|\.club))))\|

It's hard to say if it is perfect, but it works on all the test data sets that I have put it against including .club, .xn-1234, .co.uk, and other odd endings. And it does it in 5556 steps against 40k chars of logs, so the efficiency seems reasonable too.
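
Here is a rough sketch of reading those named groups in JavaScript, which supports the (?<name>...) syntax since ES2018. The helper name parseDomain and the sample inputs are made up. Note that the pattern ends in a literal \|, presumably because the logs in question were pipe-delimited (that is a guess), so the sample strings end with a pipe and a string without one does not match.

```javascript
// Pattern copied from the answer above; the helper name is an assumption.
const DOMAIN_RE = /(?:\w+:\/{2})?(?<cs_domain>(?<cs_domain_sub>(?:[\w\-]+\.)*?)(?<cs_domain_root>[\w\-]+(?<cs_domain_tld>(?:\.\w{2})?(?:\.\w{2,3}|\.xn-+\w+|\.site|\.club))))\|/;

function parseDomain(field) {
  const match = DOMAIN_RE.exec(field);
  if (!match) return null;
  const { cs_domain_sub, cs_domain_root, cs_domain_tld } = match.groups;
  return { sub: cs_domain_sub, root: cs_domain_root, tld: cs_domain_tld };
}

console.log(parseDomain('https://www.example.co.uk|'));
// { sub: 'www.', root: 'example.co.uk', tld: '.co.uk' }
console.log(parseDomain('mail.example.com|'));
// { sub: 'mail.', root: 'example.com', tld: '.com' }
console.log(parseDomain('mail.example.com')); // null (no trailing pipe)
```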

landen99
-3

If you need to be more specific:

/\.(?:nl|se|no|es|ru|fr|uk|ca|de|jp|au|us|ch|it|io|org|com|net|int|edu|mil|arpa)/

Based on http://www.seobythesea.com/2006/01/googles-most-popular-and-least-popular-top-level-domains/

Dorian
  • References a very old article (10 years old at time of writing). There are dozens more TLDs now. This could mislead readers to think this is a complete list – Digs Jul 07 '17 at 10:29
  • @Digs You are right, I'm still looking for the full list of TLDs – Dorian Jul 07 '17 at 14:26
  • 1
    That's a nearly impossible task with new generic TLDs coming out all the time. .christmas, .london, .bar, .bank? See https://newgtlds.icann.org/en/announcements-and-media/case-studies Probably best to use one of the regexes mentioned in the other answers (e.g. `\.[a-z]{2,3}(\.[a-z]{2,3})?`) – Digs Jul 07 '17 at 18:11