12

Currently I can extract the 'domain' from any URL with the following regex:

/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n\?\=]+)/im

However, I'm also getting subdomains, which I want to avoid. For example, given these sites:

  • www.google.com
  • yahoo.com/something
  • freds.meatmarket.co.uk?someparameter
  • josh.meatmarket.co.uk/asldf/asdf

I currently get:

  • google.com
  • yahoo.com
  • freds.meatmarket.co.uk
  • josh.meatmarket.co.uk

For those last two, I would like to exclude the freds and josh subdomain portion and extract only the true domain, which would be just meatmarket.co.uk.

I did find another Stack Overflow question that tries to solve this in PHP, but unfortunately I don't know PHP. Is this translatable to JS? (I'm actually using Google Apps Script, FYI.)

  function topDomainFromURL($url) {
    $url_parts = parse_url($url);
    $domain_parts = explode('.', $url_parts['host']);
    if (strlen(end($domain_parts)) == 2 ) { 
      // ccTLD here, get last three parts
      $top_domain_parts = array_slice($domain_parts, -3);
    } else {
      $top_domain_parts = array_slice($domain_parts, -2);
    }
    $top_domain = implode('.', $top_domain_parts);
    return $top_domain;
  }
MarkII

6 Answers

23

So you need the leftmost label stripped from your result, unless there are only two parts already?

Just post-process the result of your first match with a second regexp that checks that condition:

function domain_from_url(url) {
    var result
    var match
    if (match = url.match(/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n\?\=]+)/im)) {
        result = match[1]
        if (match = result.match(/^[^\.]+\.(.+\..+)$/)) {
            result = match[1]
        }
    }
    return result
}

console.log(domain_from_url("www.google.com"))
console.log(domain_from_url("yahoo.com/something"))
console.log(domain_from_url("freds.meatmarket.co.uk?someparameter"))
console.log(domain_from_url("josh.meatmarket.co.uk/asldf/asdf"))

// google.com
// yahoo.com
// meatmarket.co.uk
// meatmarket.co.uk
Oleg V. Volkov
  • This looks to be the best solution so far. I think I can mod to exclude bad domains given too such as `something/something/somthing` – MarkII Jan 15 '16 at 19:42
  • @MarkII, yeah, you can string pretty much any other checks you want on top of that. I've also added `^` anchor I forgot in front of my regexp. – Oleg V. Volkov Jan 15 '16 at 19:47
  • This doesn't work for some valid URL parameters e.g. `http://freds.meatmarket.co.uk?someparameter?ordernumber=1234&email=break@regex.com` the subgroup matched is `regex.com` because it is matching on an @ – Davos Jul 17 '17 at 03:50
  • @Davos, this particular solution doesn't touch domain-extracting regexp because OP wanted help with another problem, but yes, this could be fixed as well. – Oleg V. Volkov Jul 17 '17 at 11:25
  • Fair enough, it works for the OPs question, and I just realised it was supplied by the OP in the question, not you. I think that the regex was probably written to account for URLs of the form `http://user@domain.com` and not expect `@` to appear anywhere else. – Davos Jul 18 '17 at 04:07
  • ^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\d?\.)?([^:\/\n\?\=]+) added \d? – dobeerman Feb 15 '18 at 12:17
  • This does not work as advertised. `readDomain('https://www.ebay.com/sh/ord') -> "ebay.com"` and `readDomain('https://www.ebay.co.uk/sh/ord') -> "co.uk"` – GEMI Mar 15 '19 at 08:53
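
The `co.uk` result in the last comment comes from always dropping the first label when three or more remain. One way around it, as a rough sketch only, is to check the last two labels against a list of known multi-label suffixes before deciding how many labels to keep. The names `MULTI_LABEL_SUFFIXES`, `rootDomainFromHost`, and `domainFromUrl` are assumptions made for this illustration, and the five-entry list is a stand-in for something like the full Public Suffix List. The extraction regex is the one from the question, unchanged.

```javascript
// Hypothetical, deliberately tiny list of two-label public suffixes.
// A real implementation would consult the full Public Suffix List.
const MULTI_LABEL_SUFFIXES = ["co.uk", "org.uk", "me.uk", "com.au", "co.jp"];

// Reduce a host name to its registrable part.
function rootDomainFromHost(host) {
    const labels = host.toLowerCase().split(".");
    if (labels.length <= 2) {
        return labels.join(".");
    }
    // Keep three labels when the last two form a known suffix, else two.
    const lastTwo = labels.slice(-2).join(".");
    const keep = MULTI_LABEL_SUFFIXES.includes(lastTwo) ? 3 : 2;
    return labels.slice(-keep).join(".");
}

// Extract the host with the regex from the question, then reduce it.
function domainFromUrl(url) {
    const match = url.match(/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n\?\=]+)/im);
    return match ? rootDomainFromHost(match[1]) : undefined;
}

console.log(domainFromUrl("https://www.ebay.co.uk/sh/ord"));        // ebay.co.uk
console.log(domainFromUrl("https://www.ebay.com/sh/ord"));          // ebay.com
console.log(domainFromUrl("freds.meatmarket.co.uk?someparameter")); // meatmarket.co.uk
```

Any suffix missing from the list falls back to the two-label rule, so the quality of the result depends entirely on how complete the list is. The `@` issue raised in the earlier comment is not addressed here, since the extraction regex is untouched.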
1

Try this:

https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{2,256}\.([a-z]{2,6}){1}
osanger
1

Try replacing the literal www with a pattern that matches any first label:

/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:[^.]+\.)?([^:\/\n\?\=]+)/im

EDIT: If you absolutely want to keep the www in your regex, you could try this one:

/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?(?:[^.]+\.)?([^:\/\n\?\=]+)/im

1111161171159459134
  • This one was interesting. I tried with URLs and just domains themselves. On domains, I tried ones like 'data.example.co.uk', 'example.co.uk', and 'example.com'. I tried your first rule without the www preservation. I got close but inconsistent results. I was trying to just get the root domain without subdomains. Try your example with Javascript's `.match(regexp)` API and you'll see the inconsistent results. You're on to something here -- just needs a little more work. – Volomike May 02 '22 at 02:14
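
The inconsistency described in that comment can be reproduced with a short check. This is only a sketch: the helper name `firstGroup` and the variable names are made up for the illustration, while the pattern is the first one from this answer, unchanged.

```javascript
// First pattern from this answer, copied as-is.
const pattern = /^(?:https?:\/\/)?(?:[^@\n]+@)?(?:[^.]+\.)?([^:\/\n\?\=]+)/im;

// Return the first capture group, or null when nothing matches.
function firstGroup(input) {
    const match = input.match(pattern);
    return match ? match[1] : null;
}

const samples = ["data.example.co.uk", "example.co.uk", "example.com"];
const results = samples.map(firstGroup);

samples.forEach((sample, i) => console.log(sample, "->", results[i]));
// data.example.co.uk -> example.co.uk
// example.co.uk -> co.uk
// example.com -> com
```

Because the optional `(?:[^.]+\.)?` group always consumes the first label when one is present, inputs that are already root domains lose their name and only the suffix is captured.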
1
export const extractHostname = url => {
    let hostname;

    // find & remove protocol (http, ftp, etc.) and get hostname
    if (url.indexOf("://") > -1)
    {
        hostname = url.split('/')[2];
    }
    else
    {
        hostname = url.split('/')[0];
    }

    // find & remove port number
    hostname = hostname.split(':')[0];

    // find & remove "?"
    hostname = hostname.split('?')[0];

    return hostname;
};

export const extractRootDomain = url => {
    let domain = extractHostname(url),
        splitArr = domain.split('.'),
        arrLen = splitArr.length;

    // extracting the root domain here
    // if there is a subdomain
    if (arrLen > 2)
    {
        domain = splitArr[arrLen - 2] + '.' + splitArr[arrLen - 1];

        // check to see if it's using a Country Code Top Level Domain (ccTLD) (i.e. ".me.uk")
        if (splitArr[arrLen - 2].length === 2 && splitArr[arrLen - 1].length === 2)
        {
            // this is using a ccTLD
            domain = splitArr[arrLen - 3] + '.' + domain;
        }
    }

    return domain;
};
Kanan Farzali
0

This is what I've come up with. I don't know how to combine the two match rules into a single regexp, however. This routine won't properly handle malformed domains like example..com. It does, however, account for TLDs of the form .xx, .xx.xx, .xxx, or longer (4 or more characters). It works on bare domain names or entire URLs, and the URLs don't need an http or https scheme; it could be ftp, chrome, and others.

function getRootDomain(s){
  var sResult = ''
  try {
    sResult = s.match(/^(?:.*\:\/?\/)?(?<domain>[\w\-\.]*)/i).groups.domain
      .match(/(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))$/).groups.root;
  } catch(ignore) {}
  return sResult;
}

So basically, the first match strips out anything up to and including the ://, if that exists, or a :/ with a single slash, if that exists. Next, it matches a run of word characters (\w) plus the dash and period that you'd see in domains, and labels this as a named capture group called domain. Because a colon is not in that character class, the domain match stops before a port such as :8080. If given an empty string, it just returns an empty string back.

From there, we do a second pass on that result. Instead of matching left-to-right from the ^ anchor, we anchor on the ending $ symbol and allow only these endings: .xx.xx, .xx, or .xxx and longer (such as 4+ character TLDs), where x is a word character. Note the {3,}: it means 3 or more of something, which is how TLDs of 3 or more characters are handled. In front of that ending, we allow a single label made of word characters and dashes (but not periods).
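
To see what each pass contributes, the two matches can be run separately and both values returned. This is a sketch under assumptions, not a replacement for the function above: the name `splitDomainPasses` is made up, the two patterns are the ones from `getRootDomain`, and null checks stand in for the try/catch.

```javascript
// Run the two passes separately and return both intermediate values.
function splitDomainPasses(s) {
    // Pass 1: skip an optional scheme, then capture word characters,
    // dashes, and periods. The capture stops at ':', '/', '?', and so on.
    const hostMatch = s.match(/^(?:.*\:\/?\/)?(?<domain>[\w\-\.]*)/i);
    const domain = hostMatch ? hostMatch.groups.domain : "";

    // Pass 2: anchored at the end of the host, keep one label plus a
    // TLD shaped like .xxx (or longer), .xx, or .xx.xx.
    const rootMatch = domain.match(/(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))$/);
    const root = rootMatch ? rootMatch.groups.root : "";

    return { domain, root };
}

console.log(splitDomainPasses("https://josh.meatmarket.co.uk:8080/asldf"));
// domain: josh.meatmarket.co.uk, root: meatmarket.co.uk
console.log(splitDomainPasses("www.google.com"));
// domain: www.google.com, root: google.com
```

One limitation worth knowing: the .xx.xx alternative only covers suffixes with two 2-letter parts, so `example.com.au` yields `com.au` as the root rather than `example.com.au`.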

EDIT: Since posting this answer, I learned how to combine the full domain and the root part into a single RegExp. I'll keep the above for cases where you want both values; the function only returns the root, but a quick edit could make it return both the full domain and the root domain. If you just want the root alone, you could use this solution:

function getRootDomain(s){
  var sResult = ''
  try {
    sResult = s.match(/^(?:.*?:\/\/)?.*?(?<root>[\w\-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/).groups.root;
  } catch(ignore) {}
  return sResult;
}
Volomike
0

This solution works for me. I also use it to validate input: if the pattern doesn't match, the string doesn't look like a URL.

^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+\.+[^:\/?\n]+)


Thanks to @anubhava