Extract domain from string

Question

How can I extract the domain from a string in JavaScript? For each string in the list below the output should be example.com, except for the last two, where the output should be null, undefined, or an empty string. The list doubles as the set of test cases to verify a solution.

var urls = [
    "case 1 http://example.com",
    "case 2 https://example.com",
    "case 3 custume_scheme://example.com",
    "case 4 www.example.com",
    "case 5 www.example.com/staffToIgnore",
    "case 6 www.example.com?=key=leyToIgnore",
    "case 7 www.example.com ignore all those too",
    "case 8 www.example.com www.example2.com",
    "case 9 example.com need to return null",
    "case 10 wwwa.example.com need to return null",
];

The extension of the domain could be something other than .com; it can be anything of the form [a-z0-9].
Subdomains are allowed.

There have been several similar questions, but none of them is as specific, and none of their answers pass all the cases here.

You need to better define what your URLs might look like before we can help you with the regex. Would all your domains end with .com? Would all your domains start with either www (what about subdomains?) or ://? There doesn't seem to be a true domain signature in your examples that we could write a regex for — omerts, Mar 15 '17 at 18:46
Hate to say it... but what have you tried? Show your code. :) — Matt Johnson-Pint, Mar 15 '17 at 18:55
@omerts I added qualifications, please tell me if this enough. — Ilya Gazman, Mar 15 '17 at 18:56
I really enjoyed using this module: https://www.npmjs.com/package/url Way easier to understand than a cryptic regex. — Scarysize, Mar 15 '17 at 18:56
You should post the code that you've tried so that we can help you fix it. — RJM, Mar 15 '17 at 18:58
Well, there are a lot of tutorials and tools to learn regex. Here are some that can help you. [RegexOne](https://regexone.com/), [Regular-Expressions.info](http://www.regular-expressions.info/tutorial.html), [Regex 101](https://regex101.com/), [Regexper](https://regexper.com/). In general though, asking for code without showing some effort is not what StackOverflow is about. (Lots of people get *paid* to write code based on a set of requirements...) — Matt Johnson-Pint, Mar 15 '17 at 19:00
What is difference between case 1 and case 9? Does the domain need a protocol, subdomain or both but not neither? Also, what's the difference between case 4 and 10? — Donnie D'Amato, Mar 15 '17 at 19:31
@MattJohnson I followed the tutorial and came up with this by myself: "(://|www\\.)([a-zA-Z\.]+)". You just taught me how to fish! I've been avoiding regex for years... Thanks, man — Ilya Gazman, Mar 17 '17 at 21:56

score 1 · Accepted Answer · edited May 23 '17 at 12:25

You can do this easily with Lodash. Assuming you want to discard every string that contains a malformed domain, I set up this plunker, which tells you which strings contain a domain.

var urls = [
        "case 1 http://example.com",
        "case 2 https://example.com",
        "case 3 custume_scheme://example.com",
        "case 4 www.example.com",
        "case 5 www.example.com/staffToIgnore",
        "case 6 www.example.com?=key=leyToIgnore",
        "case 7 www.example.com ignore all those too",
        "case 8 www.example.com www.example2.com",
        "case 9 example.com need to return null",
        "case 10 wwwa.example.com need to return null",
];

_.forEach(urls, function(currentS){
  //If currentS is indeed a string
  if(_.isString(currentS)){
     //If it is a url
     if(isUrl(currentS)){
       $('#urls_list' ).append('<li>'+  currentS.match(/([a-zA-Z])*\.([a-zA-Z]){0,3}(?=\s|\?|\/|$)/)[0] +'</li>');
     } else {
       $('#urls_list' ).append('<li> null </li>');
     }
  }
});

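where isUrl is:

//Returns true if the string s contains a URL marker ('www.' or '://'), else false
function isUrl(s){
  //_.includes(collection, value, [fromIndex]) tests one value per call;
  //a third argument such as '.com' would be read as a start index, not as a second value
  return _.includes(s, 'www.') || _.includes(s, '://');
}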

Output:

currentS.match(/([a-zA-Z])*\.([a-zA-Z]){0,3}(?=\s|\?|\/|$)/)[0] returns only the part you are looking for:

([a-zA-Z])*\. : the domain name and the dot after it (example.)
([a-zA-Z]){0,3} : the extension, up to three letters (com)
(?=\s|\?|\/|$) : a lookahead that requires whitespace, ?, /, or the end of the string to follow
[0] : takes the full matched text

Anyway, if I were you I would take a look at validator, an excellent library for checking strings. Its isURL method tells you whether a string is a URL. I was not able to import it into the plunker, so I wrote a custom function instead.

You can read about _.includes here and about _.forEach here.

If you want to use a Regular expression instead of the second _.forEach and _.includes take a look at this answer by @Daveo.
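
The same check-then-match flow can also be written without Lodash or jQuery. The following is a minimal sketch, not the plunker's code: the names hasUrlMarker and domainOrNull are made up for illustration, and the pattern keeps this answer's limits (letters only, an extension of at most three letters).

```javascript
// Hypothetical helper names. The regex is the one from this answer,
// written without the capture groups, which do not affect the full match.
const DOMAIN_PATTERN = /[a-zA-Z]*\.[a-zA-Z]{0,3}(?=\s|\?|\/|$)/;

// A string counts as URL-like only if it contains "www." or "://".
function hasUrlMarker(s) {
  return s.includes("www.") || s.includes("://");
}

// Returns the first domain-looking token, or null when there is no marker.
function domainOrNull(s) {
  if (typeof s !== "string" || !hasUrlMarker(s)) return null;
  const match = s.match(DOMAIN_PATTERN);
  return match ? match[0] : null;
}

const samples = [
  "case 3 custume_scheme://example.com",
  "case 5 www.example.com/staffToIgnore",
  "case 9 example.com need to return null",
  "case 10 wwwa.example.com need to return null",
];
samples.forEach((s) => console.log(s, "->", domainOrNull(s)));
```

Returning null instead of calling [0] on the match result directly avoids a TypeError when a string has a marker but nothing the pattern accepts.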

number 3 shouldn't be null — Ilya Gazman, Mar 15 '17 at 21:21

score 0 · Answer 2 · answered Mar 15 '17 at 19:55

Found a non regex solution:

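function domainFromUrl(url) {
    // Prefer a "www." prefix; otherwise fall back to the "://" scheme separator.
    var index = url.indexOf("www.");
    if (index != -1) {
        url = url.slice(index + 4);
    }
    else {
        index = url.indexOf("://");
        if (index != -1) {
            url = url.slice(index + 3);
        }
        else {
            return null;
        }
    }
    // Keep everything up to the first space, slash, or question mark.
    return url.split(/[ /?]/)[0];
}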

Usage

var urls = [
    "case 1 http://example.com",
    "case 2 https://example.com",
    "case 3 custume_scheme://example.com",
    "case 4 www.example.com",
    "case 5 www.example.com/staffToIgnore",
    "case 6 www.example.com?=key=leyToIgnore",
    "case 7 www.example.com ignore all those too",
    "case 8 www.example.com www.example2.com",
    "case 9 example.com need to return null",
    "case 10 wwwa.example.com need to return null"
];

for (var i in urls) {
    console.log(i + ": " + domainFromUrl(urls[i]));
}

output

0: example.com
1: example.com
2: example.com
3: example.com
4: example.com
5: example.com
6: example.com
7: example.com
8: null
9: null
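
The www. / :// rule used by domainFromUrl can also be expressed as one regular expression. This is only a sketch under stated assumptions: extractDomain is a hypothetical name, host labels are assumed to consist of letters, digits, and hyphens, and a www. that directly follows :// is dropped as well.

```javascript
// Hypothetical helper. The marker is either "://" (optionally followed by "www.")
// or a "www." that starts at a word boundary; the captured group is the host,
// made of at least two dot-separated labels.
const MARKED_HOST = /(?::\/\/(?:www\.)?|\bwww\.)([a-z0-9-]+(?:\.[a-z0-9-]+)+)/i;

function extractDomain(text) {
  const match = MARKED_HOST.exec(text);
  return match ? match[1] : null;
}

[
  "case 2 https://example.com",
  "case 6 www.example.com?=key=leyToIgnore",
  "case 8 www.example.com www.example2.com",
  "case 10 wwwa.example.com need to return null",
].forEach((s) => console.log(s, "->", extractDomain(s)));
```

Because the host is captured up to the first character that is not a letter, digit, hyphen, or dot, paths, query strings, and trailing words are left out without a separate split step.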

Patrick W. McMahon · Answer 3 · 2017-03-15T20:20:51.383

Use this regex:

/(?:[\w-]+\.)+[\w-]+/

Here is a regex demo!

Sampling:

var regex = /(?:[\w-]+\.)+[\w-]+/;
regex.exec("google.com");                   // ["google.com"]
regex.exec("www.google.com");               // ["www.google.com"]
regex.exec("ftp://ftp.google.com");         // ["ftp.google.com"]
regex.exec("http://www.google.com");        // ["www.google.com"]
regex.exec("http://www.google.com/");       // ["www.google.com"]
regex.exec("https://www.google.com/");      // ["www.google.com"]
regex.exec("https://www.google.com.sg/");   // ["www.google.com.sg"]

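If you want the leading 'www' label removed, try this and read capture group 1. It drops everything up to the first dot, so it expects the whole string to be a URL or host name with at least two dots:

/^[^\.]+\.(.+\..+)$/

Sampling:

var regex = /^[^\.]+\.(.+\..+)$/;
regex.exec("google.com");                      // null (nothing left to capture after the first dot)
regex.exec("www.google.com")[1];               // "google.com"
regex.exec("ftp://ftp.google.com")[1];         // "google.com"
regex.exec("http://www.google.com")[1];        // "google.com"
regex.exec("http://www.google.com/")[1];       // "google.com/" (the trailing slash is kept)
regex.exec("https://www.google.com/")[1];      // "google.com/"
regex.exec("https://www.google.com.sg/")[1];   // "google.com.sg/"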

Learn regex. It will save you time and lines of code.

PS. I'm bad at regex myself; I found this one with a little thing called Google. You don't really need to know much about regex to use it. With so many great regex examples around, you will find what you need every time.
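
To see how the first pattern behaves on the strings from the question, one option is to run it over them and strip a leading www. afterwards. A rough sketch follows; firstHost is a hypothetical helper name. Note that the pattern has no notion of a required www. or :// marker, so cases 9 and 10 still yield a host name instead of null.

```javascript
// The first pattern from this answer: dot-separated labels of word characters and hyphens.
const HOST_PATTERN = /(?:[\w-]+\.)+[\w-]+/;

// Hypothetical helper: first host-like token in the text, without a leading "www.".
function firstHost(text) {
  const match = HOST_PATTERN.exec(text);
  return match ? match[0].replace(/^www\./, "") : null;
}

console.log(firstHost("case 5 www.example.com/staffToIgnore"));         // example.com
console.log(firstHost("case 9 example.com need to return null"));       // example.com (not null)
console.log(firstHost("case 10 wwwa.example.com need to return null")); // wwwa.example.com
console.log(firstHost("no host here"));                                 // null
```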

the www should be removed. Also please provide an output for my input strings — Ilya Gazman, Mar 15 '17 at 20:03

score 0 · Answer 4 · answered Mar 15 '17 at 20:12

Found this answer somewhere on StackOverflow:

const getDomain = (url) => {
    var dom = "", v, step = 0;
    for (var i = 0, l = url.length; i < l; i++) {
        v = url[i];
        if (step == 0) {
            //First, skip 0 to 5 characters ending in ':' (ex: 'https://')
            if (i > 5) { i = -1; step = 1; }          //No scheme found: restart from the first character
            else if (v == ':') { i += 2; step = 1; }  //Jump over the '//' after the ':'
        } else if (step == 1) {
            //Skip 0 or 4 characters 'www.'
            //(Note: Doesn't work with www.com, but that domain isn't claimed anyway.)
            if (v == 'w' && url[i+1] == 'w' && url[i+2] == 'w' && url[i+3] == '.') i += 4;
            dom += url[i];
            step = 2;
        } else if (step == 2) {
            //Stop at subpages, queries, and hashes.
            if (v == '/' || v == '?' || v == '#') break;
            dom += v;
        }
    }
    return dom;
};

It returns the domain stripped of the leading scheme and www. and of any trailing path, query, or hash.
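
When the input is a complete URL with a standard scheme, the built-in URL class (available in browsers and in Node.js) can do the parsing instead of a hand-written loop. A minimal sketch, where hostWithoutWww is a hypothetical name. Unlike getDomain, it returns null for input without a scheme, because the URL constructor rejects such strings.

```javascript
// Hypothetical helper built on the standard URL class.
function hostWithoutWww(url) {
  try {
    // hostname excludes the scheme, port, path, query, and hash.
    return new URL(url).hostname.replace(/^www\./, "");
  } catch (err) {
    // The URL constructor throws a TypeError for strings it cannot parse.
    return null;
  }
}

console.log(hostWithoutWww("https://www.example.com/path?key=value#top")); // example.com
console.log(hostWithoutWww("http://example.com"));                         // example.com
console.log(hostWithoutWww("www.example.com"));                            // null
```

This only fits strings that are URLs in their entirety; sentences such as the question's test cases would need the URL to be located first.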
