Regular Expression - Extract subdomain & domain

Question

I'm trying to write a regular expression (JavaScript/Node.js) that extracts the subdomain and domain part from any given URL. This is what I ended up with:

[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)

Right now, I'm only considering http and https as protocols, and I exclude the "www." portion from the subdomain + domain part of a URL. I checked the expression and it almost works. But here is the issue:

Success

'http://mplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

'http://lplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

Failure

'http://play.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

'http://tplay.google.co.in/sadfask/asdkfals?dk=10'.match(/[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i)

I just use the first element of the result array. I can't understand why "play." and "tplay." don't work. Could anyone please help me with this?

Do "/p" and "/t" have any meaning for the regular expression evaluator?

Is there any other way of extracting sub-domain & domain from any given URL using a regular expression?

Edit -

Example:

https://play.google.com/store/apps/details?id=com.skgames.trafficracer => play.google.com

https://mail.google.com/mail/u/0/#inbox => mail.google.com

anubhava · Answer 1 · 2018-07-13T14:22:08.147

96

Your regex doesn't seem correct. Try this regex:

/^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)/img
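As a rough usage sketch, the host ends up in capture group 1. The helper name extractHost is made up for illustration, and the g and m flags are dropped here on the assumption that one URL is matched per call (with the g flag, String.prototype.match returns only the full matches, without the groups).

```javascript
// Same pattern as in the answer, minus the g and m flags (assumption: one URL per call).
const hostPattern = /^(?:https?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n?]+)/i;

// extractHost is a hypothetical helper name, not part of any library.
function extractHost(url) {
  const match = hostPattern.exec(url);
  return match ? match[1] : null; // group 1 holds subdomain + domain
}

console.log(extractHost("https://play.google.com/store/apps/details?id=com.skgames.trafficracer")); // play.google.com
console.log(extractHost("https://mail.google.com/mail/u/0/#inbox")); // mail.google.com
console.log(extractHost("http://www.google.co.in/imghp")); // google.co.in
```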

answered Sep 06 '14 at 18:21


what if I want only the domain name without the http(s) or www stuff? – kuklei Feb 24 '22 at 16:15
1

That's what you get in captured group #1 in above regex. Check demo. – anubhava Feb 24 '22 at 16:36

score 24 · Answer 2 · edited May 23 '17 at 12:26

24

You are about the one millionth person to try to parse URLs in JavaScript. I'm a little bit surprised you didn't see any of the existing questions on SO dating back years. The last thing you want to do is write yet another broken regexp, with all due respect to those that provided answers to your question.

There are many well documented libraries and approaches to handling this. Google it. The simplest way is to create an a element in memory, assign it an href, and then access its hostname and other properties. See http://tutorialzine.com/2013/07/quick-tip-parse-urls/. If that does not float your boat, then use a library like uri.js.

If you really don't want to use a library, and insist on reinventing the wheel, then at least do something like the following:

function get_domain_from_url(url) {
    // Let the browser parse the URL by assigning it to an in-memory <a> element.
    var a = document.createElement('a');
    a.setAttribute('href', url);
    return a.hostname;
}

Essentially, you are delegating the extraction of the subdomain/domain part of the URL to the browser's URL parsing logic, which is MUCH better than anything you will ever write.
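Where no DOM is available, the same delegation can be sketched with the standard URL constructor, which is a global in current browsers and in Node.js. This swaps in a different technique from the anchor element used above, and hostnameOf is a hypothetical helper name. Note that it keeps a leading "www." in the result.

```javascript
// hostnameOf is a hypothetical helper name used only for this sketch.
function hostnameOf(url) {
  try {
    return new URL(url).hostname; // standard URL parser, no DOM element needed
  } catch (err) {
    return null; // new URL() throws a TypeError on input it cannot parse
  }
}

console.log(hostnameOf("https://mail.google.com/mail/u/0/#inbox")); // mail.google.com
console.log(hostnameOf("not a url")); // null
```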

Also see Parse URL with jquery/ javascript?, Parse URL with Javascript, How do I parse a URL into hostname and path in javascript?, or parse URL with JavaScript or jQuery. How did you miss those? Sorry, I have to vote to close this as a duplicate.

Community


answered Sep 06 '14 at 19:10

4

I don't need libraries. I'm aware of the libraries available for parsing URL. I need a regular expression. The scenario which I'm facing is, I can't go on write javascript code. The function takes regular expression, options and the value on which regex should be acted upon as arguments & returns the first match. – sunilkumarba Sep 06 '14 at 19:18
Great, good luck re-inventing the wheel and maintaining your broken regexps over the coming years. By the way, what do you mean by "can't go on write javascript code"? – Sep 06 '14 at 19:21
I mean, I can't send the javascript code as argument. I need to pass regular expression – sunilkumarba Sep 06 '14 at 19:47
2

Then use this one: `var urlRegex = '^(?!mailto:)(?:(?:http|https|ftp)://)(?:\\S+(?::\\S*)?@)?(?:(?:(?:[1-9]\\d?|1\\d\\d|2[01]\\d|22[0-3])(?:\\.(?:1?\\d{1,2}|2[0-4]\\d|25[0-5])){2}(?:\\.(?:[0-9]\\d?|1\\d\\d|2[0-4]\\d|25[0-4]))|(?:(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)(?:\\.(?:[a-z\\u00a1-\\uffff0-9]+-?)*[a-z\\u00a1-\\uffff0-9]+)*(?:\\.(?:[a-z\\u00a1-\\uffff]{2,})))|localhost)(?::\\d{2,5})?(?:(/|\\?|#)[^\\s]*)?$';` – Sep 06 '14 at 19:49
4

This code isn't used on the browser side. It is used in node.js. Yes, node.js has "url" module which can be used. But, unfortunately I can't use it because of the reason stated earlier. Your regex takes care of most of the URL types that we are going to encounter. Thanks a lot for that. – sunilkumarba Sep 06 '14 at 20:07
For SO this answer is probably a little bit off-topic, but this lib saved my life! Thanks – Peter Merkert Apr 26 '17 at 07:36
Clever trick! Note there's an error in the code with a period at the end of the first line instead of a semicolon. – not_a_generic_user Sep 23 '18 at 14:09

Nicu Surdu · Answer 3 · 2020-09-03T12:21:00.593

11

The same RegExp as in anubhava's answer, only added support for protocol-relative URLs like //google.com:

/^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)/im
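A short sketch of this pattern applied to a protocol-relative URL and to a regular one. The names protocolRelativePattern and hostFromAnyScheme are illustrative only.

```javascript
// Pattern copied from the answer; the scheme and the leading slashes are both optional.
const protocolRelativePattern = /^(?:https?:)?(?:\/\/)?(?:[^@\n]+@)?(?:www\.)?([^:\/\n]+)/im;

// hostFromAnyScheme is a hypothetical helper name.
function hostFromAnyScheme(url) {
  const match = protocolRelativePattern.exec(url);
  return match ? match[1] : null;
}

console.log(hostFromAnyScheme("//google.com/search")); // google.com
console.log(hostFromAnyScheme("https://www.google.co.in/imghp")); // google.co.in
```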


answered Jan 17 '17 at 16:40


Ashoka Lella · Answer 4 · 2014-09-06T18:42:59.983

10

Here's a solution ignoring everything before ://

.*\://?([^\/]+)

In case you want to ignore www. (with the dot escaped so it matches only a literal dot):

.*\://(?:www\.)?([^\/]+)
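A small sketch of the www-stripping pattern in use. The dot after www is written as \. here because an unescaped dot matches any character; the second pattern shows what that looseness does to a host such as wwwa.example.com. All names are illustrative.

```javascript
// Dot escaped: only a literal "www." prefix is skipped.
const afterScheme = /.*:\/\/(?:www\.)?([^\/]+)/;
// Dot unescaped: "www" plus ANY character is skipped.
const looseDot = /.*:\/\/(?:www.)?([^\/]+)/;

// hostAfterScheme is a hypothetical helper name.
function hostAfterScheme(pattern, url) {
  const match = pattern.exec(url);
  return match ? match[1] : null;
}

console.log(hostAfterScheme(afterScheme, "http://www.google.co.in/imghp")); // google.co.in
console.log(hostAfterScheme(afterScheme, "http://wwwa.example.com/x"));     // wwwa.example.com
console.log(hostAfterScheme(looseDot, "http://wwwa.example.com/x"));        // .example.com
```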

answered Sep 06 '14 at 18:30


Nice. Thanks. But, I also need to ignore the "www." part. How can I do that? – sunilkumarba Sep 06 '14 at 18:35
So, the final regular expression is **.*\:\/\/(?:www.)?([^\/]+)** – sunilkumarba Sep 06 '14 at 18:38
1

What purpose does the "?" after (?:www.) serve? I'm curious. Thanks for the help by the way :) – sunilkumarba Sep 06 '14 at 19:04
1

have a look at this http://www.regular-expressions.info/optional.html – Ashoka Lella Sep 06 '14 at 19:12

score 3 · Answer 5 · edited Sep 06 '14 at 19:17

3

Your regular expression is nearly right. You only need to remove the square brackets, which turn the whole prefix into a character class. The final expression is:

^(?:http:\/\/|www\.|https:\/\/)([^\/]+)
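To see why the brackets matter: [^...] is a negated character class, so the first token of the question's pattern matches a single character that is none of ( ) ? : h t p / | w . s, rather than skipping a prefix. A host starting with "p" or "t" therefore loses its leading characters. A sketch comparing the two patterns (variable names are illustrative):

```javascript
// The pattern from the question: starts with a negated character class.
const original = /[^(?:http:\/\/|www\.|https:\/\/)]([^\/]+)/i;
// The pattern from this answer: an anchored, non-capturing alternation.
const fixed = /^(?:http:\/\/|www\.|https:\/\/)([^\/]+)/i;

console.log("http://mplay.google.co.in/sadfask".match(original)[0]); // mplay.google.co.in
console.log("http://play.google.co.in/sadfask".match(original)[0]);  // lay.google.co.in
console.log("http://tplay.google.co.in/sadfask".match(original)[0]); // lay.google.co.in

console.log("http://play.google.co.in/sadfask".match(fixed)[1]);     // play.google.co.in
// The alternation consumes only one prefix, so "www." after "http://" is kept:
console.log("http://www.google.com/x".match(fixed)[1]);              // www.google.com
```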

Hope it's useful!

answered Sep 06 '14 at 19:08

Academia


Wil · Answer 6 · 2023-03-07T17:11:18.923

This JavaScript Regex using Named Capturing Groups breaks the link / URL up into its functional components:

console.log("https://www.sub.domain.google.com:443/maps/place/Arc+De+Triomphe/@48.8737917,2.2928388,17z?query=1&foo#hash".match(/^(?<protocol>https?:\/\/)(?=(?<fqdn>[^:/]+))(?:(?<service>www|ww\d|cdn|ftp|mail|pop\d?|ns\d?|git)\.)?(?:(?<subdomain>[^:/]+)\.)*(?<domain>[^:/]+\.[a-z0-9]+)(?::(?<port>\d+))?(?<path>\/[^?]*)?(?:\?(?<query>[^#]*))?(?:#(?<hash>.*))?/i).groups)

output:

{
  "protocol": "https://",
  "fqdn": "www.sub.domain.google.com",
  "service": "www",
  "subdomain": "sub.domain",
  "domain": "google.com",
  "port": "443",
  "path": "/maps/place/Arc+De+Triomphe/@48.8737917,2.2928388,17z",
  "query": "query=1&foo",
  "hash": "hash"
}

So you can use whatever components you like
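For the original question (subdomain + domain without "www."), one possible sketch reads the fqdn and service groups and strips the prefix only when the service is www. The helper name hostWithoutWww is hypothetical. It relies on fqdn rather than on subdomain + domain because the service group also absorbs prefixes such as mail or cdn.

```javascript
// Pattern copied from the answer above.
const urlParts = /^(?<protocol>https?:\/\/)(?=(?<fqdn>[^:/]+))(?:(?<service>www|ww\d|cdn|ftp|mail|pop\d?|ns\d?|git)\.)?(?:(?<subdomain>[^:/]+)\.)*(?<domain>[^:/]+\.[a-z0-9]+)(?::(?<port>\d+))?(?<path>\/[^?]*)?(?:\?(?<query>[^#]*))?(?:#(?<hash>.*))?/i;

// hostWithoutWww is a hypothetical helper name.
function hostWithoutWww(url) {
  const match = urlParts.exec(url);
  if (!match) return null;
  const { fqdn, service } = match.groups;
  // Drop the 4 characters of "www." only when that is the matched service prefix.
  return service && service.toLowerCase() === "www" ? fqdn.slice(4) : fqdn;
}

console.log(hostWithoutWww("https://www.sub.domain.google.com:443/maps?query=1")); // sub.domain.google.com
console.log(hostWithoutWww("https://mail.google.com/mail/u/0/#inbox"));            // mail.google.com
console.log(hostWithoutWww("https://play.google.com/store"));                      // play.google.com
```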

blueseal · Answer 7 · 2021-11-24T17:13:44.137

I know I am late to the party but I want to answer the question with some extra useful info.

Get the domain name from a link using regex.

^(https?:\/\/)?(www\.)?([^\/]+)

Here is the link to above regex.

If you want to get the subdomain, take the host from the third capture group of the above regex and split it at the first occurrence of a dot (.).
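A minimal sketch of that split, assuming the host really does begin with a subdomain. The helper name splitHost is hypothetical. For a host with no subdomain, such as google.co.in, the naive split reports "google" as the subdomain, so treat the result with care.

```javascript
const hostPattern3 = /^(https?:\/\/)?(www\.)?([^\/]+)/;

// splitHost is a hypothetical helper name; it splits the host at its first dot.
function splitHost(url) {
  const match = hostPattern3.exec(url);
  if (!match) return null;
  const host = match[3];
  const dot = host.indexOf(".");
  if (dot === -1) return { subdomain: null, domain: host };
  return { subdomain: host.slice(0, dot), domain: host.slice(dot + 1) };
}

console.log(splitHost("https://play.google.com/store"));  // { subdomain: 'play', domain: 'google.com' }
console.log(splitHost("https://www.google.co.in/imghp")); // { subdomain: 'google', domain: 'co.in' }
```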

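Note: in the single-run timings below, the regex came out about 15x faster than Node's built-in url module (0.055 ms versus 0.840 ms). Keep in mind that the second timing also includes the require('url') call, so the comparison is rough.

JavaScript example with regex: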

console.time('time2');
const pttrn = /^(https?:\/\/)?(www\.)?([^\/]+)/gm
const urlInfo = pttrn.exec("https://www.google.co.in/imghp");
console.timeEnd('time2');

//time2: 0.055ms
console.log(urlInfo[0]) // https://www.google.co.in
console.log(urlInfo[1]) // https://
console.log(urlInfo[2]) // www.
console.log(urlInfo[3]) // google.co.in

Node.js with the built-in url module:

console.time('time');
const url = require('url');
const urlInfo = url.parse("https://www.google.co.in/imghp");
console.timeEnd('time');

//time: 0.840ms;
console.log(urlInfo.hostname) //www.google.co.in
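A single console.time measurement is noisy, so a somewhat steadier comparison could average over many calls and keep require outside the timed region. The sketch below assumes Node.js (process.hrtime.bigint); averageNs is a hypothetical helper, and no result is quoted because the numbers depend on the machine and Node version.

```javascript
const url = require("url"); // loaded once, outside the timed region

const pattern = /^(https?:\/\/)?(www\.)?([^\/]+)/;
const sample = "https://www.google.co.in/imghp";

// averageNs is a hypothetical helper: mean nanoseconds per call over many iterations.
function averageNs(fn, iterations = 20000) {
  fn(); // warm-up call, not timed
  const start = process.hrtime.bigint();
  for (let i = 0; i < iterations; i++) fn();
  return Number(process.hrtime.bigint() - start) / iterations;
}

const regexNs = averageNs(() => pattern.exec(sample));
const parseNs = averageNs(() => url.parse(sample)); // legacy API, as used in the example above
console.log(`regex: ${regexNs.toFixed(0)} ns/call, url.parse: ${parseNs.toFixed(0)} ns/call`);
```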
