I wrote a JavaScript routine that, given a hostname or a URL, finds the root domain.

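function getRootDomain(s){
  var sResult = '';
  try {
    // Step 1: skip an optional protocol, then capture the host up to the
    // first character that is not a word character, hyphen, or dot.
    sResult = s.match(/^(?:.*\:\/?\/)?(?<domain>[\w\-\.]*)/).groups.domain
      // Step 2: keep the last label plus the TLD at the end of that capture.
      .match(/(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))$/).groups.root;
  } catch(ignore) {} // no match (e.g. about:blank): return ''
  return sResult;
}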

What is the technique to combine the two regex rules into one rule?

I used this tutorial to try to advance my existing RegExp experience, although I've never really understood lookbehinds and lookaheads (which might be useful here?), and then used the great tool at RegEx101.com for trial and error. I tried replacing what comes after <domain> with what comes after <root>, and variations on that, but they all failed.

A test set to use with a tool like RegEx101 could be:

https://test.com:8080/?id=4&re=3
https://test-test.com:8080/?id=4&re=3
https://data.test.com:8080/?id=4&re=3
https://data.test.com/?id=4&re=3
https://data.test.com/
https://data.test.com#testing
https://data.test.com/#testing
https://data.test.com:8080/#testing
https://data.test.com:8080#testing
https://data.tester.com/
https://data-test.test.com/
https://test.com
https://test.com#testing
https://test.com/
https://test.am/?id=4
https://test.com?id=3&re=3
https://test.com/?id=3&re=3
https://megatest.com/?id=3&re=3

test.com
data.test.co.uk
test.co
data.test.com
data.tester-test.com
data-test.tester-test.com
tester-test.com
about:blank
Volomike
  • Oh, I just noticed that you're the one who posted the answer this was taken from. I thought it was someone else asking how to improve on your answer. – Barmar May 02 '22 at 21:47
  • 1
    I saw your reputation when I was looking at the other answer. Like I said, I didn't notice that it was you posting this one. – Barmar May 02 '22 at 21:56

1 Answer

The second regexp uses the $ assertion to only match the end of the .domain capture.

The first RegExp, however, stops matching after the domain, when it meets a /, a ?, a #, a :, or the end of the string if there is no path, query string, or hash part. So you can't simply reuse the $ assertion: it would fail whenever anything follows the domain.
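To see why the $ can't be carried over as-is, here is a small check using the two patterns from the question, unchanged. The sample URL comes from the question's test set; the variable names are only for illustration.

```javascript
// The question's two patterns, unchanged.
const domainRe = /^(?:.*\:\/?\/)?(?<domain>[\w\-\.]*)/;
const rootRe = /(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))$/;

const url = 'https://data.test.com:8080/?id=4&re=3';

// Step 1 stops at the first character outside [\w\-\.], here the ':' before the port.
const domain = url.match(domainRe).groups.domain;
console.log(domain); // data.test.com

// Step 2 works because '$' is the end of that short capture.
const root = domain.match(rootRe).groups.root;
console.log(root); // test.com

// Against the whole URL, '$' is the end of the URL instead, so nothing matches.
console.log(url.match(rootRe)); // null
```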

To combine both parts, you could replace the domain capture with this:

.*?(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)

(?:[\/?#:]|$) at the end is a non-capturing group that matches either one of the terminating characters (/, ?, #, or :) or the end of the string.

.*? lazily (non-greedily) matches anything. That is, it first tries to match the root capture followed by (?:[\/?#:]|$). Every time that fails, it consumes one more character and tries again, which lets the pattern search forward for the root.
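A rough way to watch the lazy search at work is to capture what .*? consumed. In this sketch the skipped group is a hypothetical addition for illustration only, and the leading ^ stands in for the start of the full pattern; the inputs are protocol-less variants of the question's test strings.

```javascript
// 'skipped' is a hypothetical extra group, added only to show what .*? consumed.
// The leading ^ stands in for the start of the full pattern.
const re = /^(?<skipped>.*?)(?<root>[\w\-]*(\.\w{3,}|\.\w{2}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/;

const hosts = ['data.test.com:8080/#testing', 'data.test.co.uk', 'test.com?id=3'];
const results = hosts.map((host) => host.match(re).groups);

for (const { skipped, root } of results) {
  console.log(JSON.stringify(skipped), '->', root);
}
// "data." -> test.com
// "data." -> test.co.uk
// "" -> test.com
```

In the first two cases the engine gives up on every earlier starting point (the root would have to be followed by a dot) and only succeeds once .*? has consumed "data.".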

Also:

  • you can combine \.\w{3,}|\.\w{2} into just \.\w{2,}.

  • you can use a non-capturing group around the TLDs: (?:...) instead of (...).

  • It would be better to use .*? to get the protocol, or you could end up globbing too much (with a greedy .*, passing https://example.com/#://bar.com would return bar.com).

  • You don't need to escape the :. In unicode mode, that escape is actually a syntax error.
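Two of the points above are easy to check by hand. This is a minimal sketch: the greedy and lazy patterns differ only in the protocol quantifier, and compiles is a hypothetical helper, not part of any library.

```javascript
const tricky = 'https://example.com/#://bar.com';

// Greedy: .* runs to the LAST "://", so the host is read from inside the fragment.
const greedy = /^(?:.*:\/\/)?(?<domain>[\w\-.]*)/;
// Lazy: .*? stops at the FIRST "://".
const lazy = /^(?:.*?:\/\/)?(?<domain>[\w\-.]*)/;

const greedyDomain = tricky.match(greedy).groups.domain;
const lazyDomain = tricky.match(lazy).groups.domain;
console.log(greedyDomain); // bar.com
console.log(lazyDomain);   // example.com

// Hypothetical helper: does this pattern source compile with these flags?
function compiles(source, flags) {
  try {
    new RegExp(source, flags);
    return true;
  } catch (e) {
    return false;
  }
}

console.log(compiles('\\:', ''));  // true:  \: is tolerated without the u flag
console.log(compiles('\\:', 'u')); // false: invalid escape in unicode mode
console.log(compiles(':', 'u'));   // true:  a bare colon is always fine
```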

This results in:

const x = /^(?:.*?:\/\/)?.*?(?<root>[\w\-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/
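As a sketch of how the single pattern could replace the two-step routine from the question: the name rootRe and the null check are my own choices here, and the samples are taken from the question's test set.

```javascript
// The combined pattern from above, under a more descriptive (hypothetical) name.
const rootRe = /^(?:.*?:\/\/)?.*?(?<root>[\w\-]*(?:\.\w{2,}|\.\w{2}\.\w{2}))(?:[\/?#:]|$)/;

function getRootDomain(s) {
  const m = rootRe.exec(s);
  // exec returns null when nothing matches, so no try/catch is needed.
  return m ? m.groups.root : '';
}

const samples = [
  'https://test.com:8080/?id=4&re=3',
  'https://data.test.com#testing',
  'https://data-test.test.com/',
  'https://test.am/?id=4',
  'data.test.co.uk',
  'data-test.tester-test.com',
  'about:blank',
];

for (const s of samples) {
  console.log(s, '->', JSON.stringify(getRootDomain(s)));
}
```

Like the original, this returns an empty string for inputs without a dotted host, such as about:blank.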

I actually wrote a RegExp builder that may help you get further in your RegExp learning journey... Here's your RegExp ported to compose-regexp

Pygy
  • 1
    This works. You'll want to make a small revision to handle potential port numbers on the domains such as `https://data.test.com:8080/`. Here's the change I made: `/^(?:.*\:\/?\/)?.*?(?<root>[\w\-]*(\.\w{2,}|\.\w{2}\.\w{2}))(?:[\:\/?#]|$)/` I ran it through regex101.com. – Volomike May 04 '22 at 03:49
  • 1
    Great catch @Volomike, thanks! I've updated the answer accordingly. (And thanks to your upvote, I can at long last comment here, which is sweet :-) – Pygy May 04 '22 at 06:49
  • 1
    @volomike, I've tweaked the response a bit with further refinements. Hopefully you'll find them helpful – Pygy May 04 '22 at 12:06
  • In this part `^(?:.*?:\/\/?)?`, why the second-to-last `?` ? Shouldn't it be `^(?:.*?:\/\/)?` ? See, the `?` alone would mean "previous character is optional", when in fact all the characters there are optional -- one might see an https://, http://, ftp://, or perhaps just start with the domain. – Volomike May 04 '22 at 20:53
  • It was present in your RegExp (applied to the first `/`; I moved it to the second to put `':/'` in the same string). I thought you had a good reason to have it there... – Pygy May 04 '22 at 21:21
  • I see. Yeah, that was my mistake. It was improved when you wrapped and added the `?` afterwards on the block (instead of individual characters) to mean "anything in there is optional". – Volomike May 04 '22 at 21:24
  • 1
    I went ahead with my editing power and edited your answer to remove that extra ? that was unnecessary. – Volomike May 05 '22 at 08:22