Extract top level domain from URL

Question

I need to fix this regex so that it grabs only the domain: no subdomains, and no folders or file names after the top-level domain. I have made a start and need help finishing it.

There are many variations to take into consideration:

http or https
www or not
multiple subdomains
trailing slash at the end of the URL
folder after the top-level domain

Here is the link with the first part done Link

The top 5 are working, but the bottom 3, which have a folder and file name, are not.

Here is my regex so far: /([a-zA-Z0-9-]+)(\.[a-zA-Z]{2,5})?(\.[a-zA-Z]+$)

The results should be:

domain.com
masterdomain.com.au
luxury.co.uk
globo.us
test.com
google.com.br

Possible duplicate of [How do I parse a URL into hostname and path in javascript?](https://stackoverflow.com/questions/736513/how-do-i-parse-a-url-into-hostname-and-path-in-javascript) — Alex, Jun 28 '19 at 20:47
The proposed duplicate does not attempt to reduce or parse the hostname part so you get just the domain part. There are many other duplicates, many of which are horrible hacks, but the short version is, use the [Public Suffix List.](https://publicsuffix.org/) — tripleee, Jan 03 '20 at 06:22

K.Dᴀᴠɪs · Accepted Answer · 2019-06-28T21:32:01.263

3

You can try something like this:

((?<![^\/]\/)\b\w+\.\b\w{2,3}(?:\.\b\w{2})??)(?:$|\/)

Demo

Breaking Down the Pattern:

`(?<![^\/]\/)` Ensures that the match is not preceded by a single slash (otherwise the index.php in /index.php would look like a domain), while still allowing it to be preceded by a double slash (as in https://).
`\b\w+\.` Matches the main domain. The word boundary on the left and the required dot on the right make sure a whole word is matched; without the `\b`, the pattern would match everything but the i in /index.php.
`\b\w{2,3}` Matches the top-level domain (the com in .com).
`(?:\.\b\w{2})?` Optional; matches the country-specific TLD if present. The `)` that follows it in the full pattern closes the capturing group.
`(?:$|\/)` Requires that the match is followed by either the end of the string `$` or a forward slash `\/`.
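
As a rough sketch of how the pattern might be used from JavaScript: the helper name `extractDomainCaptured` and the sample URLs are assumptions made up for illustration (the question's own test list is behind its link), and the regex is the one posted above. It needs an engine that supports lookbehind.

```javascript
// Pattern as posted above. Group 1 holds the domain; the trailing
// (?:$|\/) consumes the slash (if any) without capturing it.
const DOMAIN_WITH_GROUP = /((?<![^\/]\/)\b\w+\.\b\w{2,3}(?:\.\b\w{2})??)(?:$|\/)/;

// Hypothetical helper: returns the first submatch, or null if nothing matches.
function extractDomainCaptured(url) {
  const match = DOMAIN_WITH_GROUP.exec(url);
  return match ? match[1] : null;
}

// Assumed sample inputs, not the question's original list.
const capturedSamples = [
  "http://domain.com",
  "https://www.masterdomain.com.au/",
  "https://sub.one.luxury.co.uk/folder/",
  "http://www.globo.us/index.php",
  "https://test.com/folder/file.html",
  "https://www.google.com.br/search/index.php",
];

for (const url of capturedSamples) {
  console.log(url, "->", extractDomainCaptured(url));
}
```

Reading `match[1]` rather than `match[0]` is what drops the trailing slash from the result.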

Alternative that uses lookahead instead of capture group:

(?<![^\/]\/)\b\w+\.\b\w{2,3}(?:\.\b\w{2})?(?=$|\/)

Essentially, you remove the capturing group, and replace the non-capturing group at the end (?:$|\/) with a positive lookahead (?=$|\/).
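
A minimal JavaScript sketch of the lookahead variant, assuming an engine with lookbehind support. The helper name `extractDomainLookahead` and the sample URLs are hypothetical. Because the lookahead consumes nothing, the full match can be used directly.

```javascript
// Lookahead variant: (?=$|\/) checks for end of string or a slash
// without making it part of the match.
const DOMAIN_WITH_LOOKAHEAD = /(?<![^\/]\/)\b\w+\.\b\w{2,3}(?:\.\b\w{2})?(?=$|\/)/;

// Hypothetical helper: returns the full match, or null if nothing matches.
function extractDomainLookahead(url) {
  const match = url.match(DOMAIN_WITH_LOOKAHEAD);
  return match ? match[0] : null;
}

// Assumed sample inputs for illustration.
console.log(extractDomainLookahead("https://www.masterdomain.com.au/"));
console.log(extractDomainLookahead("http://www.test.com/folder/index.php"));
```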

Demo

edited Jun 28 '19 at 21:32

answered Jun 28 '19 at 21:09


That's great! It works like a charm. Now I just need to eliminate the forward slash when it is matched. Thanks a lot. If it is common in the community, I'd like to buy you a cup of coffee. – user3352263 Jun 28 '19 at 21:23
If you are talking about the forward slash that matches at the end `(?:$|\/)`, that is why I placed the entire match (other than this slash) in a capturing group `(...)`. If you were to return the first **sub**match, you would get everything other than the forward slash at the end. Hence, you'd only return the green in the Demo above, and exclude the blue. – K.Dᴀᴠɪs Jun 28 '19 at 21:26
Now I get it. I'm kind of new to regex. That's perfect. Thank you very much! – user3352263 Jun 28 '19 at 21:29
Or, you can avoid using the capture group and go with a positive lookahead instead: `(?<![^\/]\/)\b\w+\.\b\w{2,3}(?:\.\b\w{2})?(?=$|\/)` – K.Dᴀᴠɪs Jun 28 '19 at 21:30
@K.Dᴀᴠɪs Thanks a lot. Can the same pattern be used in VBA? I tried to apply it in VBA but I got a VALUE error. – YasserKhalil Jul 10 '19 at 04:24
@K.Dᴀᴠɪs I just realized Firefox is throwing a SyntaxError: invalid regexp group. Do you know what that might be in regards to the syntax you provided? – user3352263 Jul 11 '19 at 16:49
@user3352263 Firefox doesn't support negative lookbehinds. Use Chrome. – K.Dᴀᴠɪs Jul 11 '19 at 17:09
@user3352263 & YasserKhalil: Since you can't use negative lookbehinds, you can try this [`(?:\/{2}|[^/])(\b\w+\.\b\w{2,3}(?:\.\b\w{2})?)(?=$|\/)`](https://regex101.com/r/iIYoHa/3). But this brings back the submatches, so you will need to know how to only return the submatches instead of the full match. – K.Dᴀᴠɪs Jul 11 '19 at 17:28
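
A hedged JavaScript sketch of returning only the submatch with the lookbehind-free pattern from the comment above. The helper name `extractDomainNoLookbehind` and the sample URLs are assumptions for illustration, and the `/` inside the character class is escaped here for the regex literal.

```javascript
// Lookbehind-free variant: the leading (?:\/{2}|[^\/]) consumes either "//"
// or one non-slash character, so the domain itself lands in group 1.
const DOMAIN_NO_LOOKBEHIND = /(?:\/{2}|[^\/])(\b\w+\.\b\w{2,3}(?:\.\b\w{2})?)(?=$|\/)/;

// Hypothetical helper: returns the first submatch, or null if nothing matches.
function extractDomainNoLookbehind(url) {
  const match = DOMAIN_NO_LOOKBEHIND.exec(url);
  return match ? match[1] : null;
}

// Assumed sample inputs for illustration.
console.log(extractDomainNoLookbehind("https://www.google.com.br/search/index.php"));
console.log(extractDomainNoLookbehind("http://domain.com"));
```

One consequence of the leading group is that it needs at least one character in front of the domain, so a bare `domain.com` with no scheme does not match in this sketch.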
@YasserKhalil See above. I can only ping one person per comment. – K.Dᴀᴠɪs Jul 11 '19 at 17:32
@K.Dᴀᴠɪs You are the best! It works perfectly. Thank you one more time! – user3352263 Jul 11 '19 at 20:30
Hate to rain on your parade, but this does not "work perfectly". A perfect solution requires knowledge of the specific subdomain policies of each individual TLD; the [Public Suffix List](https://publicsuffix.org) provides such a database. – tripleee Jan 03 '20 at 06:17
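
To illustrate the suffix-list idea in the comment above, here is a minimal JavaScript sketch. It assumes a tiny hand-picked set of suffixes (`SAMPLE_SUFFIXES`, hypothetical) standing in for the real Public Suffix List, and it ignores the wildcard and exception rules the real list contains, so it is not a substitute for the list itself. The helper name `registrableDomain` is also made up.

```javascript
// ASSUMPTION: a hand-picked subset for illustration only.
// The real Public Suffix List is far larger and has wildcard/exception rules.
const SAMPLE_SUFFIXES = new Set([
  "com", "us", "au", "uk", "br",
  "com.au", "co.uk", "com.br",
]);

// Hypothetical helper: returns the known suffix plus one label to its left,
// or null when no suffix in the sample set applies.
function registrableDomain(url) {
  const labels = new URL(url).hostname.toLowerCase().split(".");
  // Walk from the longest candidate suffix to the shortest,
  // so "co.uk" is found before "uk".
  for (let i = 1; i < labels.length; i++) {
    const suffix = labels.slice(i).join(".");
    if (SAMPLE_SUFFIXES.has(suffix)) {
      return labels.slice(i - 1).join(".");
    }
  }
  return null;
}

// Assumed sample inputs for illustration.
console.log(registrableDomain("https://www.luxury.co.uk/folder/"));
console.log(registrableDomain("https://sub.one.masterdomain.com.au/x"));
```

The hostname is taken from the built-in `URL` parser, so the lookup only has to decide where the suffix starts instead of guessing TLD lengths.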
To account for a hyphen in the domain: `((?<![^\/]\/)\b[\-\w]+\.\b\w{2,3}(?:\.\b\w{2})?)(?:$|\/)` – Chester Fung Jun 05 '22 at 05:48
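
A small JavaScript sketch contrasting the two patterns on a hyphenated host. The helper name `firstGroup` and the URL `https://www.my-site.com/page` are assumptions for illustration, and lookbehind support is required.

```javascript
// Pattern from the accepted answer: \w+ cannot cross a hyphen.
const DOMAIN_PLAIN = /((?<![^\/]\/)\b\w+\.\b\w{2,3}(?:\.\b\w{2})?)(?:$|\/)/;
// Variant from the comment above: [\-\w]+ also accepts hyphens.
const DOMAIN_HYPHEN = /((?<![^\/]\/)\b[\-\w]+\.\b\w{2,3}(?:\.\b\w{2})?)(?:$|\/)/;

// Hypothetical helper: returns group 1 of the given pattern, or null.
function firstGroup(pattern, url) {
  const match = pattern.exec(url);
  return match ? match[1] : null;
}

// Assumed sample input for illustration.
const hyphenatedUrl = "https://www.my-site.com/page";
console.log(firstGroup(DOMAIN_PLAIN, hyphenatedUrl));  // only the part after the hyphen
console.log(firstGroup(DOMAIN_HYPHEN, hyphenatedUrl)); // the whole hyphenated domain
```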

score 0 · Answer 2 · answered Jun 28 '19 at 20:49

One option is this expression, which uses non-capturing groups, if that would be OK:

^(?:https?:\/\/)(?:www\.)?([^\/\s]+)$|^(?:https?:\/\/)(?:www\.)?([^\/\s]+)(?:.*)$
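
A minimal JavaScript sketch of reading the result from this expression. The helper name `extractHost` and the sample URLs are hypothetical. Since each side of the alternation has its own group, the value ends up in either group 1 or group 2.

```javascript
// Expression as posted: the first branch handles URLs that end at the host,
// the second branch handles URLs with anything after it.
const HOST_PATTERN = /^(?:https?:\/\/)(?:www\.)?([^\/\s]+)$|^(?:https?:\/\/)(?:www\.)?([^\/\s]+)(?:.*)$/;

// Hypothetical helper: returns whichever group participated, or null.
function extractHost(url) {
  const match = HOST_PATTERN.exec(url);
  if (!match) return null;
  return match[1] || match[2];
}

// Assumed sample inputs for illustration.
console.log(extractHost("https://www.domain.com"));
console.log(extractHost("http://sub.domain.com/folder/file.html"));
```

In this sketch the scheme is required, only a leading `www.` is stripped, and any other subdomain stays in the result.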

Demo

Emma

That's a very good idea, but the problem is that it is capturing subdomains now. The only part I need is the domain. My link shows how to get it, but it fails when there is something after the domain. Perhaps I could add a non-capturing group to avoid that. Do you know how to do it? – user3352263 Jun 28 '19 at 21:16
