Regular expression to retrieve domain.tld

Question

I need a regular expression in Java that I can use to retrieve the domain.tld part from any URL. So https://foo.com/bar, http://www.foo.com#bar, and http://bar.foo.com should all return foo.com.

I wrote this regex, but it's matching the whole URL:

Pattern.compile("[.]?.*[.x][a-z]{2,3}");

I'm not sure I'm matching the "." character correctly. I tried "." but I get an error from NetBeans.

Update:

The tld is not limited to 2 or 3 characters, and http://www.foo.co.uk/bar should return foo.co.uk.

Duplicate http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url — Gumbo, May 14 '09 at 13:27
Actually not an exact duplicate, as the other question tries to remove the tld part as well as some second-level parts like ".co.uk". But the only difference is whether you capture that part. I guess he'd want http://www.foo.co.uk/ to give foo.co.uk — MSalters, May 14 '09 at 13:38
do you know there are four letter TLDs like "info" and "name"? I think you missed that, because you got that "{2,3}" in your regular expression. Secondly, if you want to match the dot, you have to escape it like this "\\." — Tim Büthe, May 14 '09 at 13:59
I found this answer very useful: http://stackoverflow.com/a/4820675/1740705. — Philipp, Nov 20 '14 at 11:13
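
To illustrate the escaping point from the comments above, here is a minimal sketch. The class and field names are made up for this example. It compares an unescaped dot with the two common ways of matching a literal dot in a Java regex string.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, only for illustrating dot escaping.
final class DotEscapeSketch {
    // An unescaped "." matches any single character.
    static final Pattern ANY_CHAR = Pattern.compile("foo.com");
    // "\\." in Java source is the regex \. and matches only a literal dot.
    static final Pattern LITERAL_DOT = Pattern.compile("foo\\.com");
    // A character class is an alternative that needs no backslashes.
    static final Pattern CLASS_DOT = Pattern.compile("foo[.]com");

    public static void main(String[] args) {
        System.out.println(ANY_CHAR.matcher("fooXcom").matches());
        System.out.println(LITERAL_DOT.matcher("fooXcom").matches());
        System.out.println(CLASS_DOT.matcher("foo.com").matches());
    }
}
```

A single backslash before the dot inside a Java string literal is not a valid escape sequence, which may be the kind of error the IDE reported.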

score 10 · Answer 1 · edited Oct 21 '18 at 05:40

This is harder than you might imagine. Your example https://foo.com/bar, has a comma in it, which is a valid URL character. Here is a great post about some of the troubles:

https://blog.codinghorror.com/the-problem-with-urls/

A good starting point is:

https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])

Some listings from "Mastering Regular Expressions" on this topic:

http://regex.info/listing.cgi?ed=3&p=207

@sjobe

>>> import re
>>> pattern = r'https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])'
>>> url = re.compile(pattern)
>>> url.match('http://news.google.com/').groups()
('news.google.com/',)
>>> url.match('not a url').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
>>> url.match('http://google.com/').groups()
('google.com/',)
>>> url.match('http://google.com').groups()
('google.com',)

Sorry, the example is in Python rather than Java; it's more brief. Java requires some extraneous escaping of the regex.
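
For comparison, a rough Java counterpart of the Python session might look like the sketch below. The class and method names are invented for this example. As far as I can tell, this particular pattern contains no backslashes, so the string literal can be copied unchanged; patterns that use \w, \. and similar would need doubled backslashes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern quoted in this answer.
final class UrlPatternSketch {
    static final Pattern URL_PATTERN = Pattern.compile(
            "https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])");

    // Returns group 1 (everything after the scheme), or null if the input
    // does not start with an http/https URL.
    static String afterScheme(String input) {
        Matcher m = URL_PATTERN.matcher(input);
        // lookingAt() anchors at the start only, similar to Python's re.match.
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(afterScheme("http://news.google.com/"));
        System.out.println(afterScheme("not a url"));
    }
}
```

Like the Python version, this captures the whole remainder of the URL rather than only the domain.tld part, so it is a starting point and not a complete solution.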

answered May 14 '09 at 13:25 by jsamsa · edited Oct 21 '18 at 05:40 by Cœur

I don't think he meant the comma to be part of the URL; he was just separating a list – RC1140 May 14 '09 at 13:27
That's my point, it's ambiguous. How should the regex determine if the comma is part of the URL or not? – jsamsa May 14 '09 at 13:32
Doesn't matter anyway, as he's interested in "domain.tld" part of an http URL. There's no comma in that part. – MSalters May 14 '09 at 13:35
I tried that regular expression [added a ')' at the end] https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]) but it's not matching any url's. I'm trying "http://news.google.com" and "http://www.google.com" – sjobe May 14 '09 at 14:31
Your codinghorror link is broken. I suppose this is it: http://blog.codinghorror.com/the-problem-with-urls/ – John Aug 03 '15 at 22:06

score 8 · Accepted Answer · answered May 14 '09 at 15:47

I would use the java.net.URI class to extract the host name, and then use a regex to extract the last two parts of the host name.

import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RunIt {

    public static void main(String[] args) throws URISyntaxException {
        Pattern p = Pattern.compile(".*?([^.]+\\.[^.]+)");

        String[] urls = new String[] {
                "https://foo.com/bar",
                "http://www.foo.com#bar",
                "http://bar.foo.com"
        };

        for (String url:urls) {
            URI uri = new URI(url);
            //eg: uri.getHost() will return "www.foo.com"
            Matcher m = p.matcher(uri.getHost());
            if (m.matches()) {
                System.out.println(m.group(1));
            }
        }
    }
}

Prints:

foo.com
foo.com
foo.com

That's actually what I ended up doing. – sjobe May 14 '09 at 16:02
And what about domain names like foobar.co.uk? – Gumbo May 14 '09 at 16:22

Qtax · Answer 3 · 2009-05-14T16:12:55.473

If the string contains a valid URL then you could use a regex like (Perl quoting):

/^
(?:\w+:\/\/)?
[^:?#\/\s]*?

(
[^.\s]+
\.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
)

(?:[:?#\/]|$)
/xi;

Results:

url: https://foo.com/bar
matched: foo.com
url: http://www.foo.com#bar
matched: foo.com
url: http://bar.foo.com
matched: foo.com
url: ftp://foo.com
matched: foo.com
url: ftp://www.foo.co.uk?bar
matched: foo.co.uk
url: ftp://www.foo.co.uk:8080/bar
matched: foo.co.uk

For Java it would be quoted something like:

"^(?:\\w+://)?[^:?#/\\s]*?([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au|___etc___))(?:[:?#/]|$)"

Of course you'll need to replace the etc part.
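
As a rough sketch of how the Java-quoted pattern could be used, the example below compiles it with only the suffixes listed above (the etc placeholder is simply left out). The class and method names are made up for illustration. It assumes Pattern.CASE_INSENSITIVE as a stand-in for Perl's /i flag, and it uses find() because the pattern does not consume the rest of the URL.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical wrapper around the pattern from this answer.
final class DomainRegexSketch {
    // Only the suffixes shown in the answer; a real list would be much longer.
    private static final Pattern DOMAIN = Pattern.compile(
            "^(?:\\w+://)?[^:?#/\\s]*?"
            + "([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au))"
            + "(?:[:?#/]|$)",
            Pattern.CASE_INSENSITIVE); // stands in for Perl's /i

    // Returns the captured domain, or null when nothing matches.
    static String domainOf(String url) {
        Matcher m = DOMAIN.matcher(url);
        // find() instead of matches(): the path, query and fragment are not consumed.
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(domainOf("ftp://www.foo.co.uk:8080/bar"));
    }
}
```

With matches() the pattern would have to cover the whole input, which would require something like a trailing .* after the terminator.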

Example Perl script:

use strict;

my @test = qw(
    https://foo.com/bar
    http://www.foo.com#bar
    http://bar.foo.com
    ftp://foo.com
    ftp://www.foo.co.uk?bar
    ftp://www.foo.co.uk:8080/bar
);

for(@test){
    print "url: $_\n";

    /^
    (?:\w+:\/\/)?
    [^:?#\/\s]*?

    (
    [^.\s]+
    \.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
    )

    (?:[:?#\/]|$)
    /xi;

    print "matched: $1\n";
}

I forgot to double escape the first \w in the beginning of the string, should be "\\w". If you see any other single backslashes escape them. — Qtax, May 14 '09 at 16:10
I searched on Google for about an hour, and your answer suits my situation best. Thanks. But there seems to be a small problem in the Java regex string; it should be like this: "^(?:\\w+://)?[^:?#/\\s]*?([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au|com.cn|___etc___))(?:[:?#/].*|$)" — SalutonMondo, Dec 24 '13 at 03:00

score 4 · Answer 4 · answered Nov 08 '11 at 20:23

new URL(url).getHost()

No regex needed.
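
A small sketch of that call in context, with a made-up class and method name and the checked exception handled, could look like this. Note that getHost() returns the full host name, subdomains included, so a further step is still needed to cut it down to domain.tld.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper showing what getHost() returns.
final class HostSketch {
    // Returns the host part of the URL, or null if the string is not a valid URL.
    static String host(String url) {
        try {
            return new URL(url).getHost();
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(host("http://www.foo.co.uk/bar"));
        System.out.println(host("https://foo.com/bar"));
    }
}
```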

answered Nov 08 '11 at 20:23 by Amy B

Good, but won't work inside a high throughput loop :) – Ravindranath Akila Jul 31 '15 at 07:27

score 3 · Answer 5 · answered May 14 '09 at 13:34

You're going to need to get a list of all possible TLDs and ccTLDs and then match against them. You have to do this, or else you'll never be able to distinguish between subdomain.dom.com and hello.co.uk.

So, get yourself such a list. I recommend inverting it so you store, for example, uk.co. Then, you can extract the domain from a URL by getting everything between // and / or end of line. Split at . and work backwards, matching the TLD and then 1 additional level to get the domain.
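
The steps above could be sketched roughly as follows. The suffix set is a tiny hypothetical sample, not a real list, and the class and method names are made up for this example. The sketch also ignores user-info and IPv6 hosts.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch of the inverted-suffix lookup described above.
final class SuffixListSketch {
    // Sample entries only, stored inverted ("uk.co" stands for "co.uk").
    private static final Set<String> INVERTED_SUFFIXES = new HashSet<>(
            Arrays.asList("com", "net", "org", "uk", "uk.co", "uk.org", "au.com"));

    // Returns the known suffix plus one additional label, or null if that is not possible.
    static String domainOf(String url) {
        // Take everything between "//" and the next "/", "?", "#" or ":" (or end of input).
        int start = url.indexOf("//");
        String rest = start >= 0 ? url.substring(start + 2) : url;
        String host = rest.split("[/?#:]", 2)[0].toLowerCase();

        // Work backwards through the labels, building the inverted key as we go.
        String[] labels = host.split("\\.");
        StringBuilder key = new StringBuilder();
        int suffixLabels = 0; // label count of the longest known suffix
        for (int i = labels.length - 1; i >= 0; i--) {
            if (key.length() > 0) {
                key.append('.');
            }
            key.append(labels[i]);
            if (INVERTED_SUFFIXES.contains(key.toString())) {
                suffixLabels = labels.length - i;
            }
        }
        if (suffixLabels == 0 || suffixLabels >= labels.length) {
            return null; // unknown suffix, or no label in front of the suffix
        }
        // Keep the suffix plus one additional level.
        return String.join(".",
                Arrays.copyOfRange(labels, labels.length - suffixLabels - 1, labels.length));
    }

    public static void main(String[] args) {
        System.out.println(domainOf("http://www.foo.co.uk/bar"));
        System.out.println(domainOf("http://bar.foo.com"));
    }
}
```

Taking the longest matching suffix is what lets foo.co.uk and foo.com both come out right, since "uk" alone would otherwise win for the first one.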

score 0 · Answer 6 · answered Oct 15 '15 at 20:56

    /[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/

Almost there, but it gives the wrong result when the second-level domain has 3 characters: for www.foo.com it matches the whole host name instead of foo.com.

answered Oct 15 '15 at 20:56 by mel

score 0 · Answer 7 · answered Jul 06 '16 at 11:37

This works for me:

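// TextUtils is presumably android.text.TextUtils, so this assumes an Android project.
public static String getDomain(String url){
    if(TextUtils.isEmpty(url)) return null;
    String domain = null;
    // Strip the scheme, if present.
    if(url.startsWith("http://")) {
        url = url.replace("http://", "").trim();
    } else if(url.startsWith("https://")) {
        url = url.replace("https://", "").trim();
    }
    // Everything before the first "/" is the host, including subdomains and any port.
    String[] temp = url.split("/");
    if(temp != null && temp.length > 0) {
        domain = temp[0];
    }
    return domain;
}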

score 0 · Answer 8 · answered Sep 08 '16 at 09:00

Code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainUrlUtils {
    private static String[] TLD = {"com", "net"}; // top-level domain
    private static String[] SLD = {"co\\.kr"}; // second-level domain

    public static String getDomainName(String url) {
        Pattern pattern = Pattern.compile("(?<=)[^(\\.|\\/)]\\w+\\.(" + joinTldAndSld("|") + ")$");
        Matcher match = pattern.matcher(url);
        String domain = null;

        if (match.find()) {
            domain = match.group();
        }

        return domain;
    }

    private static String joinTldAndSld(String delimiter) {
        String t = String.join(delimiter, TLD);
        String s = String.join(delimiter, SLD);

        return new StringBuilder(t).append(s.isEmpty() ? "" : "|" + s).toString();
    }
}

Test:

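// Assumes JUnit 4 (org.junit.Assert and org.junit.Test).
import org.junit.Assert;
import org.junit.Test;

public class DomainUrlUtilsTest {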

    @Test
    public void getDomainName() throws Exception {
        // given
        String[][] domainUrls = {
            {
                "test.com",
                "sub1.test.com",
                "sub1.sub2.test.com",
                "https://sub1.test.com",
                "http://sub1.sub2.test.com"
            },
            {
                "https://domain.com",
                "https://sub.domain.com"
            },
            {
                "http://domain.co.kr",
                "http://sub.domain.co.kr",
                "http://local.sub.domain.co.kr",
                "http://local-test.sub.domain.co.kr",
                "sub.domain.co.kr",
                "domain.co.kr",
                "test.sub.domain.co.kr"
            }
        };

        String[] expectedUrls = {
            "test.com",
            "domain.com",
            "domain.co.kr"
        };

        // when
        // then
        for (int domainIndex = 0; domainIndex < domainUrls.length; domainIndex++) {
            for (String url : domainUrls[domainIndex]) {
                String convertedUrl = DomainUrlUtils.getDomainName(url);

                if (expectedUrls[domainIndex].equals(convertedUrl)) {
                    System.out.println(url + " -> " + convertedUrl);
                } else {
                    Assert.fail("origin Url: " + url + " / converted Url: " + convertedUrl);
                }
            }
        }
    }
}

Results:

test.com -> test.com
sub1.test.com -> test.com
sub1.sub2.test.com -> test.com
https://sub1.test.com -> test.com
http://sub1.sub2.test.com -> test.com
https://domain.com -> domain.com
https://sub.domain.com -> domain.com
http://domain.co.kr -> domain.co.kr
http://sub.domain.co.kr -> domain.co.kr
http://local.sub.domain.co.kr -> domain.co.kr
http://local-test.sub.domain.co.kr -> domain.co.kr
sub.domain.co.kr -> domain.co.kr
