4

I need a regular expression in Java that I can use to retrieve the domain.tld part from any URL. So https://foo.com/bar, http://www.foo.com#bar, and http://bar.foo.com should all return foo.com.

I wrote this regex, but it matches the whole URL:

Pattern.compile("[.]?.*[.x][a-z]{2,3}");

I'm not sure I'm matching the "." character correctly. I tried "." but I get an error from NetBeans.

Update:

The TLD is not limited to 2 or 3 characters, and http://www.foo.co.uk/bar should return foo.co.uk.

sjobe
  • Duplicate http://stackoverflow.com/questions/569137/how-to-get-domain-name-from-url – Gumbo May 14 '09 at 13:27
  • 1
    Actually not an exact duplicate, as the other question tries to remove the tld part as well as some second-level parts like ".co.uk". But the only difference is whether you capture that part. I guess he'd want http://www.foo.co.uk/ to give foo.co.uk – MSalters May 14 '09 at 13:38
  • do you know there are four letter TLDs like "info" and "name"? I think you missed that, because you got that "{2,3}" in your regular expression. Secondly, if you want to match the dot, you have to escape it like this "\\." – Tim Büthe May 14 '09 at 13:59
  • Just read that there are even ".museum" and ".travel" tlds. – Tim Büthe May 14 '09 at 14:05
  • Good catch. I would want foo.co.uk/bar to return foo.co.uk. – sjobe May 14 '09 at 14:06
  • 1
    I found this answer very useful: http://stackoverflow.com/a/4820675/1740705. – Philipp Nov 20 '14 at 11:13

8 Answers

10

This is harder than you might imagine. Your example "https://foo.com/bar," is followed by a comma, and a comma is a valid URL character, so it is ambiguous where the URL ends. Here is a great post about some of the troubles:

https://blog.codinghorror.com/the-problem-with-urls/

https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])

The regex above is a good starting point.

Some listings from "Mastering Regular Expressions" on this topic:

http://regex.info/listing.cgi?ed=3&p=207

@sjobe

>>> import re
>>> pattern = r'https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])'
>>> url = re.compile(pattern)
>>> url.match('http://news.google.com/').groups()
('news.google.com/',)
>>> url.match('not a url').groups()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groups'
>>> url.match('http://google.com/').groups()
('google.com/',)
>>> url.match('http://google.com').groups()
('google.com',)

Sorry that the example is in Python rather than Java; it is briefer that way. In Java, any backslash in a regex has to be doubled inside a string literal.
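For reference, here is a rough Java equivalent of the Python session above. It is only a sketch: the class name UrlAfterSchemeDemo and the method afterScheme are made up for illustration. This particular pattern contains no backslashes, so it can be placed in a Java string literal unchanged. Matcher.lookingAt() is used because, like Python's re.match, it anchors the match at the start of the input.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern itself is the one from the answer.
final class UrlAfterSchemeDemo {
    private static final Pattern URL_PATTERN = Pattern.compile(
            "https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|])");

    // Returns capture group 1, or null when the input does not start with a match.
    static String afterScheme(String input) {
        Matcher m = URL_PATTERN.matcher(input);
        return m.lookingAt() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] inputs = {"http://news.google.com/", "http://google.com", "not a url"};
        for (String s : inputs) {
            System.out.println(s + " -> " + afterScheme(s));
        }
    }
}
```

Returning null instead of throwing avoids the AttributeError-style failure seen in the Python session for non-URLs. As in the Python output, the captured group still includes any path, so this does not yet isolate domain.tld.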

Cœur
jsamsa
  • I dont think he meant the comma to be part of the url , he was just separating a list – RC1140 May 14 '09 at 13:27
  • 2
    That's my point, it's ambiguous. How should the regex determine if the comma is part of the URL or not? – jsamsa May 14 '09 at 13:32
  • 1
    Doesn't matter anyway, as he's interested in "domain.tld" part of an http URL. There's no comma in that part. – MSalters May 14 '09 at 13:35
  • I tried that regular expression [added a ')' at the end] https?://([-A-Za-z0-9+&@#/%?=~_()|!:,.;]*[-A-Za-z0-9+&@#/%=~_()|]) but it's not matching any url's. I'm trying "http://news.google.com" and "http://www.google.com" – sjobe May 14 '09 at 14:31
  • Your codinghorror link is broken. I suppose this is it: http://blog.codinghorror.com/the-problem-with-urls/ – John Aug 03 '15 at 22:06
8

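I would use the java.net.URI class to extract the host name, and then use a regex to extract the last two labels of the host name. Note that this simple regex always keeps exactly two labels, so a host such as www.foo.co.uk would yield co.uk.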

import java.net.URI;
import java.net.URISyntaxException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RunIt {

    public static void main(String[] args) throws URISyntaxException {
        Pattern p = Pattern.compile(".*?([^.]+\\.[^.]+)");

        String[] urls = new String[] {
                "https://foo.com/bar",
                "http://www.foo.com#bar",
                "http://bar.foo.com"
        };

        for (String url : urls) {
            URI uri = new URI(url);
            // e.g. uri.getHost() returns "www.foo.com"; it is null for URIs without a host
            String host = uri.getHost();
            if (host == null) {
                continue;
            }
            Matcher m = p.matcher(host);
            if (m.matches()) {
                System.out.println(m.group(1));
            }
        }
    }
}

Prints:

foo.com
foo.com
foo.com
idrosid
6

If the string contains a valid URL then you could use a regex like (Perl quoting):

/^
(?:\w+:\/\/)?
[^:?#\/\s]*?

(
[^.\s]+
\.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
)

(?:[:?#\/]|$)
/xi;

Results:

url: https://foo.com/bar
matched: foo.com
url: http://www.foo.com#bar
matched: foo.com
url: http://bar.foo.com
matched: foo.com
url: ftp://foo.com
matched: foo.com
url: ftp://www.foo.co.uk?bar
matched: foo.co.uk
url: ftp://www.foo.co.uk:8080/bar
matched: foo.co.uk

For Java it would be quoted something like:

"^(?:\\w+://)?[^:?#/\\s]*?([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au|___etc___))(?:[:?#/]|$)"

Of course you'll need to replace the ___etc___ part with the remaining suffixes you want to support.
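As a minimal sketch of how the Java-quoted pattern could be used: the class name SuffixRegexDemo and the method domainOf are hypothetical, and the ___etc___ placeholder is simply left out, so only the suffixes spelled out above are recognised. Pattern.CASE_INSENSITIVE stands in for the Perl /i flag.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class wrapping the regex from the answer.
final class SuffixRegexDemo {
    // Assumption: the ___etc___ placeholder is dropped; extend the alternation as needed.
    private static final Pattern DOMAIN = Pattern.compile(
            "^(?:\\w+://)?[^:?#/\\s]*?"
            + "([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au))"
            + "(?:[:?#/]|$)",
            Pattern.CASE_INSENSITIVE);

    // Returns the captured domain, or null when the pattern does not match.
    static String domainOf(String url) {
        Matcher m = DOMAIN.matcher(url);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String[] urls = {
            "https://foo.com/bar",
            "http://www.foo.com#bar",
            "ftp://www.foo.co.uk:8080/bar"
        };
        for (String url : urls) {
            System.out.println("url: " + url);
            System.out.println("matched: " + domainOf(url));
        }
    }
}
```

The lazy prefix means the regex tries the shortest possible run of leading subdomains first, which is why www.foo.co.uk resolves to foo.co.uk rather than co.uk.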

Example Perl script:

use strict;

my @test = qw(
    https://foo.com/bar
    http://www.foo.com#bar
    http://bar.foo.com
    ftp://foo.com
    ftp://www.foo.co.uk?bar
    ftp://www.foo.co.uk:8080/bar
);

for (@test) {
    print "url: $_\n";

    # Print $1 only when the match succeeds; otherwise it would hold a stale value.
    if (/^
        (?:\w+:\/\/)?
        [^:?#\/\s]*?

        (
        [^.\s]+
        \.(?:[a-z]{2,}|co\.uk|org\.uk|ac\.uk|org\.au|com\.au|___etc___)
        )

        (?:[:?#\/]|$)
        /xi) {
        print "matched: $1\n";
    }
}
Qtax
  • I forgot to double escape the first \w in the beginning of the string, should be "\\w". If you see any other single backslashes escape them. – Qtax May 14 '09 at 16:10
  • I've searched on google for about an hour, and find your answer suits my situation best. Thanks. But there's seems a little problem in java regex String, it should be like this "^(?:\\w+://)?[^:?#/\\s]*?([^.\\s]+\\.(?:[a-z]{2,}|co\\.uk|org\\.uk|ac\\.uk|org\\.au|com\\.au|com.cn|___etc___))(?:[:?#/].*|$)" – SalutonMondo Dec 24 '13 at 03:00
4

new URL(url).getHost()

No regex needed.
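A small sketch of that call in context. The class name HostDemo and the method hostOf are hypothetical. Because the URL(String) constructor is deprecated in recent JDKs, this version goes through URI.create(...).toURL() instead; on older JDKs new URL(url).getHost() behaves the same for these inputs. Keep in mind that getHost() returns the full host name including any subdomain, so a further step is still needed to reduce www.foo.com to foo.com.

```java
import java.net.MalformedURLException;
import java.net.URI;

// Hypothetical demo class around URL.getHost().
final class HostDemo {
    // Returns the host part of the URL, or null when it cannot be parsed.
    static String hostOf(String url) {
        try {
            return URI.create(url).toURL().getHost();
        } catch (IllegalArgumentException | MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostOf("https://foo.com/bar"));
        System.out.println(hostOf("http://www.foo.com#bar"));
    }
}
```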

Amy B
3

You're going to need a list of all possible TLDs and ccTLD suffixes, and then match against them. You have to do this, or else you'll never be able to distinguish between subdomain.dom.com and hello.co.uk.

So, get yourself such a list. I recommend inverting it so you store, for example, uk.co. Then you can extract the host from a URL by taking everything between // and the next / (or the end of the string). Split it at each "." and work backwards, matching the TLD and then one additional level to get the domain.
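The steps above could be sketched roughly as follows. Everything named here is an assumption for illustration: the class SuffixListDemo, the method registrableDomain, and especially the tiny hard-coded SUFFIXES set, which stands in for a complete list. For brevity the suffixes are stored in normal order rather than inverted, and java.net.URI is used to pull out the host instead of manual slicing between // and /.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Hypothetical demo class for the suffix-list approach.
final class SuffixListDemo {
    // Assumption: a deliberately tiny sample; a real list has thousands of entries.
    private static final Set<String> SUFFIXES =
            Set.of("com", "org", "net", "uk", "co.uk", "org.uk", "ac.uk");

    // Returns "label.suffix" for the longest known suffix, or null if none applies.
    static String registrableDomain(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return null;
        }
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        String suffix = "";
        int suffixStart = -1;
        // Work backwards from the last label, remembering the longest known suffix.
        for (int i = labels.length - 1; i >= 0; i--) {
            suffix = suffix.isEmpty() ? labels[i] : labels[i] + "." + suffix;
            if (SUFFIXES.contains(suffix)) {
                suffixStart = i;
            }
        }
        if (suffixStart <= 0) {
            return null; // unknown suffix, or no label in front of the suffix
        }
        // One additional label in front of the suffix gives the domain.
        return String.join(".", Arrays.copyOfRange(labels, suffixStart - 1, labels.length));
    }

    public static void main(String[] args) {
        System.out.println(registrableDomain("http://subdomain.dom.com"));
        System.out.println(registrableDomain("http://hello.co.uk"));
        System.out.println(registrableDomain("http://www.foo.co.uk/bar"));
    }
}
```

Taking the longest matching suffix is what separates the two cases from the answer: dom.com stops at "com", while hello.co.uk prefers "co.uk" over plain "uk".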

Adam Pope
0
    /[^.]*\.[^.]{2,3}(?:\.[^.]{2,3})?$/

Almost there, but it gives the wrong result when the second-level domain has only 2 or 3 characters: for www.foo.com it returns www.foo.com instead of foo.com.

mel
0

This works for me:

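// TextUtils here is presumably android.text.TextUtils, so this compiles on Android only.
public static String getDomain(String url) {
    if (TextUtils.isEmpty(url)) return null;
    url = url.trim();
    // Strip only the leading scheme, not later occurrences inside the URL.
    if (url.startsWith("http://")) {
        url = url.substring("http://".length());
    } else if (url.startsWith("https://")) {
        url = url.substring("https://".length());
    }
    // Everything before the first '/' is the host, including any subdomain or port.
    String[] temp = url.split("/");
    return temp.length > 0 ? temp[0] : null;
}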
tomisyourname
0

Code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class DomainUrlUtils {
    private static final String[] TLD = {"com", "net"}; // top-level domains
    private static final String[] SLD = {"co\\.kr"}; // second-level suffixes, regex-escaped

    public static String getDomainName(String url) {
        Pattern pattern = Pattern.compile("(?<=)[^(\\.|\\/)]\\w+\\.(" + joinTldAndSld("|") + ")$");
        Matcher match = pattern.matcher(url);
        String domain = null;

        if (match.find()) {
            domain = match.group();
        }

        return domain;
    }

    // Joins both suffix lists into one regex alternation using the given delimiter.
    private static String joinTldAndSld(String delimiter) {
        String t = String.join(delimiter, TLD);
        String s = String.join(delimiter, SLD);

        return s.isEmpty() ? t : t + delimiter + s;
    }
}

Test:

// Assumes JUnit 4 is on the classpath.
import org.junit.Assert;
import org.junit.Test;

public class DomainUrlUtilsTest {

    @Test
    public void getDomainName() throws Exception {
        // given
        String[][] domainUrls = {
            {
                "test.com",
                "sub1.test.com",
                "sub1.sub2.test.com",
                "https://sub1.test.com",
                "http://sub1.sub2.test.com"
            },
            {
                "https://domain.com",
                "https://sub.domain.com"
            },
            {
                "http://domain.co.kr",
                "http://sub.domain.co.kr",
                "http://local.sub.domain.co.kr",
                "http://local-test.sub.domain.co.kr",
                "sub.domain.co.kr",
                "domain.co.kr",
                "test.sub.domain.co.kr"
            }
        };

        String[] expectedUrls = {
            "test.com",
            "domain.com",
            "domain.co.kr"
        };

        // when
        // then
        for (int domainIndex = 0; domainIndex < domainUrls.length; domainIndex++) {
            for (String url : domainUrls[domainIndex]) {
                String convertedUrl = DomainUrlUtils.getDomainName(url);

                if (expectedUrls[domainIndex].equals(convertedUrl)) {
                    System.out.println(url + " -> " + convertedUrl);
                } else {
                    Assert.fail("origin Url: " + url + " / converted Url: " + convertedUrl);
                }
            }
        }
    }
}

Results:

test.com -> test.com
sub1.test.com -> test.com
sub1.sub2.test.com -> test.com
https://sub1.test.com -> test.com
http://sub1.sub2.test.com -> test.com
https://domain.com -> domain.com
https://sub.domain.com -> domain.com
http://domain.co.kr -> domain.co.kr
http://sub.domain.co.kr -> domain.co.kr
http://local.sub.domain.co.kr -> domain.co.kr
http://local-test.sub.domain.co.kr -> domain.co.kr
sub.domain.co.kr -> domain.co.kr
Yeongjun Kim