13

I used the following to extract the domain from a URL (the strings added to `cases` are my test cases):

String regex = "^(ww[a-zA-Z0-9-]{0,}\\.)";
ArrayList<String> cases = new ArrayList<String>();
cases.add("www.google.com");
cases.add("ww.socialrating.it");
cases.add("www-01.hopperspot.com");
cases.add("wwwsupernatural-brasil.blogspot.com");
cases.add("xtop10.net");
cases.add("zoyanailpolish.blogspot.com");

for (String t : cases) {  
    String res = t.replaceAll(regex, "");  
}

I can get the following results:

google.com
socialrating.it
hopperspot.com
blogspot.com
xtop10.net
zoyanailpolish.blogspot.com

The first five cases are good. The last one is not: I want blogspot.com, but I get zoyanailpolish.blogspot.com. What am I doing wrong?

James P.
  • 19,313
  • 27
  • 97
  • 155
chnet
  • 1,993
  • 9
  • 36
  • 51
  • 1
    It looks like the regexes in [this post](http://stackoverflow.com/questions/6433799/regular-expression-to-remove-subdomain-from-root-domain-in-list-notepad-or-gv) might help you =) – Josh Darnell Aug 27 '11 at 20:59
  • 2
    Then don’t put those silly woublewoos in your pattern. If all you want is to `s/^[^.]+\.//`, then I suggest you do that. – tchrist Aug 27 '11 at 20:59
  • 2
    Not clear what you want, though. Are you trying to remove the first component _always_, or all components but the one just before the TLD, or the first one only when it starts with a "ww" or ....? – Ray Toal Aug 27 '11 at 21:01
  • It is not only about replacing 'ww'. I added a new example above. For example, for "xtop10.net", what I want is "xtop10.net", while "zoyanailpolish.blogspot.com" should be "blogspot.com" – chnet Aug 27 '11 at 21:08
  • to @tchrist, your suggestion is applicable in vim, I think. But what I want is different. I do not just want to remove the leading "ww" part. In some cases, for example "xtop10.net", what I want is "xtop10.net", but your method would return "net". – chnet Aug 27 '11 at 21:10
  • In other words, you want the main domain and not subdomains. Correct? – James P. Aug 27 '11 at 21:13
  • You still haven’t explained what you want. Now it looks like you should just split on a dot and keep the last two elements returned. – tchrist Aug 27 '11 at 21:13
  • 8
    How about domains like `example.com.tw` and `example.co.uk`? – BalusC Aug 27 '11 at 21:18
  • to @James Poulson, Right. I want the main domain and not the subdomains – chnet Aug 27 '11 at 21:30
  • to @BalusC, in your cases, I prefer to return the name without any changes. That is, it should return "example.com.tw" and "example.co.uk" – chnet Aug 27 '11 at 21:31
  • Don't forget that '-' and other characters are allowed in the URL. (Think outside of ASCII) – user823959 Aug 27 '11 at 21:35
  • Related: http://stackoverflow.com/questions/3199862/get-domain-without-subdomain-from-a-url http://stackoverflow.com/questions/1923815/get-the-second-level-domain-of-an-url-java http://stackoverflow.com/questions/3199343/regex-to-match-domain-cctld – James P. Aug 27 '11 at 21:38
  • 3
    Don't do it the hard regex way then. Using regex for this kind of problem is ridiculous. Split on dot into an array. Count the parts. Check if second last part isn't <=3 chars and/or starts with `co` (there are probably other ccTLDs you'd like to match). Grab the last two or three items depending on the outcome and join them together on the dot again. – BalusC Aug 27 '11 at 21:38
  • to @BalusC, right. I agree with you. What do you mean by the second last part not being <=3? Could you explain more? – chnet Aug 27 '11 at 21:49
  • BalusC is probably referring to the number of characters in that part of the URL. Regex is cool, but you should probably drop it in favour of another tool if the expression becomes overly complex. – James P. Aug 27 '11 at 21:51
  • How do you determine whether something is a “main domain” or not? `foo.bar.com` and `foo.bar.co.uk` and `foo.bar.pvt.k12.wy.us` don’t look anything alike. How will you decide to only drop the `foo` but stop at the `bar` in each one, since you get a differing number of dots back? – tchrist Aug 27 '11 at 21:55
  • to @tchrist, right. I may not consider so many possibilities. – chnet Aug 27 '11 at 22:11
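
The split-and-count idea from the comments above could be sketched roughly as follows. This is only an illustration: the class and method names are hypothetical, and the "second-to-last label has at most 3 characters" test is a heuristic, not a rule, so it misjudges hosts such as "www.ibm.com".

```java
import java.util.Arrays;

class SplitHeuristicSketch {

    // Heuristic from the comments: a short second-to-last label (at most
    // 3 characters, e.g. "co" or "com") is assumed to belong to a two-part
    // suffix such as "co.uk". This is an assumption and will be wrong for
    // hosts whose registered name is itself short (e.g. "www.ibm.com").
    static String mainDomain(String host) {
        String[] parts = host.split("\\.");
        if (parts.length <= 2) {
            return host; // already a bare domain such as "xtop10.net"
        }
        boolean twoPartSuffix = parts[parts.length - 2].length() <= 3;
        int keep = twoPartSuffix ? 3 : 2;
        // Join the last `keep` labels back together on the dot.
        return String.join(".",
                Arrays.copyOfRange(parts, parts.length - keep, parts.length));
    }

    public static void main(String[] args) {
        System.out.println(mainDomain("zoyanailpolish.blogspot.com")); // blogspot.com
        System.out.println(mainDomain("xtop10.net"));                  // xtop10.net
        System.out.println(mainDomain("foo.example.co.uk"));           // example.co.uk
    }
}
```

Names like foo.bar.pvt.k12.wy.us are out of reach for this heuristic, which is why a maintained suffix list is usually the safer route.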

7 Answers

14

Using the Guava library, we can easily get the domain name:

InternetDomainName.from(tld).topPrivateDomain()

Refer to the API links for more details:

https://google.github.io/guava/releases/14.0/api/docs/

http://docs.guava-libraries.googlecode.com/git/javadoc/com/google/common/net/InternetDomainName.html

evandrix
  • 6,041
  • 4
  • 27
  • 38
Satya
  • 237
  • 4
  • 8
8

Obtaining the host through a regex is complicated or even impossible, because TLDs don't follow simple rules: they are managed by ICANN and change over time.

You should instead use the functionality provided by the Java library, like this:

URL myUrl = new URL(urlString);
myUrl.getHost();
user823959
  • 782
  • 2
  • 9
  • 30
  • 1
    Well, yes, but he already has all that. He wants to sometimes shift off some number of leading elements of the little-endian hostname, although he hasn’t told us how to know how many those might be. He seems to think we can eyeball domainnames and know whether the part we have is the “main” part already or not. I don’t think that’s possible. – tchrist Aug 27 '11 at 21:59
  • 9
    For the record, this does not answer the question. This returns whatever domain name was given including the subdomain. The OP was looking for the "root" domain name without subdomains, so if given "www.google.com" it should return "google.com". This method returns "www.google.com". This does work nicely if you are just trying to get the domain from a URL with a path and/or query string. – nerdherd Sep 08 '14 at 03:50
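
To make the point from the comments concrete, here is a small sketch of what host extraction gives you. It uses java.net.URI rather than URL; the helper name hostOf and the default "http://" scheme are assumptions made for this example. Note that the test strings in the question have no scheme, and without one the parser does not recognise a host at all.

```java
import java.net.URI;

class HostSketch {

    // Returns the host part of the input. A scheme is prepended when none is
    // present, because "www.google.com" on its own is parsed as a path and
    // getHost() would return null. The "http://" default is an assumption.
    static String hostOf(String input) {
        String withScheme = input.contains("://") ? input : "http://" + input;
        return URI.create(withScheme).getHost();
    }

    public static void main(String[] args) {
        // The path is dropped, but the subdomain is still there.
        System.out.println(hostOf("http://www.zoyanailpolish.blogspot.com/some/long/link"));
        System.out.println(hostOf("www.google.com"));
    }
}
```

So this step only isolates the host name; removing the subdomain still has to be done separately.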
4

This is 2013 and the solution I found is straightforward:

System.out.println(InternetDomainName.fromLenient(uriHost).topPrivateDomain().name());
akshayb
  • 1,219
  • 2
  • 18
  • 44
3

It is much simpler:

    try {
        String domainName = new URL("http://www.zoyanailpolish.blogspot.com/some/long/link").getHost();

        String[] levels = domainName.split("\\.");
        if (levels.length > 1) {
            domainName = levels[levels.length - 2] + "." + levels[levels.length - 1];
        }

        // now the value of the domainName variable is blogspot.com
    } catch (Exception e) {}
Ayaz Alifov
  • 8,334
  • 4
  • 61
  • 56
2

As suggested by BalusC and others, the most practical solution would be to get a list of TLDs (see this list), save them to a file, load them and then determine which TLD is used by a given URL string. From there you could construct the main domain name as follows:

    String url = "zoyanailpolish.blogspot.com";

    String tld = findTLD( url ); // To be implemented. Add to helper class ?

    url = url.replace( "." + tld,"");  

    int pos = url.lastIndexOf('.');

    String mainDomain = "";

    if (pos > 0 && pos < url.length() - 1) {
        mainDomain = url.substring(pos + 1) + "." + tld;
    } else {
        // No subdomain present (e.g. "xtop10.net"): keep the name as it is
        mainDomain = url + "." + tld;
    }

The implementation details are left up to you.
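
As one possible starting point, findTLD could look something like the sketch below. Everything here is an assumption for illustration: the class name, the tiny hard-coded SUFFIXES set (a real list would be loaded from a file and be much longer), and the choice to return the input unchanged when no suffix is known.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

class SuffixListSketch {

    // Tiny illustrative sample only; a real list would be read from a file.
    static final Set<String> SUFFIXES = new HashSet<>(
            Arrays.asList("com", "net", "it", "uk", "co.uk", "tw", "com.tw"));

    // Hypothetical findTLD: returns the longest listed suffix of host,
    // or null when none matches. Candidates are tried from longest to
    // shortest by cutting one leading label at a time.
    static String findTLD(String host) {
        int dot = -1;
        do {
            String candidate = host.substring(dot + 1);
            if (SUFFIXES.contains(candidate)) {
                return candidate;
            }
            dot = host.indexOf('.', dot + 1);
        } while (dot >= 0);
        return null;
    }

    // Keeps the single label directly in front of the suffix.
    static String mainDomain(String host) {
        String tld = findTLD(host);
        if (tld == null || tld.length() == host.length()) {
            return host; // unknown suffix, or nothing in front of it
        }
        String rest = host.substring(0, host.length() - tld.length() - 1);
        String label = rest.substring(rest.lastIndexOf('.') + 1);
        return label + "." + tld;
    }

    public static void main(String[] args) {
        System.out.println(mainDomain("zoyanailpolish.blogspot.com")); // blogspot.com
        System.out.println(mainDomain("xtop10.net"));                  // xtop10.net
        System.out.println(mainDomain("example.co.uk"));               // example.co.uk
    }
}
```

The sketch compares strings as given, so mixed-case input would need to be lower-cased first.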

James P.
  • 19,313
  • 27
  • 97
  • 155
  • to @James Poulson, Thanks. Sorry, what is the output of your example? I do not quite understand. It removes the TLD first, then adds it back later. So, what is the final output? – chnet Aug 27 '11 at 22:19
  • There is no output as this is pseudocode. A text file listing the TLDs needs to be created (TLDs can be found on the Wikipedia link), these need to be read into a data structure and the findTLD method needs to be filled in. If done correctly it should do what you want which in this case would give blogspot.com. – James P. Aug 27 '11 at 22:21
  • to @James Poulson, right. Assume I get tld, the pseudo example would remove `.com` from url. Then, it moves to the dot position before `blogspot`. In this way, you can remove `zoyanailpolish `. – chnet Aug 27 '11 at 22:23
  • That's the idea :) . If you encounter any issues getting it to work let me know. – James P. Aug 27 '11 at 22:26
  • 1
    Probably this is not a good idea anymore as there are thousands of new TLD's coming in the next years. – andreas Jul 18 '14 at 10:55
1
InternetDomainName.from("test.blogspot.com").topPrivateDomain() -> test.blogspot.com

This works better in my case:

InternetDomainName.from("test.blogspot.com").topDomainUnderRegistrySuffix() -> blogspot.com

Details: https://github.com/google/guava/wiki/InternetDomainNameExplained

Tinus Tate
  • 2,237
  • 2
  • 12
  • 33
1

The reason you are seeing zoyanailpolish.blogspot.com is that your regex only matches strings that start with 'ww'. What you are asking is that, in addition to stripping prefixes that start with 'ww', it should also work for a string starting with 'zoyanailpolish' (?). In that case, use the regex String regex = "^((ww|z|a)[a-zA-Z0-9-]{0,}\\.)"; This will remove any leading label that starts with 'ww', 'z' or 'a'. Customize it based on what you need exactly.
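
A quick way to see this behaviour is to run the pattern from the question against a few of its test strings. The class and method names below are made up for the demonstration; the pattern itself is copied from the question.

```java
class PrefixRegexSketch {

    // Pattern from the question: anchored at the start, it matches "ww",
    // then any run of letters, digits or hyphens, then one dot.
    static final String REGEX = "^(ww[a-zA-Z0-9-]{0,}\\.)";

    static String strip(String host) {
        return host.replaceAll(REGEX, "");
    }

    public static void main(String[] args) {
        System.out.println(strip("www-01.hopperspot.com"));       // hopperspot.com
        System.out.println(strip("zoyanailpolish.blogspot.com")); // unchanged: no leading "ww"
    }
}
```

Because the first label must begin with "ww" for the pattern to match at all, any other leading label is left in place.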

Bhaskar
  • 7,443
  • 5
  • 39
  • 51
  • Right. In addition to removing all strings that start with 'ww', it should also work for a string starting with others (not only 'zoyanailpolish'). For example, "xyz.blogspot.com". – chnet Aug 27 '11 at 21:29
  • 1
    but as you showed for `xtop10.net` it does not remove `xtop10` - so that means for certain strings it does not remove - right ? The question is - is it a custom list of string you want not to remove or there is a rule based on which this works ? – Bhaskar Aug 27 '11 at 21:34
  • to @Bhaskar, It depends. For example, `xtop10.net`, it is a website. It is a domain name. I do not need to do any changes. While for `zoyanailpolish.blogspot.com`, the domain name should be `blogspot.com`. So, I need to remove `zoyanailpolish`. – chnet Aug 27 '11 at 21:37
  • It is very clear what @chnet wants: "Right. I want the main domain and not the subdomains" – James P. Aug 27 '11 at 21:41
  • 1
    @James It is? Then he should have said that, now shouldn’t he? I hope he has fun telling that `.com`, `.co.uk` and `pvt.k12.wy.us` all count as the same sort of thing. – tchrist Aug 27 '11 at 21:44
  • @chnet : Honestly , if getting the domain name is your concern , then using regex is not the correct approach IMO. There are other techniques in Java for parsing urls and extracting domain names. – Bhaskar Aug 27 '11 at 21:47
  • @tchrist: Ninth comment down below the question. I guessed what was needed from the last url string. The chances are that a Regex would require a horrendous expression to account for all possibilities so alternative solutions have been posted. – James P. Aug 27 '11 at 21:48