Let's say I am on the webpage https://company.slack.com/messages/@user1/

How can I get the URL of the home page of the company/website, in Java or Python? In this case that is https://slack.com/.

This seems easy for some cases, but I want to generalise it and have been unable to cover all cases, such as Slack, Google Design, etc.

Some similar cases:

https://www.youtube.com/watch?v=deL9VeNjcH8

Expected Output: https://www.youtube.com

https://angel.co/weav-music?utm_source=lb

Expected Output: https://angel.co

https://design.google.com/

Expected Output: https://www.google.com

My code so far (C++, from the ideone link in the comments below):

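#include <cstddef>
#include <iostream>
#include <string>

int main() {
    const std::string s = "https://angel.co/weav-music?utm_source=lb";

    // Look for the third '/', i.e. the one that starts the path after "scheme://host".
    std::size_t pos = std::string::npos;
    int slashes = 0;
    for (std::size_t i = 0; i < s.length(); ++i) {
        if (s[i] == '/' && ++slashes == 3) {
            pos = i;
            break;
        }
    }

    // substr(0, npos) yields the whole string, so a URL without a path is printed unchanged.
    std::cout << s.substr(0, pos) << '\n';
    return 0;
}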

@all: Please see JonasCz's second comment on his own answer; that is what actually helped me.

prashantitis
  • I had successfully implemented the YouTube and angel.co examples by simply detecting the first '/' in the URL after http:// – prashantitis May 30 '16 at 14:39
  • We'd like to see your code. – Klaus D. May 30 '16 at 14:41
  • Sure, will post in some time – prashantitis May 30 '16 at 14:42
  • There's no perfect way to generate the "home page" for a domain in a URL. I could choose any arbitrary subdomain as my "home page" over the conventions of "www.mycompany.com" or "mycompany.com". Your best bet is probably what is being suggested by JonasCz below. As an example, say the URL is "http://support.arbitrarydomain.org/users/14". What is your expected output from that? – Marc Talbot May 30 '16 at 14:46
  • @Marc: My expected output is "arbitrarydomain.org" – prashantitis May 30 '16 at 14:48
  • @KlausD. http://ideone.com/DGJRt2 Now, can you suggest something? – prashantitis May 30 '16 at 14:56
  • @KlausD. Please don't make unnecessary changes to my code. If you have something important to guide me on, I will be thankful to you. Don't do rubbish things to my code – prashantitis May 30 '16 at 15:01

1 Answer

You can use something like this:

URL aURL = new URL("https://company.slack.com/messages/@user1/");
System.out.println(aURL.getProtocol() + "://" + aURL.getHost());

Which prints:

https://company.slack.com

This works for other URLs too. See the docs for more details.
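
As a minimal sketch of the same idea applied to the URLs from the question, using only the standard library: the class name `UrlOrigin` and the method `origin` are made up for illustration, and `java.net.URI` is swapped in for `URL` here so that no checked exception has to be handled.

```java
import java.net.URI;

final class UrlOrigin {
    // Returns "scheme://host" for an absolute URL, dropping path, query and fragment.
    // URI.create throws IllegalArgumentException if the string is not a valid URI.
    static String origin(String url) {
        URI uri = URI.create(url);
        if (uri.getScheme() == null || uri.getHost() == null) {
            throw new IllegalArgumentException("Not an absolute URL with a host: " + url);
        }
        return uri.getScheme() + "://" + uri.getHost();
    }

    public static void main(String[] args) {
        System.out.println(origin("https://www.youtube.com/watch?v=deL9VeNjcH8")); // https://www.youtube.com
        System.out.println(origin("https://angel.co/weav-music?utm_source=lb"));   // https://angel.co
        System.out.println(origin("https://design.google.com/"));                  // https://design.google.com
    }
}
```

Note that the subdomain is kept as-is, so this alone does not turn design.google.com into google.com.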


If you want only the main domain, without the subdomain (i.e. only slack.com), you can use Guava's InternetDomainName, e.g. like this:

InternetDomainName.from("company.slack.com").topPrivateDomain().name();

The above will return slack.com.

The above method call works with older Guava versions. For Guava 19.0, use toString() instead of name().


To be complete, the whole code, in your case, would look like this:

URL aURL = new URL("https://company.slack.com/messages/@user1/");
InternetDomainName.from(aURL.getHost()).topPrivateDomain().name();
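
To illustrate roughly what stripping the subdomain involves, here is a simplified, standard-library-only sketch. It is not Guava's implementation: the class `RegistrableDomain`, its methods, and the tiny hard-coded `SUFFIXES` set are all hypothetical stand-ins, whereas a real solution needs the full Public Suffix List (which is what Guava relies on).

```java
import java.net.URI;
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

final class RegistrableDomain {
    // Assumption: a deliberately tiny stand-in for the Public Suffix List.
    private static final Set<String> SUFFIXES = Set.of("com", "co", "org", "co.uk");

    // Returns the longest known suffix plus the one label directly in front of it.
    static String registrableDomain(String host) {
        String[] labels = host.toLowerCase(Locale.ROOT).split("\\.");
        // Starting at i = 1 tries the longest candidate suffixes first.
        for (int i = 1; i < labels.length; i++) {
            String suffix = String.join(".", Arrays.copyOfRange(labels, i, labels.length));
            if (SUFFIXES.contains(suffix)) {
                return labels[i - 1] + "." + suffix;
            }
        }
        throw new IllegalArgumentException("No known suffix in host: " + host);
    }

    // Combines the scheme of the URL with the registrable domain of its host.
    static String homePage(String url) {
        URI uri = URI.create(url);
        return uri.getScheme() + "://" + registrableDomain(uri.getHost());
    }

    public static void main(String[] args) {
        System.out.println(homePage("https://company.slack.com/messages/@user1/")); // https://slack.com
        System.out.println(homePage("https://design.google.com/"));                 // https://google.com
        System.out.println(registrableDomain("www.bbc.co.uk"));                     // bbc.co.uk
    }
}
```

The `co.uk` entry shows why simply keeping the last two labels is not enough, and why only the host, not the full URL, should be passed to the domain lookup.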
Jonas Czech
  • And also split the result on dots and drop the subdomain if one exists. – Ali Sheikhpour May 30 '16 at 14:40
  • No, but I want output to be slack.com not company.slack.com – prashantitis May 30 '16 at 14:40
  • company in company.slack.com is like a user in the domain – prashantitis May 30 '16 at 14:41
  • @Prashant But for YouTube you want to have **www**.youtube.com? How do you decide whether to keep or remove the subdomain (first) part for each URL? I guess you could use a regex. – Jonas Czech May 30 '16 at 14:42
  • @JonasCz: I always want to get the home page URL of the company/product. So for YouTube as well, the home page is YouTube.com, and I would not want the video ID – prashantitis May 30 '16 at 14:45
  • @Prashant Have a look at this: [Get domain without subdomain from a URL](http://stackoverflow.com/q/3199862). You'll need the Guava library. Or use a regex. Check this also, it may work for you: [Extract main domain name from a given url](http://stackoverflow.com/a/34466483) – Jonas Czech May 30 '16 at 14:46
  • Okay, Let me check – prashantitis May 30 '16 at 14:49
  • @JonasCz: Thanks, it seems to work in most of the cases, thanks again – prashantitis May 30 '16 at 15:08
  • @JonasCz: Please include in your answer both links from your comment, which actually helped. – prashantitis May 30 '16 at 15:13
  • @JonasCz: I am getting this error for the URL https://design.google.com/resources/: Exception in thread "main" java.lang.IllegalArgumentException: Not a valid domain name: 'https://design.google.com/resources/' at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115) at com.google.common.net.InternetDomainName.<init>(InternetDomainName.java:154) at com.google.common.net.InternetDomainName.from(InternetDomainName.java:225) at jython.Java_python.main(Java_python.java:10) Why? – prashantitis May 31 '16 at 07:01
  • @JonasCz: Any idea? – prashantitis May 31 '16 at 07:02
  • 1
    The problem is that you need to give `InternetDomainName.from` only the domain, ie. `design.google.com`, _not_ `design.google.com/resources/`. You should be using something like this: `URL aURL = new URL("https://company.slack.com/messages/@user1/"); InternetDomainName.from(aURL.getHost()).topPrivateDomain().name();` @Prashant , what does your code look like ? – Jonas Czech May 31 '16 at 07:06
  • Ohh my bad, you told yesterday only. Thanks – prashantitis May 31 '16 at 07:09
  • @JonasCz: topPrivateDomain().name(); In this method call .name() is unrecognised for Guava 19.0 while it was working fine for 16.0 I tried checking alternative for .name() in new version here https://github.com/google/guava/wiki/Release19 but didn't find anything significiant, any other resource where i could find this? – prashantitis May 31 '16 at 08:23
  • this also has no info http://google.github.io/guava/releases/snapshot/api/docs/com/google/common/net/InternetDomainName.html#topPrivateDomain() – prashantitis May 31 '16 at 08:24