0

I need to get the domain name of a given URL without the top-level domain suffix.

For example:

  • Url :www.google.com then output=google

  • Url :http://www.google.co.uk/path1/path2 then output=google

  • Url :http://google.co.uk/path1/path2 then output=google

  • Url :http://google.com then output=google

  • Url :http://google.co.in then output=google

  • Url :http://mail.google.co.in then output=google

For that, I tried this code:

 var uri = new Uri("http://www.google.co.uk/path1/path2");
 var sURL = uri.Host;
 string[] aa = sURL.Split('.');
 MessageBox.Show(aa[1]);

But I don't get the correct output every time (especially for URLs without www). I searched on Google and tried to solve it, but nothing helped. I also looked at the related question on Stack Overflow, but it doesn't work for me.

giammin
Archit
  • 1
    Your terminology is incorrect. `google.co.uk` is the host name. There is no term for the `google` part, as far as I know. – John Saunders Oct 16 '13 at 09:04
  • @JohnSaunders OK, it may be my mistake, but I want the same output as I wrote. – Archit Oct 16 '13 at 09:08
  • @cbeckner Sorry to say, but you did not read this question carefully. What do I want as output, and what is the answer to 'Top level domain from URL in C#'? – Archit Oct 16 '13 at 09:23
  • 1
    My apologies. However, using the answer to that question you will be able to split the output and use the first element in the array to get what you need. – cbeckner Oct 16 '13 at 09:28
  • @cbeckner I tried it more and also searched on Google, but I couldn't solve it, so I put the question here. – Archit Oct 16 '13 at 09:30
  • 3
    I don't think this problem is practically solvable, as what you are essentially asking is: can you please help me find a random string inside a string. I have answered a question similar to this before. What you want, "google", is a non-existent entity that you are referring to as a domain. In actual fact the domain is 'google.co.uk', 'google.com', etc. Because of the way URLs work (i.e. subdomains such as mail.google.com), you cannot reliably split the string. The solution posted uses a hard-coded list of TLDs to find the 'domain', which IMHO is unmaintainable as TLDs are continuously added. – Nicholas King Oct 16 '13 at 09:30
  • Since this is a problem you can't really solve (at least not the way you'd like to do it; see Nicholas' comment as well as mine under Ahmed's answer), would you mind elaborating a bit more on what you're trying to achieve in the long run? I have a feeling that there's a much more elegant way to do it, ignoring the whole domain name/TLD issue. – Mario Oct 16 '13 at 10:01
  • I think you should change your approach and use the full domain – giammin Oct 16 '13 at 11:10
  • The *real question* is: What would you need this for? And the answer will be something along the lines: you need to do something else, to rework the underlying motivation. – Kuba hasn't forgotten Monica Oct 16 '13 at 13:43

4 Answers

1

This answer is just for completeness, because I think it would be a valid approach if it weren't so complicated and didn't essentially abuse the DNS system. Note that this isn't 100% foolproof either (and requires access to a DNS server).

  • Extract the full domain name of the URL. Let's take http://somepart.subdomain.example.org/some/files as an example. We'd get somepart.subdomain.example.org.
  • Split the domain name at dots: {"somepart", "subdomain", "example", "org"}.
  • Take the rightmost part (org) and see whether it is a known (top level) domain name.
    • If it is, the next part to the left is the domain name you're looking for.
    • If it isn't, try to retrieve an IP for this.
    • If there's an IP for it, the last added part is your domain name.
    • If there isn't an IP either, add the next part to the left and repeat these checks (in this example you'd now test for example.org).
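The steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not a tested production routine: the names `DnsProbe`, `FindName` and `ResolvesViaDns` are made up for this sketch, the set of known TLDs is supplied by the caller, and the DNS check is passed in as a delegate so the walk can be tried without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Net.Sockets;

public static class DnsProbe
{
    // Hypothetical default check: true when a DNS lookup returns at least one address.
    public static bool ResolvesViaDns(string host)
    {
        try { return Dns.GetHostAddresses(host).Length > 0; }
        catch (SocketException) { return false; }   // lookup failed
        catch (ArgumentException) { return false; } // host string not acceptable
    }

    // Walks the host from right to left, growing the candidate by one label per step.
    // Returns null when nothing matches.
    public static string FindName(string host, ISet<string> knownTlds, Func<string, bool> resolves)
    {
        string[] labels = host.Split('.');
        string candidate = null;
        for (int i = labels.Length - 1; i >= 0; i--)
        {
            candidate = candidate == null ? labels[i] : labels[i] + "." + candidate;
            if (knownTlds.Contains(candidate))
            {
                // Known TLD: the next label to the left (if any) is the name.
                return i > 0 ? labels[i - 1] : null;
            }
            if (resolves(candidate))
            {
                // The candidate has an IP: the label added last is the name.
                return labels[i];
            }
        }
        return null;
    }
}
```

With a real lookup this would be called as `DnsProbe.FindName(uri.Host, knownTlds, DnsProbe.ResolvesViaDns)`. Note that a single-label entry such as `uk` in the known set would make `google.co.uk` yield `co`, which is one of the ways the approach is not foolproof.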
Mario
  • funny mario, I was thinking something along the same line. But again the practicality of coding such a solution would put me off, and like you say the solution is not 100% foolproof. +1 though for the most sensible solution on the answer :-) – Nicholas King Oct 16 '13 at 10:30
1

The right answer to your question is: No you can't.

The only solution that can almost achieve it, in a dirty and hard-to-maintain way, is to have a list of all existing top-level domains (you can find an incomplete one in this SO answer):

// Illustrative subset only; the real list of suffixes is very long.
// Longer suffixes should come first so that a more specific one is matched before a shorter one.
var allTld = new[] { ".co.uk", ".com", ".it" };
string urlToCheck = "www.google.com"; // also try: sports-ak.espn.go.com/nfl/ or http://www.google.co.uk/path1/path2
if (!urlToCheck.StartsWith("http", StringComparison.OrdinalIgnoreCase))
{
    urlToCheck = string.Concat("http://", urlToCheck);
}
var uri = new Uri(urlToCheck);

string domain = string.Empty;
foreach (var tld in allTld)
{
    // The suffix must be at the end of the host, not merely somewhere inside it.
    if (!uri.Host.EndsWith(tld, StringComparison.OrdinalIgnoreCase))
    {
        continue;
    }
    domain = uri.Host.Substring(0, uri.Host.Length - tld.Length);
    // Keep only the label directly left of the suffix (drops "www", "mail", ...).
    int index = domain.LastIndexOf(".", StringComparison.Ordinal);
    if (index > -1)
    {
        domain = domain.Substring(index + 1);
    }
    break; // stop at the first matching suffix, whether or not a subdomain was present
}
if (string.IsNullOrEmpty(domain))
{
    throw new Exception(string.Format("TLD of url {0} is missing", urlToCheck));
}

IMHO You should ask yourself: Do I really need the name without the TLD?
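As a rough sketch of the same list-based idea, the host can also be compared label by label against a set of suffixes, which picks the longest matching suffix without depending on list order. Everything here is an assumption for illustration: `SuffixLookup`, `NameBeforeSuffix` and the tiny `Suffixes` set are hypothetical, and the set is nowhere near complete, so the maintenance problem stays exactly the same.

```csharp
using System;
using System.Collections.Generic;

public static class SuffixLookup
{
    // Hypothetical, deliberately tiny suffix set; not a real or complete list.
    private static readonly HashSet<string> Suffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "com", "it", "uk", "co.uk", "in", "co.in"
        };

    // Returns the label directly left of the longest known suffix,
    // or null when no suffix in the set matches.
    public static string NameBeforeSuffix(string host)
    {
        string[] labels = host.Split('.');
        // Start at 1 so at least one label remains left of the suffix;
        // the smallest i corresponds to the longest suffix.
        for (int i = 1; i < labels.Length; i++)
        {
            string suffix = string.Join(".", labels, i, labels.Length - i);
            if (Suffixes.Contains(suffix))
            {
                return labels[i - 1];
            }
        }
        return null;
    }
}
```

Called as `SuffixLookup.NameBeforeSuffix(uri.Host)`, this only matches suffixes at the end of the host, so a host like `my.co.uk.domain.com` is treated as ending in `com`.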

giammin
  • -1 this does not solve the problem as outlined below. Original : sports-ak.espn.go.com/nfl URL : sports-ak.espn.go.com/nfl Domain : sports-ak.espn.go.com Domain Part : sports-ak should be go – Nicholas King Oct 16 '13 at 10:32
  • @NicholasKing it WORKS for sports-ak.espn.go.com. It returns "go" – giammin Oct 16 '13 at 10:37
  • Sorry, you are right, it does. I didn't see you had an amended version of @Ahmed's code. But this solution still depends on maintaining a list of all TLDs in existence, which IMHO falls under the category of not practically possible. – Nicholas King Oct 16 '13 at 10:45
  • @NicholasKing Yes, I agree with you. It is not maintainable, but it is the only solution. Even a person can take the wrong name from a URL... The question is: do I really need the name without the TLD? Does this SO question make any sense? – giammin Oct 16 '13 at 10:50
  • No, I don't think it does. I think the OP needs to go away and think about what it is they are trying to achieve, as at the moment it is difficult to provide an answer that is practical. I believe that the OP has trivialised the makeup of a URL, which is actually very complicated and does not really conform to any pattern. I am happy to remove my -1 if you caveat your answer with the fact that the solution provided is not really practical. – Nicholas King Oct 16 '13 at 10:54
0

This is the best you can get. It's not a maintainable solution, and it is not a "fast" solution (GetDomain.GetDomainFromUrl should be optimized).

  • Use GetDomain.GetDomainFromUrl
  • In TldPatterns.EXACT add "co.uk" (I don't know why it doesn't exist in the first place)
  • Some other minor string manipulations

This is what it should look like:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

        class TldPatterns
        {
            private TldPatterns()
            {
                // Prevent instantiation.
            }

            /**
             * If a hostname is contained in this set, it is a TLD.
             */
            static public string[] EXACT = new string[] {
             "gov.uk",
             "mil.uk",
             "co.uk",
             //...

    public class Program
    {

        static void Main(string[] args)
        {
            string[] urls = new[] {"www.google.com", "http://www.google.co.uk/path1/path2 ", "http://google.co.uk/path1/path2 ",
            "http://google.com", "http://google.co.in"};
            foreach (var item in urls)
            {
                string url = item;
                if (!Regex.IsMatch(item, "^\\w+://"))
                    url = "http://" + item;
                var domain = GetDomain.GetDomainFromUrl(url);
                Console.WriteLine("Original    : " + item);
                Console.WriteLine("URL         : " + url);
                Console.WriteLine("Domain      : " + domain);
                Console.WriteLine("Domain Part : " + domain.Substring(0, domain.IndexOf('.')));
                Console.WriteLine();
            }
        }
    }

Outputs:

Original    : www.google.com
URL         : http://www.google.com
Domain      : google.com
Domain Part : google

Original    : http://www.google.co.uk/path1/path2
URL         : http://www.google.co.uk/path1/path2
Domain      : google.co.uk
Domain Part : google

Original    : http://google.co.uk/path1/path2
URL         : http://google.co.uk/path1/path2
Domain      : google.co.uk
Domain Part : google

Original    : http://google.com
URL         : http://google.com
Domain      : google.com
Domain Part : google

Original    : http://google.co.in
URL         : http://google.co.in
Domain      : google.co.in
Domain Part : google
Ahmed KRAIEM
  • Yes, you are right, but I gave only example URLs, not the exact URLs; they change every time. – Archit Oct 16 '13 at 09:48
  • I think that this would do in most cases. – Ahmed KRAIEM Oct 16 '13 at 09:51
  • This does not work with the url http://sports-ak.espn.go.com/nfl/ Original : http://sports-ak.espn.go.com/nfl/ URL : http://sports-ak.espn.go.com/nfl/ Domain : sports-ak.espn.go.com Domain Part : sports-ak – Nicholas King Oct 16 '13 at 09:51
  • And what for this one 'http://sports-ak.espn.go.com/nfl/' ? – Archit Oct 16 '13 at 09:52
  • @Archit this method will not match when you have a URL with multiple subdomains; huge sites such as Google and ESPN maintain this type of structure. – Nicholas King Oct 16 '13 at 09:53
  • @NicholasKing yes it doesn't. Because one does not simply get everything from a string array. – Ahmed KRAIEM Oct 16 '13 at 09:55
  • @NicholasKing yes you are right and i also know it that's why i am here to solve it. – Archit Oct 16 '13 at 09:55
  • 1
    There's no 100% dynamic way, because you can't really determine whether the TLD consists of one or two parts (unless you specifically hardcode/name all you can think of; but even that's not perfect). The code in this example will work with any URL that is either no sub domain or that only uses a known sub domain (like `www`). – Mario Oct 16 '13 at 09:55
  • @Archit Thats what i am saying, unless you are trying to solve a problem that within the realms of practicality is unsolvable. – Nicholas King Oct 16 '13 at 09:56
  • To elaborate a bit more: The URL `http://name/something` should return `name`, but `http://mario.name/something` should return `mario` instead. There's no 100% foolproof way to do it, so the best way would be to stick to the complete domain name (don't try to cut off the TLD nor the subdomain). If it's for some grouping (like determine all referral URLs coming from Google), try to use pattern matching instead. – Mario Oct 16 '13 at 09:57
  • @Mario that is what I have been trying to explain to Archit. The problem is not really a practically solvable problem. :-) – Nicholas King Oct 16 '13 at 09:57
  • @Archit only a psychic program could solve this problem. – Ahmed KRAIEM Oct 16 '13 at 09:57
  • @Mario `http://mario.name/something` it should return 'name'. – Archit Oct 16 '13 at 09:58
  • 1
    @Archit but name in this instance would be the equivalent of returning .com – Nicholas King Oct 16 '13 at 09:59
  • 2
    Now even a psychic program couldn't solve it. – Ahmed KRAIEM Oct 16 '13 at 09:59
  • 1
    @Archit That doesn't make any sense. `name` is a TLD, not a domain by itself? Why would `mario.name` return `name` and `mario.com` `mario`? – Mario Oct 16 '13 at 09:59
  • @Archit I think you need to go away and read about domains and how they work, rethink your problem, and then come back and ask. – Nicholas King Oct 16 '13 at 09:59
  • @NicholasKing everything you say about domains is right, but my problem is to get `name` from `http://mario.name/something`. – Archit Oct 16 '13 at 10:02
  • Ahmed's code should return `mario` if you've added `name` as a known TLD to the `EXACT` array. – Mario Oct 16 '13 at 10:06
  • @Archit YOU CAN'T! Your concept of a name is something that does not exist, has no pattern and can appear anywhere in a URL string. The problem is unsolvable. In the valid URL http://sports-ak.espn.go.com/nfl/ the 'name' is go; in the URL metropolitan.police.gov.uk the 'name' is metropolitan; in the URL my.really.sub.domained.site.tld.co.uk/site the 'name' is tld. There is no pattern to match the 'name'; it can appear anywhere, and the only 100% way of doing it is by knowing and maintaining all TLDs, which is not practical. Even then I don't know how you would deal with a domain such as my.co.uk.domain.com – Nicholas King Oct 16 '13 at 10:09
0

I have tested the following Regex with all your cases and it works.

string url = "http://www.google.co.uk/path1/path2";
Regex rgx = new Regex(@"(http(s?)://)?(www\.)?((?<content>.*?)\.){1}([\w]+\.?)+"); // dot after www escaped so it matches a literal '.'
Match MatchResult = rgx.Match(url);
string result = MatchResult.Groups["content"].Value; //google
John Saunders
mit
  • 1
    -1 does not match sub domains – Nicholas King Oct 16 '13 at 10:19
  • 2
    As with other approaches, this won't work with unknown (sub) domains or TLDs. Let's assume you're feeding it `http://livingroom.home/` it would probably return `livingroom` while the expected domain might be `home`. In a similar way it won't parse `http://maps.google.com/` to return `google`. – Mario Oct 16 '13 at 10:21