5

This question has answers for other languages/platforms, but I couldn't find a robust solution in C#. I'm looking for the part of a URL that is used in a WHOIS lookup, so I'm not interested in subdomains, port, scheme, etc.

Example 1: http://s1.website.co.uk/folder/querystring?key=value => website.co.uk
Example 2: ftp://username:password@website.com => website.com

The result should be the same whenever the WHOIS owner is the same, so sub1.xyz.com and sub2.xyz.com both belong to whoever owns xyz.com, which is what I need to extract from a URL.

Xaqron

4 Answers

5

I needed the same, so I wrote a class that you can copy and paste into your solution. It uses a hard-coded string array of TLDs: http://pastebin.com/raw.php?i=VY3DCNhp

Console.WriteLine(GetDomain.GetDomainFromUrl("http://www.beta.microsoft.com/path/page.htm"));

outputs microsoft.com

and

Console.WriteLine(GetDomain.GetDomainFromUrl("http://www.beta.microsoft.co.uk/path/page.htm"));

outputs microsoft.co.uk

servermanfail
  • Thanks for sharing your work. Another problem is keeping the list updated but I don't think it changes very frequently. – Xaqron Feb 13 '11 at 09:46
  • This class is great. I have created a complete list of all TLDs from [the PublicSuffix list](http://publicsuffix.org/list/), updated as of today. It's almost twice as big as the one you submitted (~6390 entries). You can find the variable at http://pastebin.com/raw.php?i=PxKWw5jt, should you ever need it. :) Thank you once again! :) – moskalak Feb 02 '14 at 11:12
  • 1
    None of the links are available now. – venkat balabhadra Mar 26 '21 at 18:06
3

As @Pete noted, this is a little bit complicated, but I'll give it a try.

Note that this application must contain a complete list of known TLDs. These can be retrieved from http://publicsuffix.org/. Extracting the list from that site is left as an exercise for the reader.

using System;
using System.Collections.Generic;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        var testCases = new[]
        {
            "www.domain.com.ac",
            "www.domain.ac",
            "domain.com.ac",
            "domain.ac",
            "localdomain",
            "localdomain.local"
        };

        foreach (string testCase in testCases)
        {
            Console.WriteLine("{0} => {1}", testCase, UriHelper.GetDomainFromUri(new Uri("http://" + testCase + "/")));
        }

        /* Produces the following results:

            www.domain.com.ac => domain.com.ac
            www.domain.ac => domain.ac
            domain.com.ac => domain.com.ac
            domain.ac => domain.ac
            localdomain => localdomain
            localdomain.local => localdomain.local
         */
    }
}

public static class UriHelper
{
    private static HashSet<string> _tlds;

    static UriHelper()
    {
        _tlds = new HashSet<string>
        {
            "com.ac",
            "edu.ac",
            "gov.ac",
            "net.ac",
            "mil.ac",
            "org.ac",
            "ac"

            // Complete this list from http://publicsuffix.org/.
        };
    }

    public static string GetDomainFromUri(Uri uri)
    {
        return GetDomainFromHostName(uri.Host);
    }

    public static string GetDomainFromHostName(string hostName)
    {
        string[] hostNameParts = hostName.Split('.');

        if (hostNameParts.Length == 1)
            return hostNameParts[0];

        int matchingParts = FindMatchingParts(hostNameParts, 1);

        return GetPartOfHostName(hostNameParts, hostNameParts.Length - matchingParts);
    }

    private static int FindMatchingParts(string[] hostNameParts, int offset)
    {
        if (offset == hostNameParts.Length)
            return hostNameParts.Length;

        string domain = GetPartOfHostName(hostNameParts, offset);

        if (_tlds.Contains(domain.ToLowerInvariant()))
            return (hostNameParts.Length - offset) + 1;

        return FindMatchingParts(hostNameParts, offset + 1);
    }

    private static string GetPartOfHostName(string[] hostNameParts, int offset)
    {
        var sb = new StringBuilder();

        for (int i = offset; i < hostNameParts.Length; i++)
        {
            if (sb.Length > 0)
                sb.Append('.');

            sb.Append(hostNameParts[i]);
        }

        string domain = sb.ToString();
        return domain;
    }
}
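
One way to fill the hard-coded set above is to parse a local copy of the Public Suffix List instead of typing entries by hand. The following is only a sketch: `SuffixListLoader`, `Parse` and `LoadFromFile` are names made up for this example, and it assumes the list's plain-text format (one rule per line, `//` for comments). It deliberately skips wildcard (`*.`) and exception (`!`) rules, which a complete implementation would have to handle.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of the answer's original code.
public static class SuffixListLoader
{
    // Builds a case-insensitive set of suffixes from the lines of the list.
    public static HashSet<string> Parse(IEnumerable<string> lines)
    {
        var suffixes = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        foreach (string rawLine in lines)
        {
            string line = rawLine.Trim();

            // Skip blank lines and comments.
            if (line.Length == 0 || line.StartsWith("//", StringComparison.Ordinal))
                continue;

            // Simplification: ignore wildcard and exception rules.
            if (line.StartsWith("*.", StringComparison.Ordinal) ||
                line.StartsWith("!", StringComparison.Ordinal))
                continue;

            // A rule ends at the first whitespace character.
            int whitespace = line.IndexOfAny(new[] { ' ', '\t' });
            if (whitespace >= 0)
                line = line.Substring(0, whitespace);

            suffixes.Add(line);
        }

        return suffixes;
    }

    // path points at a locally saved copy of the list (assumed to exist).
    public static HashSet<string> LoadFromFile(string path)
    {
        return Parse(File.ReadLines(path));
    }
}
```

The static constructor of `UriHelper` could then assign `_tlds = SuffixListLoader.LoadFromFile(path)` for whatever path the copy is saved under.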
Pieter van Ginkel
  • @Xaqron - I don't see how. I've copied the entire code into a new Console project and it compiles correctly and gives the expected results. Could you please be more specific on what you believe is missing? – Pieter van Ginkel Nov 08 '10 at 14:35
  • It was missing just below the GetDomainFromHostName() method, but it's there now. Thanks. – Xaqron Nov 08 '10 at 15:12
1

The closest you could get is the System.Uri.Host property, which would extract the sub1.xyz.com portion. Unfortunately, it's hard to know exactly what the "top-level" portion of the host is (e.g. sub1.foo.co.uk versus sub1.xyz.com).
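
As a small illustration of what `Uri.Host` does and does not give you, here is a sketch using the two URLs from the question. `HostDemo` and `GetHost` are names invented for this example. The property drops the scheme, credentials, port, path and query, but keeps any subdomains.

```csharp
using System;

// Hypothetical wrapper around System.Uri.Host, for illustration only.
public static class HostDemo
{
    // Returns the host portion of an absolute URL.
    public static string GetHost(string url)
    {
        return new Uri(url).Host;
    }
}
```

`HostDemo.GetHost("http://s1.website.co.uk/folder/querystring?key=value")` returns `s1.website.co.uk`, and `HostDemo.GetHost("ftp://username:password@website.com")` returns `website.com`. The first result still contains the `s1` subdomain, which is the part that cannot be stripped without knowing the suffix.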

Pete
  • it's almost impossible to know for sure which is the toplevel, because for instance .co.uk requires two parts, but .info or .jp require something other than `.[a-zA-Z]{3}` – jcolebrand Nov 08 '10 at 02:10
  • The [Public Suffix List](http://publicsuffix.org/) can be used for this sort of task. But it's probably easiest just to `whois` the whole hostname and work up a segment at a time until you get results. – bobince Nov 08 '10 at 02:24
  • That list "should" be right, but that's my point. "should" is not a great business rule... – jcolebrand Nov 08 '10 at 02:25
  • @bobince Yeah, that's probably the most reliable way to do this, working your way up the segments. – Pete Nov 08 '10 at 12:00
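
The "work up a segment at a time" idea from the comments could look roughly like the sketch below. The class and method names are invented for this example, and `hasWhoisRecord` is a hypothetical callback standing in for a real WHOIS query, which is not implemented here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for the segment-by-segment approach.
public static class WhoisCandidates
{
    // Yields the full host name first, then each parent obtained by
    // dropping the leftmost label, ending with the last label alone.
    public static IEnumerable<string> FromHostName(string hostName)
    {
        string[] labels = hostName.TrimEnd('.').Split('.');

        for (int i = 0; i < labels.Length; i++)
            yield return string.Join(".", labels, i, labels.Length - i);
    }

    // Returns the first candidate for which the (assumed) WHOIS lookup
    // reports a record, or null if none does.
    public static string FindRegisteredDomain(string hostName, Func<string, bool> hasWhoisRecord)
    {
        foreach (string candidate in FromHostName(hostName))
        {
            if (hasWhoisRecord(candidate))
                return candidate;
        }

        return null;
    }
}
```

For `s1.website.co.uk` the candidates are `s1.website.co.uk`, `website.co.uk`, `co.uk` and `uk`, in that order. This trades a maintained suffix list for one network lookup per candidate.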
0

If you need the domain name, you can use Uri.Host in .NET.

If you need to pull URLs out of content, you need to parse them using a regex.