42

I am trying to extract just the domain name from a URL string. I almost have it; I am using the Uri class.

I have a string. My first thought was to use a regex, but then I decided to use the Uri class.

http://www.google.com/url?sa=t&source=web&ct=res&cd=1&ved=0CAgQFjAA&url=http://www.test.com/&rct=j&q=test&ei=G2phS-HdJJWTjAfckvHJDA&usg=AFQjCNFSEAztaqtkaIvEzxmRm2uOARn1kQ

I need to convert the above to google.com and to google, without the www.

I did the following:

Uri test = new Uri(referrer);
log.Info("Domain part : " + test.Host);

Basically this returns www.google.com, but I would like to return two forms if possible, as mentioned:

google.com and google

Is this possible with Uri?

John Saunders
mark smith
  • 3
    What should the result be for 'foo.bar.com'? What about 'foo.co.uk'? What about 'foo.bar.museum'? – Mark Byers Jan 28 '10 at 11:45
  • Hi Mark, basically I am after the pure domain name. So if it starts with ww3.test.co.uk, it should return test.co.uk, as this is the pure domain. In your example, foo.co.uk should return foo.co.uk, and foo.bar.museum would return bar.museum. But .museum is not a valid top-level domain like .com, .co.uk, .us etc., is it? – mark smith Jan 28 '10 at 11:48
  • 2
    .museum, .mobi and .travel are perfectly valid top level domain names. Could you clarify please, why ww3 is not a part of 'pure' domain name, while foo is? What is *your* definition of a pure domain name? – Igor Korkhov Jan 28 '10 at 13:02
  • Mark, maybe you should explain what your goal is? – egrunin Jan 28 '10 at 13:29
  • This is absolutely a valid request. I'm trying to strip a list of about 40,000 domain names and get to the "pure" domain part. Often the prefixed part of the domain leads to back-end functions. We are trying to provide a public domain for users to navigate to, which is generally the latter part. I'm surprised this has never been answered! – Nugs Jun 09 '18 at 05:09

12 Answers

29

Yes, it is possible. Use:

Uri.GetLeftPart( UriPartial.Authority )
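For reference, a minimal sketch of what this call returns next to Uri.Host. The class and method names (UriPartsDemo, GetAuthority, GetHost) are made up for illustration. Note that the result still carries the scheme and the www label, so on its own it does not give google.com.

```csharp
using System;

public static class UriPartsDemo
{
    // Scheme plus authority, e.g. "http://www.google.com".
    public static string GetAuthority(string url)
    {
        return new Uri(url).GetLeftPart(UriPartial.Authority);
    }

    // Host name only, e.g. "www.google.com".
    public static string GetHost(string url)
    {
        return new Uri(url).Host;
    }
}
```

Calling GetAuthority on the URL from the question gives the scheme and host, while GetHost gives the same value the question already gets from test.Host.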
Dewfy
8

Use Nager.PublicSuffix

install-package Nager.PublicSuffix

var domainParser = new DomainParser(new WebTldRuleProvider());

var domainName = domainParser.Get("sub.test.co.uk");
//domainName.Domain = "test";
//domainName.Hostname = "sub.test.co.uk";
//domainName.RegistrableDomain = "test.co.uk";
//domainName.SubDomain = "sub";
//domainName.TLD = "co.uk";
Toolkit
7

I tried pretty much every approach, but all of them fell short of the desired result. So here is my approach, adapted from servermanfail's.

The TLD file is available at https://publicsuffix.org/list/. I took the file from https://publicsuffix.org/list/effective_tld_names.dat, parse it, and search for the TLDs. If new TLDs are published, just download the latest file.

Have fun.

using System;
using System.Collections.Generic;
using System.IO;

namespace SearchWebsite
{
internal class NetDomain
{
    static public string GetDomainFromUrl(string Url)
    {
        return GetDomainFromUrl(new Uri(Url));
    }

    static public string GetDomainFromUrl(string Url, bool Strict)
    {
        return GetDomainFromUrl(new Uri(Url), Strict);
    }

    static public string GetDomainFromUrl(Uri Url)
    {
        return GetDomainFromUrl(Url, false);
    }

    static public string GetDomainFromUrl(Uri Url, bool Strict)
    {
        initializeTLD();
        if (Url == null) return null;
        var dotBits = Url.Host.Split('.');
        if (dotBits.Length == 1) return Url.Host; //eg http://localhost/blah.php = "localhost"
        if (dotBits.Length == 2) return Url.Host; //eg http://blah.co/blah.php = "blah.co"
        string bestMatch = "";
        foreach (var tld in DOMAINS)
        {
            // Match on a label boundary; a bare EndsWith(tld) can match in the middle of a label.
            if (Url.Host.EndsWith("." + tld, StringComparison.InvariantCultureIgnoreCase))
            {
                if (tld.Length > bestMatch.Length) bestMatch = tld;
            }
        }
        if (string.IsNullOrEmpty(bestMatch))
            return Url.Host; //eg http://domain.com/blah = "domain.com"

        //add the domain name onto tld
        string[] bestBits = bestMatch.Split('.');
        string[] inputBits = Url.Host.Split('.');
        int getLastBits = bestBits.Length + 1;
        bestMatch = "";
        for (int c = inputBits.Length - getLastBits; c < inputBits.Length; c++)
        {
            if (bestMatch.Length > 0) bestMatch += ".";
            bestMatch += inputBits[c];
        }
        return bestMatch;
    }


    static private void initializeTLD()
    {
        if (DOMAINS.Count > 0) return;

        string line;
        // "using" disposes the reader even if reading throws.
        using (StreamReader reader = File.OpenText("effective_tld_names.dat"))
        {
            while ((line = reader.ReadLine()) != null)
            {
                if (!string.IsNullOrEmpty(line) && !line.StartsWith("//"))
                {
                    DOMAINS.Add(line);
                }
            }
        }
    }


    // This file was taken from https://publicsuffix.org/list/effective_tld_names.dat

    static public List<String> DOMAINS = new List<String>();
}

}

Cedi P
  • 3
    I have used your solution and it worked pretty well until I discovered some bugs. For instance `www.navistar.com` does not strip the `www` part. This is the fix `if (!url.Host.EndsWith("." + tld, StringComparison.InvariantCultureIgnoreCase)) continue;` – wpfwannabe Oct 16 '15 at 14:17
6

google.com is not guaranteed to be the same as www.google.com (well, for this example it technically is, but it may be otherwise).

Maybe what you actually need is to remove the top-level domain and the "www" subdomain? Then just Split('.') and take the part before the last part!
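A rough sketch of that idea, assuming the TLD is always a single label. NaiveDomain and SecondToLastLabel are illustrative names, not an existing API. With a host such as foo.co.uk this would return "co", so it only suits inputs known to end in something like .com.

```csharp
using System;

public static class NaiveDomain
{
    // Returns the label just before the last one,
    // e.g. "google" for "www.google.com".
    // Assumption: the TLD is exactly one label.
    public static string SecondToLastLabel(string host)
    {
        string[] parts = host.Split('.');
        // A host without dots (e.g. "localhost") is returned unchanged.
        return parts.Length < 2 ? host : parts[parts.Length - 2];
    }
}
```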

naivists
5

Below is some code that will give just the SLD plus gTLD or ccTLD extension (note the exception below). I do not care about DNS.

The theory is as follows:

  • Anything under 3 tokens stays as is, e.g. "localhost" or "domain.com". Otherwise, the last token must be a gTLD or ccTLD extension.
  • The penultimate token is considered part of the extension if its length is < 3 or if it is included in a list of exceptions.
  • Finally, the token before that one is considered the SLD. Anything before that is considered a subdomain or a host qualifier, e.g. www.

As for the code, short & sweet:

// Requires "using System;" and "using System.Linq;" (for Contains on the array).
private static string GetDomainName(string url)
{
    string domain = new Uri(url).DnsSafeHost.ToLowerInvariant();
    var tokens = domain.Split('.');
    if (tokens.Length > 2)
    {
        //Add only second level exceptions to the < 3 rule here
        string[] exceptions = { "info", "firm", "name", "com", "biz", "gen", "ltd", "web", "net", "pro", "org" }; 
        var validTokens = 2 + ((tokens[tokens.Length - 2].Length < 3 || exceptions.Contains(tokens[tokens.Length - 2])) ? 1 : 0);
        domain = string.Join(".", tokens, tokens.Length - validTokens, validTokens);
    }
    return domain;
}

The obvious exception is that this will not deal with 2-letter domain names. So if you're lucky enough to own ab.com you'll need to adapt the code slightly. For us mere mortals this code will cover just about every gTLD and ccTLD, minus a few very exotic ones.

Peter O.
anoordende
3

I think you are misunderstanding what constitutes a "domain name". There is no such thing as a "pure domain name" in common usage, so this is something you will need to define if you want consistent results.
Do you just want to strip off the "www" part, and then have another version which also strips off the top-level domain (e.g. the ".com" or ".co.uk" part)? Another answer mentions Split('.'); you will need something like this if you want to exclude specific parts of the hostname manually. There's nothing within the .NET Framework that meets your requirements exactly, so you'll need to implement these things yourself.

David_001
  • I like Nager library definition - a registrable domain name domainName.Hostname = "sub.test.co.uk"; domainName.RegistrableDomain = "test.co.uk"; – rothschild86 Jan 09 '23 at 16:03
3

I came up with the solution below (using LINQ; note that TakeLast is not available in .NET Framework):

    public string MainDomainFromHost(string host)
    {
        string[] parts = host.Split('.');
        if (parts.Length <= 2)
            return host; // host is probably already a main domain
        if (parts[parts.Length - 1].All(char.IsNumber))
            return host; // host is probably an IPv4 address
        if (parts[parts.Length - 1].Length == 2 && parts[parts.Length - 2].Length == 2)
            return string.Join(".", parts.TakeLast(3)); // this is the case for co.uk, co.in, etc...
        return string.Join(".", parts.TakeLast(2)); // all others, take only the last 2
    }
1

Yes, I've posted the solution here: http://pastebin.com/raw.php?i=raxNQkCF

If you want to remove the extension, just add:

if (url.IndexOf(".") > -1) { url = url.Substring(0, url.IndexOf(".")); }

maxp
1

Uri.Host always returns the full host name (www.google.com), including a label (www) and a top-level domain (com). But often you want to extract the middle bit. I simply do:

Uri uri;
bool result = Uri.TryCreate(returnUri, UriKind.Absolute, out uri);
if (result == false)
    return false;

//if you are sure it's not "localhost"
string[] domainParts = uri.Host.Split('.');
string topLevel = domainParts[domainParts.Length - 1];
string hostBody = domainParts[domainParts.Length - 2];
string label = domainParts[domainParts.Length - 3];

But you do need to check domainParts.Length first, as the given URI is often just "google.com", which has no third part.
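One way the length check could look, as a sketch only. HostLabels and TrySplit are hypothetical names, and the code still assumes a single-label TLD and a DNS host name rather than an IP address. The label comes back empty when the host has just two parts.

```csharp
using System;

public static class HostLabels
{
    // Hypothetical helper: splits the host of an absolute URL into
    // label (e.g. "www"), hostBody (e.g. "google") and topLevel (e.g. "com").
    // Returns false for relative or invalid URLs and for dot-less hosts.
    public static bool TrySplit(string url, out string label,
                                out string hostBody, out string topLevel)
    {
        label = hostBody = topLevel = string.Empty;

        Uri uri;
        if (!Uri.TryCreate(url, UriKind.Absolute, out uri)) return false;

        string[] parts = uri.Host.Split('.');
        if (parts.Length < 2) return false; // e.g. "localhost"

        topLevel = parts[parts.Length - 1];
        hostBody = parts[parts.Length - 2];
        // Only read the third-from-last part when it exists.
        if (parts.Length >= 3) label = parts[parts.Length - 3];
        return true;
    }
}
```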

Andrew Chaa
0

I found a solution for myself that does not use any TLD list.

It relies on the name always being at the second-to-last position in the Host part of a Uri: subdomains are in front of the name and the TLD is behind it. (This assumes a single-label TLD such as .com.)

See here:

private static string GetNameFromHost(string host)
{
    var parts = host.Split('.');
    // A host without dots (e.g. "localhost") has no TLD to strip.
    if (parts.Length < 2)
    {
        return host;
    }
    // Otherwise the name is the second-to-last label.
    return parts[parts.Length - 2];
}
Snickbrack
-1

Because of the numerous variations in domain names and the non-existence of any real authoritative list of what constitutes a "pure domain name" as you describe, I've just resorted to using Uri.Host in the past. To avoid cases where www.google.com and google.com show up as two different domains, I've often resorted to stripping the www. from all domains that contain it, since it's almost guaranteed (ALMOST) to point to the same site. It's really the only simple way to do it without risking losing some data.
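A minimal sketch of that normalization, assuming only a leading "www." should be dropped. HostNormalizer and HostWithoutWww are names invented for this example.

```csharp
using System;

public static class HostNormalizer
{
    // Returns Uri.Host with one leading "www." removed, if present.
    // Other subdomains (e.g. "ww3.") are deliberately left alone.
    public static string HostWithoutWww(Uri uri)
    {
        const string prefix = "www.";
        string host = uri.Host;
        return host.StartsWith(prefix, StringComparison.OrdinalIgnoreCase)
            ? host.Substring(prefix.Length)
            : host;
    }
}
```

Both http://www.google.com/ and http://google.com/ then map to the same key, google.com.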

Chris
-2
string domain = HttpContext.Current.Request.Url.GetLeftPart(UriPartial.Authority);
Roman C
craig