1

Let's say I have this code:

Uri uri = new Uri("http://www.xx.yy.co.uk/folder/whatever.html");

How can I get xx, yy and co.uk from a Uri in C#? I tried nearly every property of the Uri class and didn't find anything relevant.

Note that, for example, com and co.uk are both a single string.

dimitris93

3 Answers

1

As you've found, the built-in System.Uri doesn't break the host name into its individual domain parts. The parsing you are asking for is quite specific, because .com and .co.uk are not equivalent components of the host: .com and .uk are both top-level domains, whereas .co.uk is a second-level domain under .uk.

Two easy ways to do this yourself are:

  • Adapt an established URL-parsing regex to the host name held in the Host property of the Uri, and use named captures (groups) in the regex to conveniently extract the portions.

  • Extend the System.Uri class by creating your own class that inherits from it, and add a method that breaks down the host in the specific way you want.
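
As a rough illustration of the first option, here is a minimal sketch of a regex with named groups applied to the Host property. The class name HostRegexSketch, the method SplitHost and the pattern itself are hypothetical, and the pattern only knows about two multi-part suffixes (co.uk and edu.cn), so treat it as a starting point rather than a complete solution.

```csharp
using System;
using System.Text.RegularExpressions;

public static class HostRegexSketch
{
    // Hypothetical pattern: "labels" is matched lazily, so the longest
    // alternative that fits is captured as "suffix". Only co.uk and edu.cn
    // are listed as multi-part suffixes; anything else falls back to the
    // last single label.
    private static readonly Regex HostPattern = new Regex(
        @"^(?<labels>.+?)\.(?<suffix>co\.uk|edu\.cn|[^.]+)$",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static string[] SplitHost(Uri uri)
    {
        Match m = HostPattern.Match(uri.Host);
        if (!m.Success)
        {
            // Host without any dot (for example "localhost"): return it as-is.
            return new[] { uri.Host };
        }

        // Split the leading labels on '.', then append the suffix as one element.
        string[] labels = m.Groups["labels"].Value.Split('.');
        string[] result = new string[labels.Length + 1];
        labels.CopyTo(result, 0);
        result[labels.Length] = m.Groups["suffix"].Value;
        return result;
    }
}
```

Calling HostRegexSketch.SplitHost(new Uri("http://www.xx.yy.co.uk/folder/whatever.html")) should yield the elements www, xx, yy and co.uk. Any suffix that is not listed in the pattern is split label by label.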

Community
slugster
  • So basically, making a list of all the `.com`, `.org`, `.co.uk` etc. is the only solution? – dimitris93 May 06 '15 at 01:49
  • @Shiro You don't need to make or keep a list. This could be done in a regex, but it would be complicated. It might be simpler if you just split the `Host` on each `.`, then recombine the last two strings in the resulting array if they meet a certain condition. Note that there are all sorts of domain and ccTLD and TLD combinations possible which will make things very complicated, so ultimately the aforementioned string.Split() might be the way to go. – slugster May 06 '15 at 01:56
  • 2
    Note that from the URI point of view, the host has no separate components. Top-level and second-level domains have meaning from the DNS point of view, but there is no formal way to look at a host name and figure out which part should correspond to the "country level". See http://stackoverflow.com/questions/14427817/list-of-all-top-level-domains for a possible location of a TLD list – Alexei Levenkov May 06 '15 at 02:27
1

The problem is that there is a very large list of "pseudo top-level domains", such as co.uk, wakayama.jp or edu.cn, and even some with three parts. There is no built-in list of them in C#, so the best solution that I can see is to specify the ones that you expect and split on them, as shown below:

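// Requires: using System; using System.Collections.Generic; using System.Linq;
List<string> parts = null;
Uri uri = new Uri("http://www.xx.yy.co.uk/folder/whatever.html");
string s = uri.Host;
string[] twoLevelDomains = { "co.uk", "edu.cn" };
foreach (var twoLevelDomain in twoLevelDomains)
{
    // Match only on a label boundary, so that "xco.uk" is not treated as "co.uk".
    string suffix = "." + twoLevelDomain;
    if (s.EndsWith(suffix, StringComparison.OrdinalIgnoreCase))
    {
        // Strip the suffix from the end only, then split the remaining labels.
        parts = s.Substring(0, s.Length - suffix.Length).Split('.').ToList();
        parts.Add(twoLevelDomain);
        break;
    }
}
if (parts == null)
{
    // No known two-level domain: fall back to a plain split on '.'.
    parts = s.Split('.').ToList();
}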

Background: The only official top-level domains consist of a single part, such as .uk. A somewhat comprehensive list of the "pseudo top-level domains" is available here: https://wiki.mozilla.org/TLD_List . While it is a big list, it still does not seem complete, since many countries are listed with just one top-level domain and there are entries such as "(others ?)".
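
If the list of suffixes grows beyond a handful, looping over every entry becomes awkward. A minimal sketch of an alternative, assuming the suffixes have been loaded into a set: walk the host's labels from the left and take the longest trailing run that appears in the set. The names SuffixSplitSketch, KnownSuffixes and SplitHost are hypothetical, and the three entries are only a sample taken from the examples above, not a real list.

```csharp
using System;
using System.Collections.Generic;

public static class SuffixSplitSketch
{
    // Tiny illustrative sample; a real list would be far larger and
    // would normally be loaded from a file rather than hard-coded.
    private static readonly HashSet<string> KnownSuffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "co.uk", "edu.cn", "wakayama.jp"
        };

    public static List<string> SplitHost(string host)
    {
        string[] labels = host.Split('.');

        // Try the longest candidate suffix first. Keep at least one label in
        // front of the suffix, and only consider suffixes of two or more labels.
        for (int start = 1; start < labels.Length - 1; start++)
        {
            string candidate = string.Join(".", labels, start, labels.Length - start);
            if (KnownSuffixes.Contains(candidate))
            {
                var parts = new List<string>(labels.Length);
                for (int i = 0; i < start; i++)
                {
                    parts.Add(labels[i]);
                }
                parts.Add(candidate);
                return parts;
            }
        }

        // No known multi-part suffix: every label is its own element.
        return new List<string>(labels);
    }
}
```

For example, SuffixSplitSketch.SplitHost(uri.Host) with the Uri from above should return www, xx, yy and co.uk, while a host such as www.example.com is simply split on every dot.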

Mattias
-1

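This splits the host name on each '.'; examine the array elements. Note that co and uk come back as separate elements, so they still have to be recombined if you want co.uk as a single string:

 Uri uri = new Uri("http://www.xx.yy.co.uk/folder/whatever.html");
 string host = uri.Host; // host name only, without the scheme or path

 char[] splitChar = { '.' };
 var nodesArray = host.Split(splitChar);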
Gregg