Parsing string for Domain / hostName

Question

Our customers can enter website addresses (domain names). They can also enter the email addresses of their contacts.

Now we need to find customers whose website domain can be associated with the domains of those email addresses.

So my idea is to extract the host from the website URL and from the email address and compare them.

So what's the most reliable algorithm to get the hostname from a URL?

For example, a host can be entered as:

foo.com
www.foo.com
http://foo.com
https://foo.com
https://www.foo.com

The result should always be foo.com

Point of clarification: since you deleted the example with the .vu TLD, are you saying you only care about .com TLDs, or is this an oversimplification? – Mike Pennington, May 24 '12 at 10:36
It's an oversimplification. It could be any kind of TLD: .de, .eu, .biz, and so on. The important requirement is to find possible candidates matching email addresses by looking at website URLs. – Boas Enkler, May 24 '12 at 10:39

score 15 · Accepted Answer · answered May 24 '12 at 12:39

Rather than relying on a fragile regex, use System.Uri to do the parsing for you. Use code like this:

string uriStr = "www.foo.com";
if (!uriStr.Contains(Uri.SchemeDelimiter)) {
    uriStr = string.Concat(Uri.UriSchemeHttp, Uri.SchemeDelimiter, uriStr);
}
Uri uri = new Uri(uriStr);
string domain = uri.Host; // will return www.foo.com

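To get the scheme plus authority you can use GetLeftPart. Note that it keeps the scheme and the www. subdomain, so on its own it does not yield foo.com:

string tld = uri.GetLeftPart(UriPartial.Authority); // returns http://www.foo.com, not foo.com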

anubhava

3

shouldn't tld result in just "com" ? – mikesjawnbit Mar 08 '13 at 01:43
3

@anubhava: uri.GetLeftPart(UriPartial.Authority) does not return the root domain name. Instead it returns the entire left part of the URL, starting from the scheme and ending with the port (if specified). AFAIK, the only way to ignore the sub-domain portion of the host is to explicitly truncate it using a 2-pass call to string.LastIndexOf(). – Tim Coulter Aug 06 '13 at 13:29
1

I can confirm (certainly in ASP.NET Core 3.1) that string tld = uri.GetLeftPart(UriPartial.Authority); will NOT return foo.com. Here the result still contains www.foo.com (same for non-www subdomains). – MemeDeveloper Apr 02 '21 at 18:34

score 1 · Answer 2 · answered May 24 '12 at 10:15

Here's a regular expression that will match the URLs you have provided. Basically, the http:// or https:// prefix is optional, as is the www. prefix. Everything after that is then matched up to a possible path:

var expression = /(https?:\/\/)?(www\.)?([^\/]*)(\/.*)?$/;

This would mean that:

var result = 'https://www.foo.com.vu/blah'.replace(expression, '$3');

would evaluate to:

result === 'foo.com.vu'
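
As an alternative to the regular expression, the same normalization could be sketched with the standard URL parser available in browsers and Node.js. This is only a sketch: the helper name hostWithoutWww is hypothetical and not part of the original answer, and it assumes that input without a scheme should be treated as http. Like the expression above, it strips only a leading www. and keeps any other subdomain.

```javascript
// Hypothetical helper (not from the original answer): normalizes user input
// such as "www.foo.com" or "https://foo.com/path" to a bare host name.
function hostWithoutWww(input) {
  const trimmed = input.trim();
  // The URL constructor needs a scheme, so prepend one when it is missing.
  const withScheme = /^[a-z][a-z0-9+.-]*:\/\//i.test(trimmed)
    ? trimmed
    : 'http://' + trimmed;
  // hostname excludes the port and is lowercased by the parser.
  const host = new URL(withScheme).hostname;
  // Strip only a leading "www." label; other subdomains are kept.
  return host.replace(/^www\./, '');
}
```

With this sketch, all five inputs from the question (foo.com, www.foo.com, http://foo.com, https://foo.com, https://www.foo.com) reduce to foo.com, while product.mycompany.com is left unchanged.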

cmilhench

The question is what to do about subdomains. I think they should not be included in the result, so product.mycompany.com should end up as mycompany.com. – Boas Enkler May 24 '12 at 10:33
1

That could be quite difficult, as you couldn't count the dots to assume a sub-domain (I guess what I'm trying to say is that things like .co.uk would mess things up). You'd probably have to do two checks: one with the expression above and one that strips off the characters before the first dot. – cmilhench May 24 '12 at 10:38
This answer fails if you evaluate a DNS name with invalid characters (such as `a!notit.com`), or one with a label that is too long (over 63 characters). – Mike Pennington May 24 '12 at 11:50
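
The subdomain problem raised in these comments could be approached with a suffix lookup instead of counting dots. The sketch below is an illustration under stated assumptions, not a complete solution: the three-entry suffix set is a made-up stand-in for a maintained data set such as the Public Suffix List, and the helper names registrableDomain and mailMatchesSite are hypothetical. Both helpers expect a bare host name, not a full URL.

```javascript
// Illustrative only: a tiny stand-in for a real public-suffix data set.
const MULTI_LABEL_SUFFIXES = new Set(['co.uk', 'com.vu', 'com.au']);

// Hypothetical helper: reduces a host name to its suffix plus one label.
function registrableDomain(host) {
  const labels = host.toLowerCase().split('.').filter(Boolean);
  if (labels.length <= 2) return labels.join('.');
  const lastTwo = labels.slice(-2).join('.');
  // Keep three labels when the last two form a known multi-label suffix.
  const keep = MULTI_LABEL_SUFFIXES.has(lastTwo) ? 3 : 2;
  return labels.slice(-keep).join('.');
}

// Hypothetical helper: does the mail address belong to the site's domain?
function mailMatchesSite(mailAddress, siteHost) {
  // Everything after the last "@" is treated as the mail host.
  const mailHost = mailAddress.slice(mailAddress.lastIndexOf('@') + 1);
  return registrableDomain(mailHost) === registrableDomain(siteHost);
}
```

Under these assumptions, product.mycompany.com reduces to mycompany.com and www.foo.co.uk to foo.co.uk, so a contact address at mail.foo.com would be reported as a candidate match for a customer whose website is www.foo.com. Any suffix missing from the set is handled as a plain two-label domain, which is why a real suffix list matters in practice.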

score 1 · Answer 3 · answered Apr 20 '13 at 13:56

There is already a URL parser in C# for extracting this information.

Here are some examples: http://www.stev.org/post/2011/06/27/C-HowTo-Parse-a-URL.aspx

score 0 · Answer 4 · answered Jan 14 '14 at 22:08

See this URL. The Host property, unlike Authority, does not include the port number.

http://msdn.microsoft.com/en-us/library/system.uri.host(v=vs.110).aspx

Rafi
