How to Extract Domain name from string with Regex in C#?

Question

I want to extract domain names, with generic top-level domains and country-code top-level domains, from a string using Regex. I have tested many patterns, for example:

var linkParser = new Regex(@"\b(?:https?://|www\.)\S+\b", RegexOptions.Compiled | RegexOptions.IgnoreCase);
Match m = linkParser.Match(Url);
Console.WriteLine(m.Value);

But none of them worked properly. The string entered by the user can take any of the following forms:

jonasjohn.com
http://www.jonasjohn.de/snippets/csharp/
jonasjohn.de
www.jonasjohn.de/snippets/csharp/
http://www.answers.com/article/1194427/8-habits-of-extraordinarily-likeable-people
http://www.apple.com
https://www.cnn.com.au
http://www.downloads.news.com.au
https://ftp.android.co.nz
http://global.news.ca
https://www.apple.com/
https://ftp.android.co.nz/
http://global.news.ca/
https://www.apple.com/
https://johnsmith.eu
ftp://johnsmith.eu
johnsmith.gov.ae
johnsmith.eu
www.jonasjohn.de
www.jonasjohn.ac.ir/snippets/csharp
http://www.jonasjohn.de/
ftp://www.jonasjohn.de/
https://subdomain.abc.def.jonasjohn.de/test.htm

The Regex I tested:

^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/\n]+)

\b(?:https?://|www\.)\S+\b

://(?<host>([a-z\\d][-a-z\\d]*[a-z\\d]\\.)*[a-z][-a-z\\d]+[a-z])

and many others. I only need the domain name, without the protocol or any subdomain, for example: Domainname.gTLD, DomainName.ccTLD or DomainName.xyz.ccTLD

I got the list of suffixes from the Public Suffix List.

Of course, I've seen a lot of posts on stackoverflow.com, but none of them answered my question.

Why would you use Regex, if you have [Uri](https://learn.microsoft.com/en-us/dotnet/api/system.uri?view=net-5.0) ? — Fildor, Jul 05 '21 at 11:58
Does [this](https://stackoverflow.com/a/14212007/9363973) answer solve it? — MindSwipe, Jul 05 '21 at 11:58
@LeiYang Did you check that against OP's list of possible input examples? — Fildor, Jul 05 '21 at 12:01
@Fildor I tried it in an online tester. Which line do you think does not match? — Lei Yang, Jul 05 '21 at 12:04
@Fildor I just noticed a problem when using `Uri`, in that `new Uri("www.jonasjohn.de")` will throw an exception as the format can not be determined. Check out the [demo](https://dotnetfiddle.net/EoxoKl) I put together — MindSwipe, Jul 05 '21 at 12:04
@MindSwipe Yep, found that, too: https://dotnetfiddle.net/mixx9C. Nevertheless, I'd do it for a "first round", then look at the "dead letter queue". — Fildor, Jul 05 '21 at 12:13

Panagiotis Kanavos · Answer 1 · 2021-07-05T12:56:24.477

3

You don't need a Regex to parse a URL. If you have a valid URL, you can use one of the Uri constructors or Uri.TryCreate to parse it:

if(Uri.TryCreate("http://google.com/asdfs",UriKind.RelativeOrAbsolute,out var uri))
{
    Console.WriteLine(uri.Host);
}

www.jonasjohn.de/snippets/csharp/ and jonasjohn.de/snippets/csharp/ aren't valid URLs though. TryCreate can still parse them as relative URLs, but reading Host throws System.InvalidOperationException: This operation is not supported for a relative URI.
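One way to avoid that exception is to check IsAbsoluteUri before reading Host. The following is a minimal sketch; the class and method names (UriProbe, HostOrEmpty) are made up for illustration and are not part of the framework:

```csharp
using System;

public static class UriProbe
{
    // Hypothetical helper: reads Host only when the parsed URI is absolute,
    // so relative inputs return an empty string instead of throwing.
    public static string HostOrEmpty(string input)
    {
        if (Uri.TryCreate(input, UriKind.RelativeOrAbsolute, out var uri) && uri.IsAbsoluteUri)
        {
            return uri.Host;
        }
        return string.Empty;
    }
}
```

With this helper, "http://google.com/asdfs" yields google.com, while "www.jonasjohn.de/snippets/csharp/" yields an empty string because it is parsed as a relative URI.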

In that case you can use the UriBuilder class to parse and modify the URL, e.g.:

var bld=new UriBuilder("jonasjohn.com");
Console.WriteLine(bld.Host);

This prints

jonasjohn.com

Setting the Scheme property produces a valid, complete URL:

bld.Scheme="https";
Console.WriteLine(bld.Uri);

This produces:

https://jonasjohn.com:80/
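The two approaches can be combined into a single helper that covers inputs with and without a scheme. This is only a sketch under the assumption that every input is either an absolute URL or a host optionally followed by a path; HostExtractor and ExtractHost are illustrative names, and the result still contains any subdomain such as www:

```csharp
using System;

public static class HostExtractor
{
    // Hypothetical helper: returns the host of an input that may lack a scheme.
    public static string ExtractHost(string input)
    {
        if (string.IsNullOrWhiteSpace(input))
        {
            return string.Empty;
        }

        // Inputs with a scheme (http://, https://, ftp://) parse directly.
        if (Uri.TryCreate(input, UriKind.Absolute, out var uri) && !string.IsNullOrEmpty(uri.Host))
        {
            return uri.Host;
        }

        try
        {
            // UriBuilder assumes "http://" when the input has no scheme.
            return new UriBuilder(input.Trim()).Host;
        }
        catch (UriFormatException)
        {
            // Input that cannot be interpreted as a URL at all.
            return string.Empty;
        }
    }
}
```

For example, ExtractHost("ftp://johnsmith.eu") gives johnsmith.eu and ExtractHost("www.jonasjohn.ac.ir/snippets/csharp") gives www.jonasjohn.ac.ir.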

edited Jul 05 '21 at 12:56

answered Jul 05 '21 at 12:13


It sounds good, but there is one problem: if you input something like "jonasjohn.com" you get this error: This operation is not supported for a relative URI. – Feri Jul 05 '21 at 12:48
@Feri using what code? `UriBuilder` works. As for Uri, I already explained this doesn't work because ... why assume this is a domain instead of the fifth part of a relative Url? `jonasjohn.com` isn't a valid URL, but it's a *valid* relative URL. `http://mysite/jonasjohn.com` is a valid URL. So is `http://mysite` – Panagiotis Kanavos Jul 05 '21 at 12:51

score 2 · Accepted Answer · answered Jul 05 '21 at 13:49

Based on lidqy's answer, I wrote this function, which I think supports most of the possible inputs. If the input falls outside these cases, you can throw an exception instead of returning an empty string.

using System.Linq;
using System.Text.RegularExpressions;

public static string ExtractDomainName(string Url)
{
    // Optional scheme and "www." prefix, then everything up to the first '/' as the host.
    var regex = new Regex(@"^((https?|ftp)://)?(www\.)?(?<domain>[^/]+)(/|$)");

    Match match = regex.Match(Url);
    if (!match.Success)
    {
        return String.Empty;
    }

    string domain = match.Groups["domain"].Value;

    // Drop the leftmost label until at most two dots (three labels) remain.
    while (domain.Count(x => x == '.') > 2)
    {
        domain = domain.Split('.', 2)[1];
    }
    return domain;
}
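The dot-counting loop above cannot tell a two-part suffix such as co.nz from a subdomain, so "ftp.android.co.nz" and "def.jonasjohn.de" are both left with three labels. A possible alternative is to match the host against a set of known suffixes, in the spirit of the Public Suffix List mentioned in the question. The sketch below is an illustration only: the suffix set is a small hand-picked sample rather than the real list, and SuffixMatcher and RegistrableDomain are made-up names. It expects a bare host, without scheme or path.

```csharp
using System;
using System.Collections.Generic;

public static class SuffixMatcher
{
    // Assumption: a tiny hand-picked sample. A real program would load the full Public Suffix List.
    private static readonly HashSet<string> SampleSuffixes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase)
        {
            "com", "de", "eu", "ca", "com.au", "co.nz", "gov.ae", "ac.ir"
        };

    // Returns the longest matching suffix plus one label to its left,
    // or an empty string when no suffix in the sample matches.
    public static string RegistrableDomain(string host)
    {
        string[] labels = host.Split('.');

        // i is the index where the candidate suffix starts. Starting at 1 tries the
        // longest suffix first and always leaves one label in front of it.
        for (int i = 1; i < labels.Length; i++)
        {
            string suffix = string.Join(".", labels, i, labels.Length - i);
            if (SampleSuffixes.Contains(suffix))
            {
                return labels[i - 1] + "." + suffix;
            }
        }
        return string.Empty;
    }
}
```

With this sample set, "subdomain.abc.def.jonasjohn.de" reduces to jonasjohn.de and "www.downloads.news.com.au" reduces to news.com.au. Hosts whose suffix is not in the set return an empty string, which is why the full list would be needed in practice.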

score 1 · Answer 3 · answered Jul 05 '21 at 12:09

var rx = new Regex(@"^((https?|ftp)://)?(www\.)?(?<domain>[^/]+)(/|$)");
var data = new[]
{
    "jonasjohn.com",
    "http://www.jonasjohn.de/snippets/csharp/",
    "jonasjohn.de",
    "www.jonasjohn.de/snippets/csharp/",
    "http://www.answers.com/article/1194427/8-habits-of-extraordinarily-likeable-people",
    "http://www.apple.com",
    "https://www.cnn.com.au",
    "http://www.downloads.news.com.au",
    "https://ftp.android.co.nz",
    "http://global.news.ca",
    "https://www.apple.com/",
    "https://ftp.android.co.nz/",
    "http://global.news.ca/",
    "https://www.apple.com/",
    "https://johnsmith.eu",
    "ftp://johnsmith.eu",
    "johnsmith.gov.ae",
    "johnsmith.eu",
    "www.jonasjohn.de",
    "www.jonasjohn.ac.ir/snippets/csharp",
    "http://www.jonasjohn.de/",
    "ftp://www.jonasjohn.de/",
    "https://subdomain.abc.def.jonasjohn.de/test.htm"
};

foreach (var dat in data)
{
    var match = rx.Match(dat);
    if (match.Success)
        Console.WriteLine("{0} => {1}", dat, match.Groups["domain"].Value);
    else
        Console.WriteLine("{0} => NO MATCH", dat);
}

Thanks for the answer. It works on some inputs but not on ones like "https://subdomain.abc.def.jonasjohn.de/test.htm" — Feri, Jul 05 '21 at 12:41
Afaics this convention of adding a 2-letter suffix after .com or .co is restricted to .uk, .nz, .au and maybe some other Commonwealth domains, so try it with this: `@"^((https?|ftp)://)?(www\.)?[\w\.]*?(?<domain>\w+\.\w+)(\.(uk|au|nz|ir|ae))?(/|$)"` — lidqy, Jul 05 '21 at 13:09
