How to remove duplicate domains from a large list of URLs in C#

Question

I want to remove duplicate domains from a large list of URLs in C#.

For example, if the list was:

https://example.com/example.php/login/
https://2example2.com/example/
https://example.com/register.php/
https://example.com/info/
https://example.com/example.php/login/
https://2example2.com/register/

I need to remove every URL whose domain has already appeared, keeping only the first URL for each domain,

so this would be the end result:

https://example.com/example.php/login/
https://2example2.com/example/

Can anyone help me? I know how to separate the domain from the rest of the URL but I'm not sure how to keep the first one.

And where does the list come from: a text file, an existing List, a DB, etc.? — Martheen, Apr 06 '21 at 11:36

score 2 · Answer 1 · answered Apr 06 '21 at 11:46

You can use LINQ for this, e.g.:

using System;
using System.Linq;

class Program
{
    // Sample input: one URL per line.
    static readonly string urls = @"https://example.com/example.php/login/
https://2example2.com/example/
https://example.com/register.php/
https://example.com/info/
https://example.com/example.php/login/
https://2example2.com/register/";

    static void Main()
    {
        var paths = urls.Split(new[] { '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(u => new Uri(u))                // parse each line as a URI
            .GroupBy(u => u.Host)                   // groups come out in first-seen order
            .Select(g => g.First().AbsoluteUri);    // keep the first URL of each host
        foreach (var p in paths)
        {
            Console.WriteLine(p);
        }
    }
}
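
One caveat with the snippet above: `new Uri(u)` throws a `UriFormatException` on a malformed line, which is easy to hit in a large list. A minimal sketch of the same `GroupBy` idea that skips unparsable lines instead, using `Uri.TryCreate`. The names `UrlGrouping` and `FirstUrlPerHost` are hypothetical, not part of the answer.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class UrlGrouping
{
    // Returns the first URL seen for each host, in input order.
    // Lines that are not absolute URLs are silently skipped (an assumption;
    // logging or collecting them may be preferable in real code).
    public static List<string> FirstUrlPerHost(IEnumerable<string> lines)
    {
        return lines
            .Select(line => Uri.TryCreate(line.Trim(), UriKind.Absolute, out var uri) ? uri : null)
            .OfType<Uri>()                               // drops the nulls from failed parses
            .GroupBy(uri => uri.Host)                    // groups are yielded in first-seen order
            .Select(group => group.First().AbsoluteUri)  // first URL of each host
            .ToList();
    }
}
```

It could be called as `UrlGrouping.FirstUrlPerHost(urls.Split(new[] { '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries))`. Grouping still buffers the whole input before producing results, so for very large inputs a streaming approach may be a better fit.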

score 1 · Answer 2 · answered Apr 06 '21 at 11:59

Instead of removing the items you don't want, it may be simpler to build a new collection containing only the items you want. This is not the most efficient way to do it, but it's simple and it will do the job.

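// urls is the input collection of URL strings.
Dictionary<string, string> domains = new Dictionary<string, string>();
foreach (string url in urls) {
    string domain = YourFunctionToSeparateTheDomainFromTheRestOfTheURL(url);
    // Only the first URL seen for each domain is stored; later ones are skipped.
    if (!domains.ContainsKey(domain)) {
        domains.Add(domain, url);
    }
}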

You now have a dictionary where the key is the domain and the value is the first URL with that domain.
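
Since `Dictionary<TKey, TValue>` does not document any enumeration order, a variant that tracks seen domains in a `HashSet<string>` and yields URLs as it goes may be safer when the original order matters. This is only a sketch: `DomainFilter` and `KeepFirstPerDomain` are hypothetical names, and `new Uri(url).Host` stands in for the domain-extraction function, which is an assumption about what "domain" means here (for instance, `www.example.com` and `example.com` would count as different domains).

```csharp
using System;
using System.Collections.Generic;

public static class DomainFilter
{
    // Lazily yields each URL whose domain has not been seen before,
    // so the output preserves the input order.
    public static IEnumerable<string> KeepFirstPerDomain(IEnumerable<string> urls)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        foreach (string url in urls)
        {
            // Assumption: the host part of the URL is the "domain".
            // new Uri(...) throws UriFormatException on malformed input.
            string domain = new Uri(url).Host;

            // HashSet.Add returns false when the value is already present.
            if (seen.Add(domain))
            {
                yield return url;
            }
        }
    }
}
```

Because the method is an iterator, it can be fed a lazily read source such as `File.ReadLines(path)` without loading the whole list into memory first; only the set of distinct domains is kept.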

Please read the question. — Cetin Basoz, Apr 06 '21 at 12:34
