I want to remove duplicate domains from a large list of URLs in C#.

For example, if the list was:

https://example.com/example.php/login/
https://2example2.com/example/
https://example.com/register.php/
https://example.com/info/
https://example.com/example.php/login/
https://2example2.com/register/

I need to remove every URL whose domain has already appeared, keeping only the first URL for each domain.

So this would be the end result:

https://example.com/example.php/login/
https://2example2.com/example/

Can anyone help me? I know how to separate the domain from the rest of the URL, but I'm not sure how to keep only the first URL for each domain.

2 Answers

You can use LINQ for this. For example:

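using System;
using System.Linq;

class Program
{
    // Sample input: one URL per line.
    const string urls = @"https://example.com/example.php/login/
https://2example2.com/example/
https://example.com/register.php/
https://example.com/info/
https://example.com/example.php/login/
https://2example2.com/register/";

    static void Main()
    {
        // Parse each line, group by host, and keep the first URL of each group.
        // GroupBy yields groups in order of first appearance of each key.
        var paths = urls.Split(new[] { '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(u => new Uri(u))
            .GroupBy(u => u.Host)
            .Select(g => g.First().AbsoluteUri);
        foreach (var p in paths)
        {
            Console.WriteLine(p);
        }
    }
}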
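
If the real list may contain blank or malformed lines, note that `new Uri(u)` throws a `UriFormatException` on input it cannot parse. The following is a minimal sketch of the same grouping idea that skips such lines by using `Uri.TryCreate` instead. The class name `UrlDedup` and the method name `FirstPerHost` are hypothetical, chosen only for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class UrlDedup
{
    // Hypothetical helper: returns the first URL seen for each host,
    // silently skipping lines that are not valid absolute URLs.
    public static List<string> FirstPerHost(IEnumerable<string> lines)
    {
        return lines
            // TryCreate returns false instead of throwing on bad input.
            .Select(l => Uri.TryCreate(l.Trim(), UriKind.Absolute, out var uri) ? uri : null)
            .Where(uri => uri != null)
            // Groups come back in order of first appearance of each host.
            .GroupBy(uri => uri.Host)
            .Select(g => g.First().AbsoluteUri)
            .ToList();
    }
}
```

Calling `UrlDedup.FirstPerHost(urls.Split('\n'))` on the sample string above should give the same two URLs as the original query, assuming every line parses.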
Cetin Basoz

Instead of removing the items you don't want, it may be simpler to build a new collection containing only the items you do want. This is not the most efficient way to do it, but it's simple and it will do the job.

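// Maps each domain to the first URL seen with that domain.
// Here urls is assumed to be a collection of URL strings.
Dictionary<string, string> domains = new Dictionary<string, string>();
foreach (string url in urls) {
    // Placeholder for your own domain-extraction routine.
    string domain = YourFunctionToSeparateTheDomainFromTheRestOfTheURL(url);
    // Only the first URL for each domain is stored; later ones are skipped.
    if (!domains.ContainsKey(domain)) {
        domains.Add(domain, url);
    }
}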

You now have a dictionary where the key is the domain and the value is the first URL with that domain.
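
One caveat: the documentation for `Dictionary<TKey, TValue>` leaves its enumeration order undefined, so reading the values back is not guaranteed to reproduce the input order. If the order matters, a possible variation is to track seen domains in a `HashSet<string>` and append the kept URLs to a `List<string>`. The sketch below assumes that "domain" means `Uri.Host`; the names `DomainFilter` and `KeepFirstPerDomain` are hypothetical, and you would substitute your own domain-extraction function.

```csharp
using System;
using System.Collections.Generic;

static class DomainFilter
{
    // Hypothetical helper: keeps the first URL for each domain, in input order.
    public static List<string> KeepFirstPerDomain(IEnumerable<string> urls)
    {
        var seen = new HashSet<string>();   // domains encountered so far
        var result = new List<string>();    // kept URLs, in input order
        foreach (string url in urls)
        {
            // Assumption: the host part of the URL is the "domain".
            string domain = new Uri(url).Host;
            // HashSet.Add returns false if the domain was already present.
            if (seen.Add(domain))
            {
                result.Add(url);
            }
        }
        return result;
    }
}
```

Because `HashSet<T>.Add` reports whether the item was new, the membership check and the insertion happen in a single call.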

Nicholas Hunter