I have a product Catalog object with up to 1 million products in it. The following code shows the Catalog class, along with some test code that populates it with 1 million dummy products:
public class Catalog
{
    Random random = new Random();

    long Id { get; set; }
    public string Name { get; set; }
    public List<string> Products { get; set; }

    public Catalog()
    {
        Products = new List<string>();
        AddProducts();
    }

    // Test code: fill the catalog with 1 million random dummy product codes.
    private void AddProducts()
    {
        for (int i = 0; i < 1000000; i++)
        {
            Products.Add(random.Next(0, 100000000).ToString());
        }
    }
}
I have about 300-600 Catalog objects (with about 1 million products each) and need to check whether any two catalogs have products in common. I only need a yes/no answer; I don't need to know which products are shared. The logic I am using is something like this:
static bool SearchDuplicateProducts(Catalog catalogA, Catalog catalogB)
{
    // With List<string>, Contains is a linear scan, so this is O(n * m) per pair.
    foreach (string product in catalogA.Products)
    {
        if (catalogB.Products.Contains(product))
        {
            return true;
        }
    }
    return false;
}
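For context, this check runs over every pair of catalogs, roughly like the sketch below (AnyCatalogsShareProducts is just an illustrative name; with 300-600 catalogs that is up to ~180,000 pairwise calls):

static bool AnyCatalogsShareProducts(List<Catalog> catalogs)
{
    // Compare every unordered pair; stop at the first pair with a shared product.
    for (int i = 0; i < catalogs.Count; i++)
    {
        for (int j = i + 1; j < catalogs.Count; j++)
        {
            if (SearchDuplicateProducts(catalogs[i], catalogs[j]))
            {
                return true;
            }
        }
    }
    return false;
}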
Of course, List<string> is not the fastest collection to search, so I tried HashSet<string>. My tests showed about a 200% increase in search speed in SearchDuplicateProducts() when I used HashSet<> instead of List<> to hold the products.

I am not sure, though, whether HashSet<string> is the best or most efficient way to achieve what SearchDuplicateProducts() does. I want to know if there is any way (using a third-party library, a database, a trie, or another algorithm) that can give me better results in terms of space and time complexity. If there is a trade-off between the two, I would prefer better time complexity.
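For reference, a minimal sketch of the HashSet version, using the built-in HashSet<T>.Overlaps (HaveCommonProducts is just an illustrative name, and I am assuming Products is changed to HashSet<string>):

using System.Collections.Generic;

static bool HaveCommonProducts(HashSet<string> productsA, HashSet<string> productsB)
{
    // Overlaps walks productsB and checks each element against productsA
    // with O(1) lookups, returning true at the first common element.
    return productsA.Overlaps(productsB);
}

Since Overlaps scans its argument, passing the smaller collection as productsB keeps the worst case bounded by the smaller catalog's size.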
I have already checked these similar questions:
- Best Way to compare 1 million List of object with another 1 million List of object in c#
- How to quickly search through a very large list of strings / records on a database
- C#: Memory-efficient search through 2 million objects without external dependencies
Thanks for your help.