I have a product Catalog object with up to 1 million products in it. The following code shows the Catalog class, with some test code that populates 1 million dummy products for testing:

using System;
using System.Collections.Generic;

public class Catalog
{
    Random random = new Random();

    long Id { get; set; }
    public string Name { get; set; }
    public List<string> Products { get; set; }

    public Catalog()
    {
        Products = new List<string>();
        addProducts();
    }

    // Test helper: fills the catalog with 1 million random dummy product names.
    private void addProducts()
    {
        for (int i = 0; i < 1000000; i++)
        {
            Products.Add(random.Next(0, 100000000).ToString());
        }
    }
}

I have about 300-600 Catalog objects (each with about 1 million products) and need to check whether any two catalogs share a common product. I just need to check; I don't need to find out which products are the same. The logic I am using is something like this:

static bool SearchDuplicateProducts(Catalog catalogA, Catalog catalogB)
{
    var found = false;

    foreach (string product in catalogA.Products)
    {
        if (catalogB.Products.Contains(product))
        {
            found = true;
            break;
        }
    }

    return found;
}

Of course, List<string> is not the fastest collection to search, so I tried HashSet<string>. My tests showed about a 200% increase in search speed in SearchDuplicateProducts() when I used HashSet<> instead of List<> to hold Products.
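
For reference, the HashSet-based check can be written roughly as below (a minimal sketch, assuming Products is changed from List<string> to HashSet<string>; HashSet<T>.Overlaps enumerates its argument, probes the set, and stops at the first common element):

// Sketch only: assumes Catalog.Products is declared as HashSet<string>.
static bool SearchDuplicateProducts(Catalog catalogA, Catalog catalogB)
{
    // Overlaps enumerates catalogB.Products and checks each item against
    // catalogA's hash set, returning as soon as a match is found, so the
    // check is roughly O(n) per pair instead of O(n * m).
    return catalogA.Products.Overlaps(catalogB.Products);
}

Note that with 300-600 catalogs, checking every pair still means up to roughly 180,000 pairwise checks.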

I am not sure, though, whether using HashSet<string> for the product list is the best or most efficient way to achieve what I am doing in SearchDuplicateProducts(). I want to know if there is any way (a third-party library, a database, a trie, or another algorithm) that can give me better results in terms of space and time complexity. If there is a choice between the two, I would prefer better time complexity.

I have checked similar questions:

  1. Best Way to compare 1 million List of object with another 1 million List of object in c#
  2. How to quickly search through a very large list of strings / records on a database
  3. C#: Memory-efficient search through 2 million objects without external dependencies

Thanks for your help.

silverspoon
  • Although there's probably an 'optimal' way to do this in C# (always measure!), this sounds like a bigger challenge in data management. The context is important for the "solution": a different way of storing the data may be the best option. Maybe this search is only done once, so it's not even worth putting much effort into optimizing it. And if it is run very often, you could cache unique objects in a separate NoSQL db and check there. Etc. – JHBonarius Feb 27 '21 at 09:57
  • @JHBonarius, appreciate your comment. What did you mean by "A different way of storing the data may be the best option"? Do you mean a different data structure to hold the products? – silverspoon Feb 27 '21 at 10:12
  • You could check out whether the [Bloom filter](https://en.wikipedia.org/wiki/Bloom_filter) is a suitable data structure for your use case. It is a space-efficient probabilistic data structure. It can give a definitive answer `false` in case no duplicate exists, but an answer of `true` could be wrong, and would require doing the check again using a non-probabilistic algorithm (see the sketch after this comment thread). My guess is that the false-positives ratio would be too high in your case, so I don't think it would actually be helpful. – Theodor Zoulias Feb 27 '21 at 10:13
  • Searching by string comparison is very inefficient. Can you give each product name an "Id" number? If you can arrange for duplicate products in catalogA and catalogB to have the same Id number, you can build an algorithm based on `Binary Search` and the efficiency will increase significantly. If that's possible, I can post a solution. – Jonathan Applebaum Feb 27 '21 at 10:21
  • @jonathana, I think I understand your idea, thank you. But I may not be able to have the same Ids for duplicate products across catalogs. And isn't a very similar technique used by HashSet anyway? HashSet uses the object's hash (similar to the Ids you mentioned) to index items. – silverspoon Feb 27 '21 at 10:44
  • I will list what I think should be taken into account: Ordinal string compare (see [Best practices](https://learn.microsoft.com/en-us/dotnet/standard/base-types/best-practices-strings)), [BinarySearch](https://learn.microsoft.com/en-us/dotnet/api/system.collections.generic.list-1.binarysearch?view=net-5.0) (must first sort the data), [SortKey](https://learn.microsoft.com/en-us/dotnet/api/system.globalization.sortkey?view=net-5.0) class, [Search engine indexing](https://en.wikipedia.org/wiki/Search_engine_indexing) (see [Lucene](https://www.nuget.org/packages/Lucene.Net/)) – Alexander Petrov Feb 27 '21 at 11:06
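
To illustrate the Bloom filter suggestion from the comments, here is a hypothetical minimal sketch (the bit-array size, number of hash functions, and FNV-1a helper are illustrative choices, not from the discussion); a `true` result from MightContain would still need to be confirmed with an exact check such as the HashSet approach above:

using System;
using System.Collections;
using System.Collections.Generic;

// Hypothetical minimal Bloom filter for strings (illustrative only).
// The caller chooses the size in bits and the number of hash functions.
class BloomFilter
{
    private readonly BitArray _bits;
    private readonly int _hashCount;

    public BloomFilter(int sizeInBits, int hashCount)
    {
        _bits = new BitArray(sizeInBits);
        _hashCount = hashCount;
    }

    public void Add(string item)
    {
        foreach (int index in Indexes(item))
            _bits[index] = true;
    }

    // false = definitely not present; true = possibly present (may be a false positive).
    public bool MightContain(string item)
    {
        foreach (int index in Indexes(item))
            if (!_bits[index])
                return false;
        return true;
    }

    // Double hashing: derive k bit positions from two base hashes.
    private IEnumerable<int> Indexes(string item)
    {
        int h1 = item.GetHashCode();
        int h2 = Fnv1a(item);
        for (int i = 0; i < _hashCount; i++)
        {
            int combined = unchecked(h1 + i * h2);
            yield return (int)((uint)combined % (uint)_bits.Length);
        }
    }

    // Simple FNV-1a hash used as the second base hash.
    private static int Fnv1a(string s)
    {
        unchecked
        {
            int hash = (int)2166136261;
            foreach (char c in s)
                hash = (hash ^ c) * 16777619;
            return hash;
        }
    }
}

One possible use: add all of catalogA's products to a filter sized generously for ~1 million items (several million bits), call MightContain for each of catalogB's products, and fall back to the exact check only when it returns true.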

1 Answer


Are you comparing each pair of the catalogs? That's insane!

First, you should purge duplicates from every catalog individually; your random.Next will likely produce a few. Or just don't insert them.

Then you should run a single loop through all catalogs, trying to insert each product into a hash set; if it's already there, you've found a duplicate.
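
A minimal sketch of that single pass (names are illustrative; the per-catalog HashSet stands in for the "purge duplicates first" step, so only cross-catalog repeats are reported):

using System.Collections.Generic;

static bool AnyCommonProductAcrossCatalogs(IEnumerable<Catalog> catalogs)
{
    var seen = new HashSet<string>();

    foreach (Catalog catalog in catalogs)
    {
        // Deduplicate within the catalog first, so a product repeated inside
        // one catalog does not count as a cross-catalog duplicate.
        foreach (string product in new HashSet<string>(catalog.Products))
        {
            // Add returns false when the product was already inserted by an
            // earlier catalog, i.e. a common product exists.
            if (!seen.Add(product))
                return true;
        }
    }

    return false;
}

This is one pass over all products (roughly O(total product count) time and memory) instead of up to ~180,000 pairwise comparisons; to know which catalogs share a product, replace the set with a dictionary keyed by product, as noted in the comment below.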

Vlad Feinstein
  • random.Next() is just dummy code I added to make my sample easy to understand here. Obviously, in real life this won't be the case and each product in a catalog will be unique. Apart from that, what do you suggest as a better way, given "Are you comparing each pair of the catalogs?"? If I don't compare a catalog with each of the remaining ones, how would I know whether those 2 catalogs have a common product? – silverspoon Feb 28 '21 at 00:56
  • @silverspoon the way I described will tell you if ANY of the catalogs contain duplicate items, because you will insert items from all catalogs into the set. If you care which ones, just use a map (product → catalog #) instead of the set. – Vlad Feinstein Feb 28 '21 at 01:40