9

I have a class like this

public class Category 
{
    public int CategoryID;
    public List<Product> ProductList;
}
public class Product 
{
    public int CategoryID;
}

The List<Category> has around 15k rows and the List<Product> around 40k rows. I'm doing it with LINQ like this:

CategoryList.ForEach(i =>
{
    i.ProductList = ProductList.Where(x => x.CategoryID == i.CategoryID).ToList();
});

I want to know whether there is any implementation with better performance for this. The data could increase or decrease in other cases.

Artem Kulikov
user1392853

7 Answers

12

I think it's silly to use parallelism, when a simple change to the algorithm can speed it up thousandfold. Assuming CategoryID has a decent implementation of GetHashCode() a dictionary/lookup has nearly O(1) lookup times compared to the O(n) time of scanning through a list.

Two possible approaches:

Turn categories into a dictionary

Assuming categories have unique IDs you can use:

var categoryById = categories.ToDictionary(c => c.CategoryID);
foreach(var product in products)
{
    var category = categoryById[product.CategoryID];
    category.Products.Add(product);
}

This has runtime O(products.Count + categories.Count) and uses O(categories.Count) memory.

If categories don't start out with an empty list, you might need to create it.
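Putting both steps together, a minimal sketch could look like this (assuming the `Category`/`Product` classes from the question, with the question's `ProductList` field name; the tiny sample data is only for illustration):

```csharp
var categories = new List<Category>
{
    new Category { CategoryID = 1 },
    new Category { CategoryID = 2 },
};
var products = new List<Product>
{
    new Product { CategoryID = 1 },
    new Product { CategoryID = 1 },
    new Product { CategoryID = 2 },
};

// Index categories by ID once: O(categories.Count)
var categoryById = categories.ToDictionary(c => c.CategoryID);

// Create the per-category lists if they don't exist yet
foreach (var category in categories)
    if (category.ProductList == null)
        category.ProductList = new List<Product>();

// Single pass over the products: O(products.Count)
foreach (var product in products)
    categoryById[product.CategoryID].ProductList.Add(product);
```

Note that a product whose CategoryID has no matching category would throw a KeyNotFoundException here; use TryGetValue if that can happen in your data.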

Turn products into a Lookup

var productsByCategory = products.ToLookup(product => product.CategoryID);
foreach(var category in categories)
{
    category.Products = productsByCategory[category.CategoryID].ToList();
}

This has runtime O(products.Count + categories.Count) and uses O(products.Count) memory.

Since there are typically more products than categories, this approach will require more memory. On the other hand the lookup might remove the need to embed a list of products in the category object.

CodesInChaos
  • And when it go to 40k of categories and 1m2 of products I think this will consume a lot of memory – user1392853 Aug 17 '15 at 09:54
  • 1
    @user1392853 The first approach will use something like 1-2 MB for 40k categories. That's much smaller than the data you're already keeping in memory. You can use that dictionary whenever you need to lookup a category. Compare this to the other algorithms which are already slow for your current amount of data and will slow down by a factor of 70 when scaling up to the amounts you mentioned in your comment. – CodesInChaos Aug 17 '15 at 10:05
  • 1
    *Don't forget that Big-O notations only says how the complexity grows with respect to the size (etc) - it doesn't give any indication of the constant factors involved. That's why sometimes even a linear search for keys is faster than a dictionary lookup...* This is quote from Jon Skeet's comment about this matter that can be found here : http://stackoverflow.com/questions/908050/optimizing-lookups-dictionary-key-lookups-vs-array-index-lookups – Fabjan Aug 17 '15 at 10:13
  • 1
    @Fabjan Not sure how that's relevant here. The slightly bigger overhead of a dictionary lookup compared to a list lookup isn't even close to the advantage of not having to scan through 10k+ categories for each product. – CodesInChaos Aug 17 '15 at 10:18
5

You can use the parallel implementation of LINQ - PLINQ. PLINQ usually speeds up LINQ queries because it uses all available cores more efficiently.

CategoryList.AsParallel().ForAll(i =>
{
    i.ProductList = ProductList.AsParallel().Where(x => x.CategoryID == i.CategoryID).ToList();
});
Artem Kulikov
  • 2
    It's pretty silly to parallelize inefficient code running in O(n*m) when a small change running in O(n+m) would result in a thousandfold improvement in performance. – CodesInChaos Aug 17 '15 at 09:20
  • Well, I must admit, `ToLookup` is the better variant. `Dictionary` isn't suitable, because the ID isn't unique. Didn't know about this class, thank you for the notice. – Artem Kulikov Aug 17 '15 at 10:03
  • 1
    CategoryID isn't unique among products, but it's probably unique among categories. So you can turn the categories into a dictionary. And of course a `Dictionary>` can function as a lookup (useful if you need a mutable lookup, since .NET only ships with an immutable lookup). – CodesInChaos Aug 17 '15 at 10:08
  • PLINQ also has the "downside" of being unordered by default, which might be an issue. There's a few other things that can effect whether or not it's actually faster, although `Where` can be trivially parallelized. – Clockwork-Muse Aug 17 '15 at 10:56
4

You can also use a GroupJoin instead of multiple Where calls: https://msdn.microsoft.com/en-us/library/bb397905.aspx

While Parallel will offer you at most a speedup proportional to the number of cores, GroupJoin gives you linear complexity.

From @CodesInChaos

If the number of categories and products are on the same scale, you'll get a speedup from quadratic to linear runtime
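The GroupJoin version could look like this (a sketch using the CategoryList/ProductList names from the question; GroupJoin builds a hash-based lookup over the inner sequence internally, which is where the linear runtime comes from):

```csharp
var query = CategoryList.GroupJoin(
    ProductList,                                      // inner sequence
    c => c.CategoryID,                                // outer key selector
    p => p.CategoryID,                                // inner key selector
    (c, ps) => new { Category = c, Products = ps });  // result per category

foreach (var pair in query)
    pair.Category.ProductList = pair.Products.ToList();
```

Categories with no matching products get an empty list, which the Where-based original also produced.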

tafia
  • "GroupJoin can provide exponential boost." Since the original algorithm has polynomial runtime an exponential speedup is impossible. If the number of categories and products are on the same scale, you'll get a speedup from quadratic to linear runtime. – CodesInChaos Aug 17 '15 at 09:40
  • I thought it would be *from quadratic to n*log(n)*. How is it linear ? – tafia Aug 17 '15 at 09:54
  • Typically we consider a lookup in a hash table as O(1) despite the potential complications. O(log(n)) lookup is typical for ordered/tree-based lookup tables. But the distinction between O(1) and O(log(n)) is rather academic, since logarithms grow so slowly. – CodesInChaos Aug 17 '15 at 09:57
3

You may gain performance by organizing the product list grouped per category, e.g. as a Dictionary<int, List<Product>>. You then only have to pick the category's product list by its key instead of executing a LINQ query over the whole product list and generating new sub-lists.

CategoryList.ForEach(i =>
{
    i.ProductList = ProductDict[i.CategoryID];
});
Patrick
2

After testing performance I found out that a better solution than the one I posted before is this:

    static List<Category> FilterList(List<Category> list, List<Product> ProductList)
    {
        Parallel.For(0, list.Count, i =>
        {
            list[i].ProductList = new List<Product>();
            for (int j = 0; j < ProductList.Count; j++)
                if (ProductList[j].CategoryID == list[i].CategoryID)
                    list[i].ProductList.Add(ProductList[j]);
        });

        return list;
    }

And then list = FilterList(list, products);

For 3,500,000 operations it took only 200ms

Fabjan
2

Actually, the most efficient solution in such a case would be to create CategoryList from ProductList. Something like this:

CategoryList = ProductList
    .GroupBy(p => p.CategoryID)
    .Select(g =>
        new Category
        {
            CategoryID = g.Key,
            ProductList = g.ToList()
        })
    .ToList();
Vitaliy Kalinin
2

If CategoryList already exists, you can first group the products by CategoryID, and then join the grouped products with the categories.

var query =
    from p in ProductList
    group p by p.CategoryID
        into grouped        
    join c in CategoryList
        on grouped.Key equals c.CategoryID
    select new { c, grouped };

Then use a loop to set the ProductList field of the Category objects.

foreach (var item in query)
    item.c.ProductList = item.grouped.ToList();
Michael Spranger