
I have a class called Product that contains properties such as Id (a Guid) and Messages (a List&lt;Message&gt;), and a Message class that also has an Id and other properties. All messages are stored in a Message table and all products in a Product table. After loading the data from both tables, I want to join them on the Id property. If I use the code below, which is a linear search, the performance is terrible.

foreach (Product product in products)
    product.Messages = messages.Where(n => n.Id == product.Id).ToList();

Is there a faster way to do this?

Thanks

Babi

3 Answers

2

You might be able to speed it up by grouping your messages into a lookup table.

var messagesDict = messages
    .GroupBy(x => x.Id)
    .ToDictionary(x => x.Key, x => x.ToList());

Or, as John Bustos suggested, you can use ToLookup():

var messagesDict = messages
    .ToLookup(x => x.Id);

You use it like this:

// With the Dictionary you may have to check first whether messagesDict
// actually has any messages for your product (e.g. with TryGetValue);
// a Lookup returns an empty sequence for a missing key, but you will
// need ToList() if Messages is a List<Message>.
product.Messages = messagesDict[product.Id];

Your original attempt is O(nm), where n is the number of products and m is the number of messages.

A Dictionary uses hashing, so from a practical standpoint you can usually assume it has close to O(1) inserts and O(1) lookups. Under ideal circumstances, List&lt;T&gt;.Add is also O(1). This means that if you were to build your lookup dictionary manually, you could do it in O(m). I would hope that a built-in function like ToLookup achieves the same efficiency.
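
For illustration, here is a minimal sketch of building that lookup dictionary by hand in a single O(m) pass (assuming Message has a Guid Id, as described in the question):

var byId = new Dictionary<Guid, List<Message>>();

foreach (Message message in messages)
{
    List<Message> list;
    // TryGetValue is an O(1) hash lookup.
    if (!byId.TryGetValue(message.Id, out list))
    {
        list = new List<Message>();
        byId[message.Id] = list;
    }
    list.Add(message); // List<T>.Add is amortized O(1).
}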

Once you do that, your algorithm becomes O(n + m).
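
Putting it together, a minimal sketch of the full O(n + m) join, assuming (as the question describes) that Product.Messages is a List&lt;Message&gt;:

// Build the lookup in one O(m) pass over the messages...
var messagesById = messages.ToLookup(m => m.Id);

// ...then assign in one O(n) pass over the products.
foreach (Product product in products)
{
    // Indexing a Lookup with a missing key returns an empty sequence,
    // so no existence check is needed here.
    product.Messages = messagesById[product.Id].ToList();
}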

0

You should be doing the join in the database; that will yield the best performance. If you insist on doing this in C#, sort the products by Id and sort the messages by Id first.
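
A minimal sketch of that sort-then-merge approach (assuming products and messages are List&lt;Product&gt; and List&lt;Message&gt;, and a one-to-many relation from Product to Message; see also the discussion in the comments below):

// Sort both lists by Id, then walk them together in a single merge pass.
products.Sort((a, b) => a.Id.CompareTo(b.Id));
messages.Sort((a, b) => a.Id.CompareTo(b.Id));

int next = 0;
foreach (Product product in products)
{
    product.Messages = new List<Message>();

    // Skip messages whose Id sorts before this product's Id.
    while (next < messages.Count && messages[next].Id.CompareTo(product.Id) < 0)
        next++;

    // Collect every message with a matching Id.
    int i = next;
    while (i < messages.Count && messages[i].Id == product.Id)
        product.Messages.Add(messages[i++]);
}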

Charles
  • I agree with doing it in the database since you can search a B-Tree in O(log n) time. However, sorting by ID first in C# would actually be worse in my opinion; linear search is O(n) but most sorting algorithms are either O(n log n) or O(n^2), see http://bigocheatsheet.com/ – EJoshuaS - Stand with Ukraine Sep 01 '16 at 18:51
  • @EJoshuaS Unless you have a much larger benefit after sorting by branch prediction. See http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array – Thomas Weller Sep 01 '16 at 18:52
  • @EJoshuaS, that's assuming that products are 1 to many with messages, which while likely, isn't at all indicated in the question. And as Thomas mentioned, branch prediction. – Charles Sep 01 '16 at 18:53
  • @ThomasWeller That's a fair point; that post you linked to explains the issue well. On the one hand, if you're doing an O(n^2) operation, for example, the algorithm as a whole couldn't possibly be any better than O(n^2), but looking at that post I can see how you could actually save time overall (even if you can't do better than the sorting algorithm in terms of computational complexity). – EJoshuaS - Stand with Ukraine Sep 01 '16 at 19:11
  • I suppose that, depending on the characteristics of the sorted list, you might be able to do a de facto binary search. Kind of reminds me of a problem I was given in an interview: suppose you have a sorted array with the ages of every individual in the world (so, for example, if you had 1,000,000 people of age 1 you'd have 1,000,000 consecutive 1's in the array). How do you count the number of people in an age group? To simplify, my answer was that the sorting lets you do a binary search variant to find the index of the first and last instance of that number and then subtract them to count. – EJoshuaS - Stand with Ukraine Sep 01 '16 at 19:24
  • @EJoshuaS, exactly why the O(n log n) sorting might be worthwhile. – Charles Sep 01 '16 at 19:29
  • @EJoshuaS you don't even have to do a binary search in this case. If you sort both products and messages, then you can effectively iterate over both at the same time, resulting in O(n lg(n) + m lg(m)) runtime. – Sam I am says Reinstate Monica Sep 01 '16 at 19:29
  • @SamIam fair point, that would be a pretty effective algorithm. – EJoshuaS - Stand with Ukraine Sep 01 '16 at 19:35
0

As others have indicated, do the join in the database. As you indicated, a linear search is O(n) (and you're actually doing quite a few linear searches in this case); however, most databases use a B-Tree data structure (or something similar) to sort rows by primary key. This means that a database search on a primary key is O(log n), which is obviously dramatically faster. (Assuming, of course, that the Id is the primary key).
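
As a sketch, if the data is queried through a LINQ provider such as Entity Framework (an assumption; dbContext and its DbSet names here are hypothetical), the join can be written so the provider has a chance to translate it into a single SQL JOIN on the server:

// Hypothetical EF-style context; whether this becomes a single SQL JOIN
// depends on the provider and its version.
var productsWithMessages =
    (from product in dbContext.Products
     join message in dbContext.Messages
         on product.Id equals message.Id into msgs   // group join on Id
     select new { Product = product, Messages = msgs })
    .ToList();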

  • I need to have all records on the client side, so I prefer to join them in C#. Anyway, ToLookup solved my issue. Thanks :) – Babi Sep 01 '16 at 20:12