
I have a class called Product that contains properties such as Id (a Guid) and Messages (a List&lt;Message&gt;), and a Message class that also has an Id and other properties. All messages are stored in a Message table and all products in a Product table. After loading the data from both tables, I want to join them on the Id property. If I use the code below, which is a linear search, the performance is terrible.

foreach (Product product in products)
    product.Messages = messages.Where(n => n.Id == product.Id).ToList();

Is there a faster way to do this?

Thanks

Babi

3 Answers

2

You might be able to speed it up by grouping your messages into a lookup table.

var messagesDict = messages
    .GroupBy(x => x.Id)
    .ToDictionary(x => x.Key, x => x.ToList());

Or, as John Bustos suggested, you can use ToLookup():

var messagesDict = messages
    .ToLookup(x => x.Id);

You use it like this:

// With the Dictionary you may have to check first whether messagesDict
// actually has any messages for your product (e.g. with TryGetValue);
// a Lookup returns an empty sequence for a missing key, but you will
// need ToList() if Messages is a List<Message>.
product.Messages = messagesDict[product.Id];

Your original attempt is O(nm), where n is the number of products and m is the number of messages.

A Dictionary uses hashing, so from a practical standpoint you can usually assume it has close to O(1) inserts and O(1) lookups. Under ideal circumstances, List&lt;T&gt;.Add is also O(1). This means that if you were to build your lookup dictionary manually, you could do it in O(m). I would hope that a built-in function like ToLookup achieves the same efficiency.
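
For illustration, here is a minimal sketch of building that lookup dictionary by hand in a single O(m) pass (assuming Message has a Guid Id, as described in the question):

var byId = new Dictionary<Guid, List<Message>>();

foreach (Message message in messages)
{
    List<Message> list;
    // TryGetValue is an O(1) hash lookup.
    if (!byId.TryGetValue(message.Id, out list))
    {
        list = new List<Message>();
        byId[message.Id] = list;
    }
    list.Add(message); // List<T>.Add is amortized O(1).
}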

Once you do that, your algorithm becomes O(n + m).
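
Putting it together, a minimal sketch of the full O(n + m) join, assuming (as the question describes) that Product.Messages is a List&lt;Message&gt;:

// Build the lookup in one O(m) pass over the messages...
var messagesById = messages.ToLookup(m => m.Id);

// ...then assign in one O(n) pass over the products.
foreach (Product product in products)
{
    // Indexing a Lookup with a missing key returns an empty sequence,
    // so no existence check is needed here.
    product.Messages = messagesById[product.Id].ToList();
}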

0

You should be doing the join in the database; that will yield the best performance. If you insist on doing this in C#, sort the products by Id and sort the messages by Id first.
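
A minimal sketch of that sort-then-merge approach (assuming products and messages are List&lt;Product&gt; and List&lt;Message&gt;, and a one-to-many relation from Product to Message; see also the discussion in the comments below):

// Sort both lists by Id, then walk them together in a single merge pass.
products.Sort((a, b) => a.Id.CompareTo(b.Id));
messages.Sort((a, b) => a.Id.CompareTo(b.Id));

int next = 0;
foreach (Product product in products)
{
    product.Messages = new List<Message>();

    // Skip messages whose Id sorts before this product's Id.
    while (next < messages.Count && messages[next].Id.CompareTo(product.Id) < 0)
        next++;

    // Collect every message with a matching Id.
    int i = next;
    while (i < messages.Count && messages[i].Id == product.Id)
        product.Messages.Add(messages[i++]);
}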

Charles
  • I agree with doing it in the database since you can search a B-Tree in O(log n) time. However, sorting by ID first in C# would actually be worse in my opinion; linear search is O(n) but most sorting algorithms are either O(n log n) or O(n^2), see http://bigocheatsheet.com/ – EJoshuaS - Stand with Ukraine Sep 01 '16 at 18:51
  • @EJoshuaS Unless you have a much larger benefit after sorting by branch prediction. See http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array – Thomas Weller Sep 01 '16 at 18:52
  • @EJoshuaS, that's assuming that products are 1 to many with messages, which while likely, isn't at all indicated in the question. And as Thomas mentioned, branch prediction. – Charles Sep 01 '16 at 18:53
  • @ThomasWeller That's a fair point; that post you linked to explains the issue well. On the one hand, if you're doing an O(n^2) operation, for example, the algorithm as a whole couldn't possibly be any better than O(n^2), but looking at that post I can see how you could actually save time overall (even if you can't do better than the sorting algorithm in terms of computational complexity). – EJoshuaS - Stand with Ukraine Sep 01 '16 at 19:11
  • I suppose that, depending on the characteristics of the sorted list, you might be able to do a de facto binary search. Kind of reminds me of a problem I was given in an interview: suppose you have a sorted array with the ages of every individual in the world (so, for example, if you had 1,000,000 people of age 1 you'd have 1,000,000 consecutive 1's in the array). How do you count the number of people in an age group? To simplify, my answer was that the sorting lets you do a binary search variant to find the index of the first and last instance of that number and then subtract them to count. – EJoshuaS - Stand with Ukraine Sep 01 '16 at 19:24
  • @EJoshuaS, exactly why the O(n log n) sorting might be worthwhile. – Charles Sep 01 '16 at 19:29
  • @EJoshuaS you don't even have to do a binary search in this case. If you sort both products and messages, then you can effectively iterate over both at the same time, resulting in O(n lg(n) + m lg(m)) runtime. – Sam I am says Reinstate Monica Sep 01 '16 at 19:29
  • @SamIam fair point, that would be a pretty effective algorithm. – EJoshuaS - Stand with Ukraine Sep 01 '16 at 19:35
0

As others have indicated, do the join in the database. As you indicated, a linear search is O(n) (and you're actually doing quite a few linear searches in this case); however, most databases use a B-Tree data structure (or something similar) to sort rows by primary key. This means that a database search on a primary key is O(log n), which is obviously dramatically faster. (Assuming, of course, that the Id is the primary key).
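
As a sketch, if the data is queried through a LINQ provider such as Entity Framework (an assumption; dbContext and its DbSet names here are hypothetical), the join can be written so the provider has a chance to translate it into a single SQL JOIN on the server:

// Hypothetical EF-style context; whether this becomes a single SQL JOIN
// depends on the provider and its version.
var productsWithMessages =
    (from product in dbContext.Products
     join message in dbContext.Messages
         on product.Id equals message.Id into msgs   // group join on Id
     select new { Product = product, Messages = msgs })
    .ToList();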

  • I need to have all records on the client side, so I prefer to join them in C#. Anyway, ToLookup solved my issue. Thanks :) – Babi Sep 01 '16 at 20:12