
We've got a slight performance issue in a section of code that's using LINQ, and it's raised a question about how LINQ works in terms of lookups.

My question is this (please note that I have changed all the code, so this is an indicative example, not the real scenario):

Given

public class Person {
    public int ID;
    public string Name;
    public DateTime Birthday;
    public int OrganisationID;
}

If I had a list of say 100k Person objects and then a list of dates, say 1000, and I ran this code:

var personBirthdays = from Person p in personList
    where p.OrganisationID == 123
    select p.Birthday;

foreach (DateTime d in dateList)
{
    if (personBirthdays.Contains(d))
        Console.WriteLine(string.Format("Date: {0} has a Birthday", d.ToShortDateString()));
}

The resultant code would amount to an iteration count of:

100k (the loop that needs to be done to find the users with the organisationID 123)
multiplied by
1000 (the amount of dates in the list)
multiplied by
x (the amount of users who have the organisationID 123 to be checked against for the date)

This is a lot of iterations!
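
To make my assumption explicit, I'm picturing the first snippet as being roughly equivalent to the hand-written loops below (just an illustrative sketch of my understanding of deferred execution, not the real code):

foreach (DateTime d in dateList)                     // 1,000 dates
{
    bool found = false;

    foreach (Person p in personList)                 // 100k people, re-scanned for every date
    {
        if (p.OrganisationID == 123 && p.Birthday == d)
        {
            found = true;                            // Contains stops at the first match
            break;
        }
    }

    if (found)
        Console.WriteLine(string.Format("Date: {0} has a Birthday", d.ToShortDateString()));
}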

If I changed the personBirthdays code to this:

List<DateTime> personBirthdays =
        (from Person p in personList
        where p.OrganisationID == 123
        select p.Birthday).ToList();

This should remove the 100k as a multiplier, and just do it once?

So you would have 100k + (1000 * x) instead of (100k * 1000 * x).
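
Again, just to spell out my assumption, I'm picturing the ToList() version as boiling down to something like this (an indicative sketch, not the real code):

// One pass over the 100k people, done once up front.
List<DateTime> personBirthdays = new List<DateTime>();
foreach (Person p in personList)
{
    if (p.OrganisationID == 123)
        personBirthdays.Add(p.Birthday);
}

// Then 1,000 dates, each scanning only the x matching birthdays.
foreach (DateTime d in dateList)
{
    if (personBirthdays.Contains(d))
        Console.WriteLine(string.Format("Date: {0} has a Birthday", d.ToShortDateString()));
}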

The problem is that this seems too easy, and I'm sure LINQ is doing something clever somewhere that means this doesn't happen.

If no one answers, I'll run some tests and report back.

Clarity update: we're not considering database lookups; the personList object is an in-memory list. This is all LINQ-to-Objects.

Martin
  • @HamletHakobyan Given the question he's asking, it doesn't matter. – Servy Dec 11 '12 at 16:05
  • Yes. In the first snippet you have `dateList.Count` iterations of `personBirthdays`. – Hamlet Hakobyan Dec 11 '12 at 16:10
  • Hey man, search for a solution that generates an IN query using EF by passing the `dateList` as a parameter; that will solve your problem inherently if your `dateList` is not very long. – Jahan Zinedine Dec 11 '12 at 16:12
  • @Jani We have no idea if he's using EF or not. He could be doing LINQ to Objects, or using another provider. – Servy Dec 11 '12 at 16:14
  • @Servy, if `personList` is `IQueryable`, `personBirthdays` will be translated to the appropriate context query, and in the `foreach` loop there will be `dateList.Count` queries to the database instead of 100k iterations over an `IEnumerable`. – Hamlet Hakobyan Dec 11 '12 at 16:15
  • And to all those who articulated the problem as `N+1`: **IT'S NOT THAT PROBLEM**. `N+1` is about eager loading and lazy loading of associations; the current problem is about iterating an IQueryable multiple times, which obviously causes the query to run multiple times. – Jahan Zinedine Dec 11 '12 at 16:17
  • @Servy and why `dateList.Count + 1` ? – Hamlet Hakobyan Dec 11 '12 at 16:17
  • @HamletHakobyan nevermind, nothing to see here. – Servy Dec 11 '12 at 16:18
  • Take a look at a similar problem: http://stackoverflow.com/questions/8107439/entity-framework-4-1-most-efficient-way-to-get-multiple-entities-by-primary-key – Jahan Zinedine Dec 11 '12 at 16:20
  • Whoever downvoted, please say why so I can correct it... – Martin Dec 11 '12 at 22:28

3 Answers


This should remove the 100k as a multiplier, and just do it once?

What it means is that instead of iterating over the 100k items of personList once per date in dateList (re-running the where and select operations for each of those 1,000 Contains calls), you'll be iterating the much smaller resulting List once per date, and the where and select operations will only have been performed on the underlying data source once.

The problem is that this seems too easy, and I'm sure LINQ is doing something clever somewhere that means this doesn't happen.

Nope. Your first query is simply something you shouldn't be doing with LINQ; you should be taking the results of the query and placing them into a data structure if you plan to iterate over them many times (which is what your change does).

You can improve this query even more by using the appropriate data structure. Searching a List is rather inefficient, as it needs to do a linear search. It would be preferable to use a HashSet to store the results of the query. A HashSet has O(1) search time in the average case, as opposed to the O(n) search time of a List.

var dates = new HashSet<DateTime>(from Person p in personList
                                  where p.OrganisationID == 123
                                  select p.Birthday);

foreach (DateTime d in dateList.Where(date => dates.Contains(date)))
{
    Console.WriteLine(string.Format("Date: {0} has a Birthday", d.ToShortDateString()));
}
Servy
  • Wouldn't your foreach actually be a worse way to do it? Since the Where would force an iteration, then another iteration needs to be done to print out the dates? Or would the dates.Contains do something fancy? Could there potentially be an `intersect` instead? Other than that, great answer... exactly what I was looking for. Now to find a way to tell the devs... – Martin Dec 11 '12 at 22:35
  • By putting the early query into a `HashSet` it eagerly evaluates that query and resolves it to the individual dates. Then, in the `foreach` it goes through each date in `dateList` and performs a `Contains` on that set. `Contains`, on a `HashSet`, is a *very* efficient operation. – Servy Dec 12 '12 at 01:00

This is a typical select N+1 problem, and by applying .ToList() you have partially solved it. The next step could be the following: you're constantly iterating over the personBirthdays list; replace it with a HashSet and you can perform Contains(d) much, much faster and remove duplicates:

var personBirthdays = new HashSet<DateTime>((from Person p in personList
    where p.OrganisationID == 123
    select p.Birthday).ToArray());
Akim
  • `HashSet`'s constructor takes an `IEnumerable`, there's no need for the `ToArray`. – Servy Dec 11 '12 at 16:04
  • Or use `.ToDictionary()` instead of `.ToList()` – Christopher Stevenson Dec 11 '12 at 16:04
  • @ChristopherStevenson But he doesn't logically have a key/value relationship; it's appropriate to use a HashSet instead of a Dictionary. It's a shame that they didn't include `ToHashSet` in LINQ, but you can make your own method trivially if you would like. – Servy Dec 11 '12 at 16:06
  • TaDa: `public static HashSet<T> ToHashSet<T>(this IEnumerable<T> src) { return new HashSet<T>(src); }` – Scott Chamberlain Dec 11 '12 at 16:12
  • The ToArray() is not only unnecessary, it's actually a huge performance hog. – Thomas Lindvall Dec 11 '12 at 16:13
  • @ThomasLindvall I don't know about *huge*. Since the resulting sequence is being immediately evaluated and put into a single in-memory data structure it's not like it's changing semantics of when it's evaluated, or that you could avoid putting it all in memory to begin with. It does hurt, just not *hugely* so. – Servy Dec 11 '12 at 16:16
  • @Servy try this: `var foo = new List<int>(); for (int i = 0; i < 100000; i++) { foo.Add(i); } var sw = Stopwatch.StartNew(); var foobar = foo.Where(f => f > 10000).ToArray(); sw.Stop(); Console.WriteLine(sw.Elapsed);` Compare with and without ToArray(). – Thomas Lindvall Dec 11 '12 at 16:45
  • @ThomasLindvall I took those 90,000 items and put them in a hash set and measured the difference in time between using the `IEnumerable` constructor vs adding a `ToArray` call. I then saved the difference in time in milliseconds over 1000 runs (ignoring the first few for jitter warmup), and the result was that using `ToArray` takes just over 1 millisecond longer. I don't think that 1 ms (on a 90k item set) qualifies as *huge*, but it's certainly something. Note that in your code removing `ToArray` means the sequence is never iterated; it's a common mistake in benchmarking with `IEnumerable`. (A sketch of this kind of comparison follows after these comments.) – Servy Dec 11 '12 at 17:01
  • @Servy Ah, a `Hashset` in this case. I learn something new every day. – Christopher Stevenson Dec 11 '12 at 20:39
  • @Servy your test result is quite interesting. I got about a 25% higher result for my slowest ToArray() compared to the slowest IEnumerable, though I only ran it about 100 times, taking the "top" 10 results when comparing. Your test seems a lot more precise, so I'm going to trust you here. – Thomas Lindvall Dec 12 '12 at 08:40
  • @ThomasLindvall When benchmarking you shouldn't be taking either just the best or just the worst case, unless you're specifically trying to target best- or worst-case times. Usually you want to ignore outliers and average the rest, if you want to focus on total throughput. A realtime system where all that matters is staying under X time might care more about the worst case than the average case, but that's rare. – Servy Dec 12 '12 at 14:55
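
For what it's worth, here is a rough, self-contained sketch of the kind of comparison Servy and Thomas describe in the comments above (illustrative only; the program structure and names are hypothetical, and single-run Stopwatch timings are noisy, so in practice you would repeat and average as Servy did):

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;

class HashSetConstructionComparison
{
    static void Main()
    {
        // Roughly the data set from the comments above: 100k ints, ~90k of which pass the filter.
        List<int> foo = new List<int>();
        for (int i = 0; i < 100000; i++)
            foo.Add(i);

        IEnumerable<int> query = foo.Where(f => f > 10000);

        // Feed the (deferred) query straight into the HashSet constructor...
        Stopwatch sw = Stopwatch.StartNew();
        HashSet<int> direct = new HashSet<int>(query);
        sw.Stop();
        Console.WriteLine("IEnumerable constructor:    " + sw.Elapsed);

        // ...versus materialising an intermediate array first.
        sw = Stopwatch.StartNew();
        HashSet<int> viaArray = new HashSet<int>(query.ToArray());
        sw.Stop();
        Console.WriteLine("ToArray() then constructor: " + sw.Elapsed);

        // Both sets contain the same elements; only the construction path differs.
        Console.WriteLine("Counts: " + direct.Count + " / " + viaArray.Count);
    }
}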

I'm assuming that you are referring to LINQ-to-Objects, as each LINQ provider has its own implementation (LINQ-to-SQL, LINQ-to-Entities, LINQ-to-XML, LINQ-to-anything).

Taking your example of personBirthdays, it is not a foregone conclusion that the expression was created for the purpose of iterating through the full result set, so LINQ cannot automatically materialize the results to an array or list.

These operations are very different:

personBirthdays.Distinct()
personBirthdays.FirstOrDefault(b => b.Month == 7)
personBirthdays.Select(b => b.Year).Distinct()

What LINQ as a technology does that is "clever" is to allow the construction of an expression tree and to defer execution. This is what prevents, in the third example above, 100k iterations to get the birthdays, then another 100k to select the year, and then a final, costly pass to assemble the distinct values.

The LINQ consumer (you) has to own the destiny of the expression. If you know that the result set will be iterated over multiple times, the onus is on you to materialize it into an array or list.
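
As a quick illustration of both points (a sketch only, reusing the question's personList; the deferred, materialized, and evaluations names are just for demonstration):

int evaluations = 0;

IEnumerable<DateTime> deferred = personList
    .Where(p => { evaluations++; return p.OrganisationID == 123; })
    .Select(p => p.Birthday);

deferred.Count();                                  // runs the predicate once per person
deferred.Count();                                  // and again for every person
Console.WriteLine(evaluations);                    // 2 * personList.Count

List<DateTime> materialized = deferred.ToList();   // one more full pass to materialize
Console.WriteLine(materialized.Count);             // reading the List costs nothing extra
Console.WriteLine(evaluations);                    // still 3 * personList.Count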

Jay
  • `it is not a foregone conclusion that the expression was created for the purpose of iterating` Well, actually, it is. It *must* implement `IEnumerable` given the context, and defining that enumeration is *exactly* what it means. `What LINQ as a technology does that is "clever" is to allow the construction of an expression tree and [...]` You stated LINQ to Objects, but for that there are no expressions or expression trees involved; that's only on the `IQueryable` side of things. It defers execution by holding onto references to iterators and delegates. – Servy Dec 11 '12 at 16:12
  • I'd like to thank you for your answer; however, Servy's was clearer on the iteration differences. – Martin Dec 11 '12 at 22:36