
Today I was looking at a section of the webapp I work on with a performance profiler. I thought that a Union was causing some delays, but I found other surprising results instead.

One of the causes of slowdown appeared to be FirstOrDefault.

It was a very simple LINQ query that looked like this:

foreach (Report r in reports)
{
    IDTOStudy study = studies.FirstOrDefault(s => s.StudyID == r.StudyID);
}

I created a small method to duplicate the behaviour I figured FirstOrDefault was doing.

private IDTOStudy GetMatchingStudy(Report report, IList<IDTOStudy> studies)
{
    foreach (var study in studies)
        if (study.StudyID == report.StudyID)
            return study;

    return null;
}

This method replaced the FirstOrDefault call, so the loop looked like this:

foreach (Report r in reports)
{
    IDTOStudy study = GetMatchingStudy(r, studies);
}

Running the new code under the performance profiler showed FirstOrDefault taking twice as long to complete as my new method. That was a shock to see.

I must be doing something incorrect with the FirstOrDefault() query. What is it?

Does FirstOrDefault() complete the entire query and then take the first element?

How can I speed this up and use FirstOrDefault()?

Edit 1:

One additional point I noticed is that the profiler says I'm maxing out my CPU on both of these implementations. That's also something I don't care for and didn't expect. The additional method I added didn't reduce that spike, just cut its duration in half.

Edit 3:

Putting the studies into a dictionary has improved the run time tremendously. It's definitely going to be how the committed code looks. Doesn't answer the question on FirstOrDefault though.
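For reference, the dictionary version looks roughly like this (assuming an int StudyID that is unique per study, since ToDictionary throws on duplicate keys):

```csharp
// Build the lookup once: O(studies.Count). Each report's lookup is then O(1)
// on average, instead of an O(studies.Count) scan per report.
Dictionary<int, IDTOStudy> studiesById = studies.ToDictionary(s => s.StudyID);

foreach (Report r in reports)
{
    IDTOStudy study;
    studiesById.TryGetValue(r.StudyID, out study); // study is null when there is
                                                   // no match, like FirstOrDefault
}
```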

Edit 2:

Here's the sample code requested in a simple console app. My run still shows that in most cases FirstOrDefault takes longer.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;

namespace TestCode
{
    public class Program
    {
        public List<IntHolder> list;

        public static void Main(string[] args)
        {
            var prog = new Program();
            prog.list = new List<IntHolder>();

            prog.Add50000Items();
            prog.list.Add(new IntHolder() { Num = 12345 });
            prog.Add50000Items();

            var stopwatch = new Stopwatch();
            stopwatch.Start();
            prog.list.FirstOrDefault(n => n.Num == 12345);
            stopwatch.Stop();

            Console.WriteLine("First run took: " + stopwatch.ElapsedTicks);
            var lookingFor = new IntHolder() { Num = 12345 };

            stopwatch.Reset();
            stopwatch.Start();
            prog.GetMatching(lookingFor);
            stopwatch.Stop();
            Console.WriteLine("Second run took: " + stopwatch.ElapsedTicks);
            Console.ReadLine();
        }

        public void Add50000Items()
        {
            var rand = new Random();

            for (int i = 0; i < 50000; i++)
                list.Add(new IntHolder() { Num = rand.Next(100000) });
        }

        public IntHolder GetMatching(IntHolder num)
        {
            foreach (var number in list)
                if (number.Num == num.Num)
                    return number;

            return null;
        }
    }

    public class IntHolder
    {
        public int Num { get; set; }
    }
}
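One caveat I'm aware of with this benchmark: the first timed call also pays the JIT-compilation cost for FirstOrDefault and the lambda. Adding warm-up calls before starting the Stopwatch should make the comparison fairer (an untested tweak):

```csharp
// Warm up both code paths first so JIT compilation isn't counted in the timings.
prog.list.FirstOrDefault(n => n.Num == -1);   // JITs FirstOrDefault and the lambda
prog.GetMatching(new IntHolder { Num = -1 }); // JITs GetMatching

// ...then start the Stopwatch and run the timed comparisons as above.
```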
Chris
  • If you are using Linq to SQL or EF, then `FirstOrDefault()` should simply generate `TOP 1` query – Sergey Berezovskiy Apr 18 '13 at 20:19
  • what is studies, is it an orm object (eg a dbset from Entity Framework or similar)? – undefined Apr 18 '13 at 20:21
  • @lazyberezovsky my suspicion is that the first example generates O(n) and the second O(1) db queries – undefined Apr 18 '13 at 20:22
  • My first instinct is that the error is in how you're profiling the code, rather than in how it's executing. Can you provide an example program that we can run that demonstrates the difference between `FirstOrDefault` and a custom method of yours along that same pattern? – Servy Apr 18 '13 at 20:26
  • I'll write something up and see if it's still an issue, Servy. For reference I'm using ANTS Performance Profiler 6 and attaching it to the .Net 4 Process. – Chris Apr 18 '13 at 20:36
  • LINQ-to-objects being a factor 2 slower than equivalent hand written code is standard. The extra cost is mostly due to additional indirect calls (delegates, interfaces) – CodesInChaos Apr 18 '13 at 20:50
  • Why don't you use appropriate datastructures for lookups like dictionaries? Both your algorithms are O(reports.Count*studies.Count). With proper datastructures this would become O(reports.Count) – CodesInChaos Apr 18 '13 at 20:52
  • CodesInChaos, that could be an excellent idea. I'm going to create two dictionaries with the study id as key and see if this speeds up. The reason I didn't is because this lookup is not done often and I wasn't sure if it was worth the additional overhead of creating the dictionaries. – Chris Apr 18 '13 at 21:07
  • CodesInChaos, thanks! Throwing the studies in a dictionary has improved the run time tremendously. Before I started it was about 30 seconds to load the page. I had it down to about 5. This does it in less than a second. – Chris Apr 18 '13 at 21:24
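For what it's worth, LINQ-to-objects' FirstOrDefault is, in essence, the same linear scan as the hand-written method, just with an indirect delegate call per element. A sketch of the idea (not the actual BCL source):

```csharp
using System;
using System.Collections.Generic;

static class EnumerableSketch
{
    // Roughly what Enumerable.FirstOrDefault(predicate) does: a linear scan that
    // invokes the predicate through a delegate for every element. The delegate
    // invocation and the IEnumerator<T> interface calls are the extra indirect
    // calls CodesInChaos describes, compared with a hand-written loop.
    public static T FirstOrDefaultSketch<T>(this IEnumerable<T> source, Func<T, bool> predicate)
    {
        foreach (var item in source)
            if (predicate(item))
                return item;
        return default(T); // null for reference types, 0 for int, etc.
    }
}
```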

1 Answer


What I think is happening (although it would be good to get a little extra info on your specific scenario; my assumption, based on your DTO classes, is that this is a database scenario) is the following:

foreach (Report r in reports)
{
    // A database query happens here for each report
    IDTOStudy study = studies.FirstOrDefault(s => s.StudyID == r.StudyID);
}


// The whole studies table is loaded into memory, which means you only do one
// DB query; the actual FirstOrDefault work is done in memory, which is quicker
// than going over the network.
private IDTOStudy GetMatchingStudy(Report report, IList<IDTOStudy> studies)
{
    foreach (var study in studies)
    if (study.StudyID == report.StudyID)
        return study;

    return null;
}

What this means is that in the second example you have optimized for database roundtrips (which is a good idea).
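If studies really is an un-materialized query source, forcing it into memory once collapses the per-report queries into a single round trip. A sketch, with studiesQuery standing in for your EF / LINQ to SQL source (a hypothetical name):

```csharp
// One database round trip: the whole studies table is pulled into memory up front.
List<IDTOStudy> studiesInMemory = studiesQuery.ToList();

foreach (Report r in reports)
{
    // FirstOrDefault now runs against LINQ-to-objects (in memory),
    // not against the database.
    IDTOStudy study = studiesInMemory.FirstOrDefault(s => s.StudyID == r.StudyID);
}
```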

You can prove this theory by checking out the database queries happening behind the scenes with something like SQL Profiler.

undefined
  • One note regarding such optimization - if you have 1 million studies in database, then loading all into memory is not good optimization – Sergey Berezovskiy Apr 18 '13 at 20:38
  • The DTO study list has been populated with a SQL query and then already been forced to List with ToList(). So I don't think I'm pounding the DB -- which was my initial thought as well. And to clarify, there's about 20,000 studies in the list. – Chris Apr 18 '13 at 20:40
  • @lazyberezovsky yeah absolutely, an even better solution to this problem (which would scale nicely) would be to do this entire thing database side, not application side – undefined Apr 18 '13 at 20:41