Step 1: Establish a business case
The first thing we need to do is ask "How fast does it need to be?", because if we don't know how fast it needs to be, we can't know when we're done. This isn't a technical decision; it's a business one. You need a stakeholder-centric measure of "Fast Enough" to aim for, and you need to bear in mind that Fast Enough is fast enough. We aren't looking for "As Fast As Possible" unless there's a business reason for it. Even then, we're normally looking for "As Fast As Possible Within Budget".
Since you're my stakeholder, and you don't seem to be too upset about the performance of your stored procedure, let's use that as a benchmark!
Step 2: Measure
The next thing we need to do is measure our system to see if we're Fast Enough.
Thankfully you've already measured (though we'll talk more about this later). Your stored procedure runs in 0.5 seconds! Is that Fast Enough? Yes it is! Job done!
There is no justification for continuing to spend your time (and your boss' money) fixing something that isn't broken. You probably have something better to be doing, so go do that! :D
Still here? Ok then. I'm not on the clock, people are badmouthing tech I like, and optimising Entity Framework queries is fun. Challenge Accepted!
Step 3: Inspect
So what's going on? Why is our query so slow?
To answer that question, I'm going to need to make some assumptions about your model:-
public class Foo
{
    public int Id { get; set; }
    public int BarId { get; set; }
    public virtual Bar Bar { get; set; }
}

public class Bar
{
    public int Id { get; set; }
    public string Value { get; set; }
    public virtual ICollection<Foo> Foos { get; set; }
}
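The snippets that follow run against a FooContext; here's a minimal sketch of what that might look like (I'm assuming vanilla EF6 code-first conventions - your real context will differ):-

using System.Data.Entity;

public class FooContext : DbContext
{
    // One DbSet per entity; EF infers the Foo -> Bar relationship
    // from the BarId foreign key and the navigation properties.
    public DbSet<Foo> Foos { get; set; }
    public DbSet<Bar> Bars { get; set; }
}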
Now that we've done that, we can have a look at the horrible query that Entity Framework is making for us:-
using (var context = new FooContext())
{
    context.Database.Log = s => Console.WriteLine(s);
    var query = context.Foos.FirstOrDefault(x => x.Id == 1).Bar.Value;
}
I can see from the log that TWO queries are being run:-
SELECT TOP (1)
    [Extent1].[Id] AS [Id],
    [Extent1].[BarId] AS [BarId]
FROM [dbo].[Foos] AS [Extent1]
WHERE 1 = [Extent1].[Id]

SELECT
    [Extent1].[Id] AS [Id],
    [Extent1].[Value] AS [Value]
FROM [dbo].[Bars] AS [Extent1]
WHERE [Extent1].[Id] = @EntityKeyValue1
Wait, what? Why is stupid Entity Framework making two round-trips to the database when all we need is one string?
Step 4: Analyze
Let's take a step back and look at our query again:-
var query = context.Foos.FirstOrDefault(x => x.Id == 1).Bar.Value;
Given what we know about Deferred Execution, what can we deduce is going on here?

What deferred execution basically means is that as long as you're working with an IQueryable, nothing actually happens - the query is built up in memory and not actually executed until later. This is useful for a number of reasons - in particular it lets us build up our queries in a modular fashion, then run the composed query once. Entity Framework would be pretty useless if context.Foos loaded the entire Foo table into memory immediately!

Our queries only get run when we ask for something other than an IQueryable, e.g. with .AsEnumerable(), .ToList(), or (especially) .GetEnumerator(). In this case .FirstOrDefault() doesn't return an IQueryable, so it triggers the database call much earlier than we presumably intended.
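To make that concrete, here's a minimal sketch (against the model above, with arbitrary filters) of where execution does and doesn't happen:-

using (var context = new FooContext())
{
    // Nothing hits the database here - we're just composing IQueryables.
    IQueryable<Foo> foos = context.Foos.Where(x => x.BarId == 1);
    IQueryable<int> ids = foos.Select(x => x.Id);

    // Still nothing - the composed query only exists in memory.
    ids = ids.Where(x => x > 0);

    // NOW a single composed query is sent to the database,
    // because ToList() enumerates the IQueryable.
    List<int> results = ids.ToList();
}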
The query we've made is basically saying:-
- Get the first Foo with Id == 1 (or null if there aren't any)
- Now Lazy Load that Foo's Bar
- Now tell me that Bar's Value
Wow! So not only are we making two round-trips to the database, we're also sending the entire Foo and Bar across the wire! That's not so bad when our entities are tiny like the contrived ones here, but what if they were larger, realistic ones?
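To see exactly where those round-trips happen, here's the original one-liner unrolled (the comments mark the points at which Entity Framework hits the database):-

using (var context = new FooContext())
{
    // Round-trip 1: .FirstOrDefault() doesn't return an IQueryable,
    // so the Foo query executes immediately.
    Foo foo = context.Foos.FirstOrDefault(x => x.Id == 1);

    // Round-trip 2: touching the virtual Bar navigation property
    // triggers lazy loading of the entire Bar entity.
    Bar bar = foo.Bar;

    // No round-trip: Value is already in memory.
    string value = bar.Value;
}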
Step 5: Optimize
As you've hopefully gleaned from the above, the first two rules of optimisation are 1) "Don't" and 2) "Measure first". The third rule of optimisation is "Avoid unnecessary work". An extra round-trip and a whole bunch of spurious data definitely count as "unnecessary", so let's do something about that:-
Attempt 1
The first thing we want to do is try the declarative approach: "Find me the value of the first Bar that has a Foo with Id == 1".
This is usually the clearest option from a maintainability point of view; the intent of the programmer is obviously captured. However, remembering that we want to delay execution as long as possible, let's pop the .FirstOrDefault() after the .Select():-
var query = context.Bars.Where(x => x.Foos.Any(y => y.Id == 1))
.Select(x => x.Value)
.FirstOrDefault();
SELECT TOP (1)
[Extent1].[Value] AS [Value]
FROM [dbo].[Bars] AS [Extent1]
WHERE EXISTS (SELECT
1 AS [C1]
FROM [dbo].[Foos] AS [Extent2]
WHERE ([Extent1].[Id] = [Extent2].[BarId]) AND (1 = [Extent2].[Id])
)
Attempt 2
In both SQL and most O/RMs, a useful trick is to make sure you're querying from the correct "end" of any given relationship. Sure, we're looking for a Bar, but we've got the Id of a Foo, so we can rewrite the query with that as a starting point: "Find me the Value of the Bar of the Foo with Id == 1":-
var query = context.Foos.Where(x => x.Id == 1)
.Select(x => x.Bar.Value)
.FirstOrDefault();
SELECT TOP (1)
[Extent2].[Value] AS [Value]
FROM [dbo].[Foos] AS [Extent1]
INNER JOIN [dbo].[Bars] AS [Extent2] ON [Extent1].[BarId] = [Extent2].[Id]
WHERE 1 = [Extent1].[Id]
Much better. Prima facie, these look preferable to both the original Entity-Framework-generated mess and the original stored procedure. Done!
Step 6: Measure
No! Just wait a minute! How do we know if we're Fast Enough? How do we even know if we're faster?
We measure!
And unfortunately you'll have to do this bit on your own. I can tell you that on my machine, on my network, simulating a realistic load for my application, the INNER JOIN is the fastest, followed by the two round-trips version (!!), followed by the WHERE EXISTS version, followed by the stored procedure. I can't tell you which will be fastest on your hardware, on your network, under a realistic load for your application.

I can tell you that I've made this exact performance optimisation over a dozen times, and that depending on the characteristics of the network, database server, and schema, I've seen all three of INNER JOIN, WHERE EXISTS, and two round-trips give the best performance.
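If you want a starting point for your own measurements, here's a rough sketch using Stopwatch (the warm-up call and iteration count are illustrative; a meaningful benchmark needs realistic data and load):-

using System;
using System.Diagnostics;

public static class CrudeBenchmark
{
    // Run a candidate query many times and report the average duration.
    public static TimeSpan Time(Action candidate, int iterations = 100)
    {
        candidate(); // Warm-up: JIT, EF model building, connection pooling.

        var stopwatch = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++)
        {
            candidate();
        }
        stopwatch.Stop();

        return TimeSpan.FromTicks(stopwatch.Elapsed.Ticks / iterations);
    }
}

You'd then wrap each candidate (and the stored procedure call) in an Action and compare the averages, e.g.:-

var innerJoin = CrudeBenchmark.Time(() =>
{
    using (var context = new FooContext())
    {
        var value = context.Foos.Where(x => x.Id == 1)
                                .Select(x => x.Bar.Value)
                                .FirstOrDefault();
    }
});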
However, I can't even tell you if any of these are Fast Enough. Depending on your needs you might need to hand-roll some hyper-optimised SQL and invoke a stored procedure. You might even need to go further and use a denormalised, read-optimised store. What about using an in-memory cache for your database results? What about using an output cache for your webserver? What if this query isn't even the bottleneck?
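For illustration, here's a minimal sketch of that in-memory cache option using System.Runtime.Caching (the cache key format and the five-minute expiry are arbitrary choices of mine, not recommendations):-

using System;
using System.Linq;
using System.Runtime.Caching;

public static class BarValueCache
{
    public static string GetBarValue(int fooId)
    {
        var key = "BarValue:" + fooId;

        // Serve from the in-process cache if we've fetched this recently.
        var cached = MemoryCache.Default.Get(key) as string;
        if (cached != null)
        {
            return cached;
        }

        // Cache miss: hit the database, then remember the result.
        using (var context = new FooContext())
        {
            var value = context.Foos.Where(x => x.Id == fooId)
                                    .Select(x => x.Bar.Value)
                                    .FirstOrDefault();

            if (value != null)
            {
                MemoryCache.Default.Set(key, value, DateTimeOffset.UtcNow.AddMinutes(5));
            }

            return value;
        }
    }
}

Of course, a cache like this trades freshness for speed, and whether that trade is acceptable is - once again - a business decision, not a technical one.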
Good performance isn't about speeding up Entity Framework queries. Good performance, like just about anything in our industry, is about knowing what's important to your customer, and figuring out the best way to get it.