Step 1: Establish a business case
The first thing we need to do is ask "How fast does it need to be?", because if we don't know how fast it needs to be, we can't know when we're done. This isn't a technical decision; it's a business one. You need a stakeholder-centric measure of "Fast Enough" to aim for, and you need to bear in mind that Fast Enough is fast enough. We aren't looking for "As Fast As Possible" unless there's a business reason for it. Even then, we're normally looking for "As Fast As Possible Within Budget".
Since you're my stakeholder, and you don't seem to be too upset about the performance of your stored procedure, let's use that as a benchmark!
Step 2: Measure
The next thing we need to do is measure our system to see if we're Fast Enough.
Thankfully you've already measured (though we'll talk more about this later). Your stored procedure runs in 0.5 seconds! Is that Fast Enough? Yes it is! Job done!
There is no justification for continuing to spend your time (and your boss' money) fixing something that isn't broken. You probably have something better to be doing, so go do that! :D
Still here? Ok then. I'm not on the clock, people are badmouthing tech I like, and optimising Entity Framework queries is fun. Challenge Accepted!
Step 3: Inspect
So what's going on? Why is our query so slow?
To answer that question, I'm going to need to make some assumptions about your model:-
public class Foo
{
    public int Id { get; set; }
    public int BarId { get; set; }
    public virtual Bar Bar { get; set; }
}

public class Bar
{
    public int Id { get; set; }
    public string Value { get; set; }
    public virtual ICollection<Foo> Foos { get; set; }
}
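The snippets that follow run against a FooContext; here's a minimal sketch of what that might look like (I'm assuming vanilla EF6 code-first conventions - your real context will differ):-

using System.Data.Entity;

public class FooContext : DbContext
{
    // One DbSet per entity; EF infers the Foo -> Bar relationship
    // from the BarId foreign key and the navigation properties.
    public DbSet<Foo> Foos { get; set; }
    public DbSet<Bar> Bars { get; set; }
}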
Now that we've done that, we can have a look at the horrible query that Entity Framework is making for us:-
using (var context = new FooContext())
{
    context.Database.Log = s => Console.WriteLine(s);
    var query = context.Foos.FirstOrDefault(x => x.Id == 1).Bar.Value;
}
I can see from the log that TWO queries are being run:-
SELECT TOP (1)
    [Extent1].[Id] AS [Id],
    [Extent1].[BarId] AS [BarId]
FROM [dbo].[Foos] AS [Extent1]
WHERE 1 = [Extent1].[Id]

SELECT
    [Extent1].[Id] AS [Id],
    [Extent1].[Value] AS [Value]
FROM [dbo].[Bars] AS [Extent1]
WHERE [Extent1].[Id] = @EntityKeyValue1
Wait, what? Why is stupid Entity Framework making two round-trips to the database when all we need is one string?
Step 4: Analyze
Let's take a step back and look at our query again:-
var query = context.Foos.FirstOrDefault(x => x.Id == 1).Bar.Value;
Given what we know about Deferred Execution, what can we deduce is going on here?

What deferred execution basically means is that as long as you're working with an IQueryable, nothing actually happens - the query is built up in memory and not actually executed until later. This is useful for a number of reasons - in particular it lets us build up our queries in a modular fashion, then run the composed query once. Entity Framework would be pretty useless if context.Foos loaded the entire Foo table into memory immediately!

Our queries only get run when we ask for something other than an IQueryable, e.g. with .AsEnumerable(), .ToList(), or (especially) .GetEnumerator(). In this case .FirstOrDefault() doesn't return an IQueryable, so it triggers the database call much earlier than we presumably intended.
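To make that concrete, here's a minimal sketch (against the model above, with arbitrary filters) of where execution does and doesn't happen:-

using (var context = new FooContext())
{
    // Nothing hits the database here - we're just composing IQueryables.
    IQueryable<Foo> foos = context.Foos.Where(x => x.BarId == 1);
    IQueryable<int> ids = foos.Select(x => x.Id);

    // Still nothing - the composed query only exists in memory.
    ids = ids.Where(x => x > 0);

    // NOW a single composed query is sent to the database,
    // because ToList() enumerates the IQueryable.
    List<int> results = ids.ToList();
}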
The query we've made is basically saying:-
- Get the first Foo with Id == 1 (or null if there aren't any)
- Now Lazy Load that Foo's Bar
- Now tell me that Bar's Value
Wow! So not only are we making two round-trips to the database, we're also sending the entire Foo and Bar across the wire! That's not so bad when our entities are tiny like the contrived ones here, but what if they were larger, realistic ones?
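To see exactly where those round-trips happen, here's the original one-liner unrolled (the comments mark the points at which Entity Framework hits the database):-

using (var context = new FooContext())
{
    // Round-trip 1: .FirstOrDefault() doesn't return an IQueryable,
    // so the Foo query executes immediately.
    Foo foo = context.Foos.FirstOrDefault(x => x.Id == 1);

    // Round-trip 2: touching the virtual Bar navigation property
    // triggers lazy loading of the entire Bar entity.
    Bar bar = foo.Bar;

    // No round-trip: Value is already in memory.
    string value = bar.Value;
}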
Step 5: Optimize
As you've hopefully gleaned from the above, the first two rules of optimisation are 1) "Don't" and 2) "Measure first". The third rule of optimisation is "Avoid unnecessary work". An extra round-trip and a whole bunch of spurious data definitely count as "unnecessary", so let's do something about that:-
Attempt 1
The first thing we want to do is try the declarative approach: "Find me the value of the first Bar that has a Foo with Id == 1".
This is usually the clearest option from a maintainability point of view; the intent of the programmer is obviously captured. However, remembering that we want to delay execution as long as possible, let's pop the .FirstOrDefault() after the .Select():-
var query = context.Bars.Where(x => x.Foos.Any(y => y.Id == 1))
.Select(x => x.Value)
.FirstOrDefault();
SELECT TOP (1)
[Extent1].[Value] AS [Value]
FROM [dbo].[Bars] AS [Extent1]
WHERE EXISTS (SELECT
1 AS [C1]
FROM [dbo].[Foos] AS [Extent2]
WHERE ([Extent1].[Id] = [Extent2].[BarId]) AND (1 = [Extent2].[Id])
)
Attempt 2
In both SQL and most O/RMs, a useful trick is to make sure you're querying from the correct "end" of any given relationship. Sure, we're looking for a Bar, but we've got the Id of a Foo, so we can rewrite the query with that as a starting point: "Find me the Value of the Bar of the Foo with Id == 1":-
var query = context.Foos.Where(x => x.Id == 1)
.Select(x => x.Bar.Value)
.FirstOrDefault();
SELECT TOP (1)
[Extent2].[Value] AS [Value]
FROM [dbo].[Foos] AS [Extent1]
INNER JOIN [dbo].[Bars] AS [Extent2] ON [Extent1].[BarId] = [Extent2].[Id]
WHERE 1 = [Extent1].[Id]
Much better. Prima facie, these look preferable to both the original Entity-Framework-generated mess and the original stored procedure. Done!
Step 6: Measure
No! Just wait a minute! How do we know if we're Fast Enough? How do we even know if we're faster?
We measure!
And unfortunately you'll have to do this bit on your own. I can tell you that on my machine, on my network, simulating a realistic load for my application, the INNER JOIN is the fastest, followed by the two round-trips version (!!), followed by the WHERE EXISTS version, followed by the stored procedure. I can't tell you which will be fastest on your hardware, on your network, under a realistic load for your application.

I can tell you that I've made this exact performance optimisation over a dozen times, and that depending on the characteristics of the network, database server, and schema, I've seen all three of INNER JOIN, WHERE EXISTS, and two round-trips give the best performance.
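If you want a starting point for your own measurements, here's a rough sketch using Stopwatch (the warm-up call and iteration count are illustrative; a meaningful benchmark needs realistic data and load):-

using System;
using System.Diagnostics;

public static class CrudeBenchmark
{
    // Run a candidate query many times and report the average duration.
    public static TimeSpan Time(Action candidate, int iterations = 100)
    {
        candidate(); // Warm-up: JIT, EF model building, connection pooling.

        var stopwatch = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++)
        {
            candidate();
        }
        stopwatch.Stop();

        return TimeSpan.FromTicks(stopwatch.Elapsed.Ticks / iterations);
    }
}

You'd then wrap each candidate (and the stored procedure call) in an Action and compare the averages, e.g.:-

var innerJoin = CrudeBenchmark.Time(() =>
{
    using (var context = new FooContext())
    {
        var value = context.Foos.Where(x => x.Id == 1)
                                .Select(x => x.Bar.Value)
                                .FirstOrDefault();
    }
});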
However, I can't even tell you if any of these are Fast Enough. Depending on your needs you might need to hand-roll some hyper-optimised SQL and invoke a stored procedure. You might even need to go further and use a denormalised, read-optimised store. What about using an in-memory cache for your database results? What about using an output cache for your webserver? What if this query isn't even the bottleneck?
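For illustration, here's a minimal sketch of that in-memory cache option using System.Runtime.Caching (the cache key format and the five-minute expiry are arbitrary choices of mine, not recommendations):-

using System;
using System.Linq;
using System.Runtime.Caching;

public static class BarValueCache
{
    public static string GetBarValue(int fooId)
    {
        var key = "BarValue:" + fooId;

        // Serve from the in-process cache if we've fetched this recently.
        var cached = MemoryCache.Default.Get(key) as string;
        if (cached != null)
        {
            return cached;
        }

        // Cache miss: hit the database, then remember the result.
        using (var context = new FooContext())
        {
            var value = context.Foos.Where(x => x.Id == fooId)
                                    .Select(x => x.Bar.Value)
                                    .FirstOrDefault();

            if (value != null)
            {
                MemoryCache.Default.Set(key, value, DateTimeOffset.UtcNow.AddMinutes(5));
            }

            return value;
        }
    }
}

Of course, a cache like this trades freshness for speed, and whether that trade is acceptable is - once again - a business decision, not a technical one.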
Good performance isn't about speeding up Entity Framework queries. Good performance, like just about anything in our industry, is about knowing what's important to your customer, and figuring out the best way to get it.