
I have been debugging some slow code and it seems that the culprit is the EF code posted below. It takes 4-5 seconds when the query is evaluated at a later stage. I'm trying to get it to run in under 1 second.

I have tested this using SQL Server Profiler, and it seems that a bunch of SQL statements are executed. It also confirms that it takes 3-4 seconds before SQL Server is done executing them.

I have read other similar questions about the use of Include() and it does seem that there is a performance penalty when using it. I've tried to split the below code into several different queries, but it's not making much of a difference.

Any idea how I can get the below to execute faster?

Currently the web app I'm working on is just showing an empty iframe while waiting for the below to complete. If I cannot get faster execution time I have to split it up and partially load the iframe with data or go with another asynchronous solution. Any ideas here would also be appreciated!

using (var scope = new TransactionScope(TransactionScopeOption.Required,
    new TransactionOptions { IsolationLevel = System.Transactions.IsolationLevel.ReadUncommitted }))
{
    formInstance = context.FormInstanceSet
        .Includes(x => x.Include(fi => fi.FormDefinition).Include(fd => fd.FormSectionDefinitions).Include(fs => fs.FormStateDefinitionEditableSections))
        .Includes(x => x.Include(fi => fi.FormDefinition).Include(fd => fd.FormStateDefinitions))
        .Includes(x => x.Include(fi => fi.FormSectionInstances).Include(fs => fs.FormFieldInstances).Include(ff => ff.FormFieldDefinition).Include(ffd => ffd.FormFieldMetaDataDefinition).Include(ffmdd => ffmdd.ComplexTypePropertyNames))
        .Include(x => x.CurrentFormStateInstance)
        .Include(x => x.Files)
        .FirstOrDefault(x => x.FormInstanceIdentifier == formInstanceIdentifier);

    scope.Complete();
}
Force444
  • What is Includes (with an 's')? Is it some proprietary extension? – Michael Freidgeim Feb 16 '18 at 21:38
  • @MichaelFreidgeim I believe it was an extension method to chain include on include if that makes sense. I don't have access to the code anymore so I can't tell you exactly what it was but that is my best guess from what I remember. – Force444 Feb 17 '18 at 22:16
  • Instead of nested `Include`, you can use the `ThenInclude` method to make the code cleaner. – Onurkan Bakırcı Mar 12 '23 at 11:18
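As the last comment suggests, EF Core's built-in ThenInclude can replace the custom Includes helper. A sketch of the question's query rewritten that way (untested, since the original model isn't available; entity and property names are taken from the question):

```csharp
// EF Core: ThenInclude continues the include chain from the previously
// included navigation, replacing the custom Includes() extension.
formInstance = context.FormInstanceSet
    .Include(fi => fi.FormDefinition)
        .ThenInclude(fd => fd.FormSectionDefinitions)
        .ThenInclude(fs => fs.FormStateDefinitionEditableSections)
    .Include(fi => fi.FormDefinition)
        .ThenInclude(fd => fd.FormStateDefinitions)
    .Include(fi => fi.FormSectionInstances)
        .ThenInclude(fs => fs.FormFieldInstances)
        .ThenInclude(ff => ff.FormFieldDefinition)
        .ThenInclude(ffd => ffd.FormFieldMetaDataDefinition)
        .ThenInclude(ffmdd => ffmdd.ComplexTypePropertyNames)
    .Include(x => x.CurrentFormStateInstance)
    .Include(x => x.Files)
    .FirstOrDefault(x => x.FormInstanceIdentifier == formInstanceIdentifier);
```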

5 Answers


tl;dr Multiple Includes blow up the SQL result set. Soon it becomes cheaper to load data by multiple database calls instead of running one mega statement. Try to find the best mixture of Include and Load statements.

it does seem that there is a performance penalty when using Include

That's an understatement! Multiple Includes quickly blow up the SQL query result both in width and in length. Why is that?

Growth factor of Includes

(This part applies to Entity Framework classic, v6 and earlier.)

Let's say we have

  • root entity Root
  • parent entity Root.Parent
  • child entities Root.Children1 and Root.Children2
  • a LINQ statement Root.Include("Parent").Include("Children1").Include("Children2")

This builds a SQL statement that has the following structure:

SELECT *, <PseudoColumns>
FROM Root
JOIN Parent
JOIN Children1

UNION

SELECT *, <PseudoColumns>
FROM Root
JOIN Parent
JOIN Children2

These <PseudoColumns> consist of expressions like CAST(NULL AS int) AS [C2], and they serve to give all UNION-ed queries the same number of columns. The first part adds pseudo columns for Children2, the second part adds pseudo columns for Children1.

This is what it means for the size of the SQL result set:

  • Number of columns in the SELECT clause is the sum of all columns in the four tables
  • The number of rows is the sum of records in included child collections

Since the total number of data points is columns * rows, each additional Include increases the total number of data points in the result set quadratically. Let me demonstrate that by taking Root again, now with an additional Children3 collection. If all tables have 5 columns and 100 rows, we get:

  • One Include (Root + 1 child collection): 10 columns * 100 rows = 1000 data points.
  • Two Includes (Root + 2 child collections): 15 columns * 200 rows = 3000 data points.
  • Three Includes (Root + 3 child collections): 20 columns * 300 rows = 6000 data points.

With 12 Includes this would amount to 78000 data points!

Conversely, if you get all records for each table separately instead of using 12 Includes, you have 13 * 5 * 100 = 6500 data points: less than 10%!

Now these numbers are somewhat exaggerated in that many of these data points will be null, so they don't contribute much to the actual size of the result set that is sent to the client. But the query size and the task for the query optimizer certainly get affected negatively by increasing numbers of Includes.

Balance

So using Includes is a delicate balance between the cost of database calls and data volume. It's hard to give a rule of thumb, but by now you can imagine that the data volume generally outgrows the cost of extra calls quickly once there are more than ~3 Includes for child collections (parent Includes, which only widen the result set, tolerate quite a few more).

Alternative

The alternative to Include is to load data in separate queries:

context.Configuration.LazyLoadingEnabled = false;
var rootId = 1;
// Load the child collections into the context's cache first...
context.Children1.Where(c => c.RootId == rootId).Load();
context.Children2.Where(c => c.RootId == rootId).Load();
// ...then fetch the root; relationship fixup populates its navigation properties.
return context.Roots.Find(rootId);

This loads all required data into the context's cache. During this process, EF executes relationship fixup, by which it auto-populates navigation properties (Root.Children etc.) with the loaded entities. The end result is identical to the statement with Includes, except for one important difference: the child collections are not marked as loaded in the entity state manager, so EF will try to trigger lazy loading if you access them. That's why it's important to turn off lazy loading.

In reality, you will have to figure out which combination of Include and Load statements works best for you.

Other aspects to consider

Each Include also increases query complexity, so the database's query optimizer will have to make increasingly more effort to find the best query plan. At some point this may no longer succeed. Also, when some vital indexes are missing (esp. on foreign keys) performance may suffer by adding Includes, even with the best query plan.
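For instance, a missing foreign-key index can be declared directly on the entity in EF 6.1+ (a sketch, not part of the original answer; the Child/RootId names are hypothetical and reuse the example above):

```csharp
using System.ComponentModel.DataAnnotations.Schema;

public class Child
{
    public int Id { get; set; }

    // EF 6.1+: IndexAttribute makes the migration create a non-clustered
    // index on the FK column, so the JOINs produced by Include can seek
    // instead of scan.
    [Index("IX_Child_RootId")]
    public int RootId { get; set; }

    public virtual Root Root { get; set; }
}
```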

Entity Framework core

Cartesian explosion

For some reason, the behavior described above, UNION-ed queries, was abandoned as of EF Core 3. It now builds one query with joins. When the query is "star" shaped1 this leads to a Cartesian explosion (in the SQL result set). I can only find a note announcing this breaking change, but it doesn't say why.

Split queries

To counter this Cartesian explosion, Entity Framework core 5 introduced the concept of split queries that enables loading related data in multiple queries. It prevents building one massive, multiplied SQL result set. Also, because of lower query complexity, it may reduce the time it takes to fetch data even with multiple roundtrips. However, it may lead to inconsistent data when concurrent updates occur.
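A minimal sketch of the EF Core 5+ API, reusing the Root/Children example from above (AsSplitQuery and UseQuerySplittingBehavior are the real method names; the entities are illustrative):

```csharp
// Opt in per query: EF Core issues one SQL statement per included
// collection instead of one multiplied join result.
var root = context.Roots
    .Include(r => r.Children1)
    .Include(r => r.Children2)
    .AsSplitQuery()
    .FirstOrDefault(r => r.Id == rootId);

// Or make split queries the default for the whole context:
// optionsBuilder.UseSqlServer(connectionString,
//     o => o.UseQuerySplittingBehavior(QuerySplittingBehavior.SplitQuery));
```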


1 Multiple 1:n relationships off the query root.

Gert Arnold
  • Question about EF Core, is the cartesian explosion worse performance wise? I mean, is the rule of thumb of 3 includes before splitting about the same for EF Core from your experience? – Nicolas Belley Sep 15 '21 at 13:27
  • @NicolasBelley Depends very much on the data. Two includes of large child collections already have the potential to explode massively. Imagine they're both 10,000 items... – Gert Arnold Sep 15 '21 at 13:36
  • .AsSplitQuery() did the trick for me. – Sanjeev Rai Jul 27 '22 at 18:20

With .NET Core 5 I used this solution:

_context.ChangeTracker.LazyLoadingEnabled = false;
_context.ChangeTracker.AutoDetectChangesEnabled = false;

var mainObj = _context.MarinzonServiceItems.Where(filter);
var returnQuery = mainObj.Include(x => x.Service);

// Each Load() runs as a separate query; EF's relationship fixup
// links the loaded entities to the tracked results.
returnQuery.Include(x => x.User).Load();
returnQuery.Include(x => x.Category).Load();
returnQuery.Include(x => x.FAQQuestions).Load();
returnQuery.Include(x => x.FAQServices).Load();
returnQuery.Include(x => x.ServiceItemServices.Where(s => s.IsActive == true))
           .ThenInclude(x => x.ServiceItemServicePrices).Load();

return returnQuery;
Ahsan Ismail
  • Can you elaborate on this answer please? What is the resulting SQL in comparison to the regular context.object.Include().Include().Include() for example. What are the speed gains you found? Isn't .AsNoTracking() the same as turning off .AutoDetectChangesEnabled ? – Kevin Dark May 26 '22 at 08:57
  • @KevinDark I am also wondering. I am trying to use ThenInclude, but it is really slow. Can anyone redirect me to any post with best practice. – Micheal Choudhary May 26 '22 at 11:49

I had a similar issue with a query that had 15+ Include statements and generated a result set of 2M+ rows in 7 minutes.

The solution that worked for me was:

  1. Disabled lazy loading
  2. Disabled auto detect changes
  3. Split the big query into small chunks

A sample can be found below:

public IQueryable<CustomObject> PerformQuery(int id)
{
    ctx.Configuration.LazyLoadingEnabled = false;
    ctx.Configuration.AutoDetectChangesEnabled = false;

    IQueryable<CustomObject> customObjectQueryable = ctx.CustomObjects.Where(x => x.Id == id);

    var selectQuery = customObjectQueryable.Select(x => x.YourObject)
                                           .Include(c => c.YourFirstCollection)
                                           .Include(c => c.YourFirstCollection.OtherCollection)
                                           .Include(c => c.YourSecondCollection);

    var otherObjects = customObjectQueryable.SelectMany(x => x.OtherObjects);

    selectQuery.FirstOrDefault();
    otherObjects.ToList();

    return customObjectQueryable;
}

IQueryable is needed so that all the filtering happens on the server side; with IEnumerable the filtering would run in memory, which is very time-consuming. Entity Framework then fixes up the associations in memory.

Andrei Petrut

Do you have correctly configured relationships between all the entities that you try to 'include'? If at least one entity does not have a relationship to some of the other entities, EF will not be able to construct one complex query using SQL join syntax - instead it will execute as many queries as you have 'includes'. And of course, that will lead to performance issues. Could you please post the exact query (or queries) that EF generates to get the data?

drcolombo
  • There are about 20-30! Does this indicate a problem with the query? – Force444 Jan 11 '16 at 14:50
  • I believe, that means you don't have proper foreign keys in the database or not correctly configured relations during your DbContext creation in case of Model-First approach (OnModelCreating method). Ideally, if you have all the relations properly configured, you will have the only query with a lot of joins. So, that's not a problem with a query, rather than with the database structure. – drcolombo Jan 11 '16 at 15:01
  • I have noticed that foreign keys are not defined at all between some of the tables. So it has a big impact on the actual SQL that is generated? – Force444 Jan 11 '16 at 15:17
  • Is there anywhere I can read about this? I.e. missing foreign keys leading to excessive generated sql from by EF ? BTW the webapp has been made using DB first. – Force444 Jan 11 '16 at 15:22
  • Try to read [this](http://www.entityframeworktutorial.net/entity-relationships.aspx). In case of DatabaseFirst, having proper foreign keys is extremely important. Otherwise EF will not be able to get the relations between entities and, as a result, will not be able to build a proper query. Example: you have User and Group tables without a FK between them. Then, using Include statement in C#, how will EF know that say, UserA is a member of Group1? It will just generate two queries: one to get all the users, another one to get all the groups... – drcolombo Jan 11 '16 at 15:33
  • Another suggestion (as you said you don't have FKs between some of the tables but that means you have them between others) - try to use Include() only for tables that have FKs between them. In this case EF should generate only one query. If that doesn't help also - try to fix the model in your edmx adding relations there. – drcolombo Jan 11 '16 at 15:35
  • It seems that in the edmx the relations are defined correctly, but looking at the DB itself i.e. in SQL Management Studio, DB -> table -> keys there are no foreign keys defined for some tables but for others there are... Can this inconsistency be a problem too? Or is it sufficient to have your relations defined in the edmx only ? – Force444 Jan 11 '16 at 15:44
  • Try to make a query with 2 or max 3 tables (using include statement), which have relationships defined. If in the end you'll have only one SQL query - try to add other tables one by one, determining the one that causes the problems. Otherwise, review your edmx - maybe relationships are not correctly defined (I mean they don't reflect the real structure of the database). Basically, if relationships are correctly defined, the include statement will generate only one SQL query with appropriate joins. – drcolombo Jan 13 '16 at 13:19
  1. Get rid of that distributed transaction at the start of your query. It's literally killing your application by promoting a simple transaction to a distributed transaction.
  2. Using ReadUncommitted transaction scopes makes your application vulnerable to dirty reads.
  3. You need to use AsNoTracking if you're not updating these entities. Tracking creates proxies, and EF logic for creating proxies significantly slows down performance.
  4. If it's still slow, and your database is SQL Server, grab the query in SQL Profiler and run it in SSMS with SET STATISTICS TIME ON; SET STATISTICS IO ON; and paste the Output window text into statisticsparser.com
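Point 3 applied to the question's query might look like this (a sketch; AsNoTracking is the real EF method, the entity names come from the question, and the include list is trimmed for brevity):

```csharp
// Read-only query: AsNoTracking skips change tracking and proxy
// creation entirely, which can noticeably speed up large reads.
formInstance = context.FormInstanceSet
    .AsNoTracking()
    .Include(x => x.CurrentFormStateInstance)
    .Include(x => x.Files)
    .FirstOrDefault(x => x.FormInstanceIdentifier == formInstanceIdentifier);
```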
John Zabroski