23

I'm building a Stack Overflow clone in my own time to learn EF6 and MVC5; I'm currently using OWIN for authentication.

Everything works fine when I have around 50-60 questions. I used the Red Gate data generator to ramp it up to 1 million questions, plus a couple of thousand child-table rows without relationships, just to 'stress' the ORM a bit. Here's what the LINQ looks like:

var query = ctx.Questions
               .AsNoTracking()     //read-only performance boost.. http://visualstudiomagazine.com/articles/2010/06/24/five-tips-linq-to-sql.aspx
               .Include("Attachments")                                
               .Include("Location")
               .Include("CreatedBy") //IdentityUser
               .Include("Tags")
               .Include("Upvotes")
               .Include("Upvotes.CreatedBy")
               .Include("Downvotes")
               .Include("Downvotes.CreatedBy")
               .AsQueryable();

if (string.IsNullOrEmpty(sort)) //default
{
    query = query.OrderByDescending(x => x.CreatedDate);
}
else
{
    sort = sort.ToLower();
    if (sort == "latest")
    {
        query = query.OrderByDescending(x => x.CreatedDate);
    }
    else if (sort == "popular")
    {
        //most viewed
        query = query.OrderByDescending(x => x.ViewCount);
    }
}

var complaints = query.Skip(skipCount)
                      .Take(pageSize)
                      .ToList(); //makes an evaluation..

Needless to say, I'm getting SQL timeouts. After installing MiniProfiler and looking at the generated SQL statement, it's monstrous: a few hundred lines long.

I know I'm joining/including too many tables, but in how many real-life projects do we only ever join 1 or 2 tables? There might be situations where we have to do this many joins against multi-million-row tables; is going with stored procedures the only way?

If that's the case, would EF itself only be suitable for small-scale projects?

Lee Gary
  • @ken2k can you help look at the query? – Lee Gary Mar 04 '14 at 01:41
  • The size of the SQL statement is not what causes the query to be inefficient. How are the tables designed? Do you have them indexed properly? – Phil Sandler Mar 04 '14 at 02:02
  • I'm using code-first, so the indexes are created by EF – Lee Gary Mar 04 '14 at 02:06
  • Are you saying that EF automatically creates indexes for you? I'm not aware of any such feature in EF. – Phil Sandler Mar 04 '14 at 02:18
  • Yes it did, I can even see it in my migrations folder – Lee Gary Mar 04 '14 at 02:26
  • I think you are using too many Include statements. You could try removing some of them and retrieving the specific ones in a separate call. On the other hand, why don't you use .Include("Upvotes.CreatedBy") instead of .Include("Upvotes").Include("Upvotes.CreatedBy")? Is CreatedBy an entity or just a property? – JMGH Mar 04 '14 at 03:05
  • @JMGH .Include("Upvotes.CreatedBy") is a virtual property which inherits ApplicationUser, which in turn inherits Microsoft.AspNet.Identity.EntityFramework.IdentityUser. If I don't include them it throws an "object already disposed" exception – Lee Gary Mar 04 '14 at 03:16
  • As per @PhilSandler's advice, I'm now trying SQL Server's Database Engine Tuning Advisor. It suggested creating some indexes; I excluded partitioning since most of the data is dummy/generated. It did help tremendously, since I'm not hitting SQL timeouts as often, but tuning SQL requires specialized skillsets. Am I safe enough to go ahead and accept its advice? – Lee Gary Mar 04 '14 at 03:21
  • I think the best way to resolve this is to create an SP and call it / cast it using EF/LINQ – CallumVass Mar 04 '14 at 09:13
  • If the point of the exercise is to learn EF and MVC, you can probably just apply the DTA's advice. If this were intended to be a production system, tuning would require a bit more manual analysis. – Phil Sandler Mar 04 '14 at 12:32
  • If this is the most important query in your app and you need performance, create a stored procedure and EF will map the results to objects. That way you'll be able to optimize the sql statement as much as you want. EF is great but sometimes it needs extra help :) – Francisco Goldenstein Aug 07 '14 at 19:06

5 Answers

20

Most likely the problem you are experiencing is a Cartesian product.

Based on just some sample data:

var query = ctx.Questions // 50 
  .Include("Attachments") // 20                                
  .Include("Location") // 10
  .Include("CreatedBy") // 5
  .Include("Tags") // 5
  .Include("Upvotes") // 5
  .Include("Upvotes.CreatedBy") // 5
  .Include("Downvotes") // 5
  .Include("Downvotes.CreatedBy") // 5

  // Where Blah
  // Order By Blah

This returns a number of rows upwards of

50 x 20 x 10 x 5 x 5 x 5 x 5 x 5 x 5 = 156,250,000

Seriously... that is an INSANE number of rows to return.

You really have two options if you are having this issue:

First: the easy way. Rely on Entity Framework to wire up the models automagically as they enter the context; afterwards, use the entities AsNoTracking() and dispose of the context.

// Continuing with the query above:

var questions   = query.ToList();                             // one request: the questions
var attachments = query.Select(q => q.Attachments).ToList();  // one request: their attachments
var locations   = query.Select(q => q.Location).ToList();     // one request: their locations

This makes one request per table, but instead of 156 MILLION rows you only download around 110 rows. The cool part is that they are all wired up in the EF context cache, so the questions variable is now completely populated.

Second: Create a stored procedure that returns multiple tables and have EF materialize the classes.
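For illustration, here is a minimal EF6 sketch of that approach, following the documented "stored procedures with multiple result sets" pattern. The procedure name dbo.GetQuestionPage and the entity set names are assumptions, not part of the OP's model; ctx is the context from the query above:

using System.Data;
using System.Data.Entity.Core.Objects;
using System.Data.Entity.Infrastructure;
using System.Linq;

// Hypothetical stored procedure returning two result sets:
// the page of questions, then their attachments.
using (var cmd = ctx.Database.Connection.CreateCommand())
{
    cmd.CommandText = "dbo.GetQuestionPage";
    cmd.CommandType = CommandType.StoredProcedure;

    ctx.Database.Connection.Open();
    using (var reader = cmd.ExecuteReader())
    {
        var objectContext = ((IObjectContextAdapter)ctx).ObjectContext;

        // First result set -> Question entities ("Questions" is the assumed entity set name).
        var questions = objectContext
            .Translate<Question>(reader, "Questions", MergeOption.AppendOnly)
            .ToList();

        // Second result set -> Attachment entities.
        reader.NextResult();
        var attachments = objectContext
            .Translate<Attachment>(reader, "Attachments", MergeOption.AppendOnly)
            .ToList();
    }
}

Because the materialized entities are tracked (AppendOnly), EF can fix up the Question/Attachment navigation properties for you.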

New Third: EF now supports splitting queries as above while keeping the nice .Include() methods. Split queries do have a few gotchas, so I recommend reading all the documentation.

Example from the above link:

If a typical blog has multiple related posts, rows for these posts will duplicate the blog's information. This duplication leads to the so-called "cartesian explosion" problem.

using (var context = new BloggingContext())
{
    var blogs = context.Blogs
        .Include(blog => blog.Posts)
        .AsSplitQuery()
        .ToList();
}

It will produce the following SQL:

SELECT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url]
FROM [Blogs] AS [b]
ORDER BY [b].[BlogId]

SELECT [p].[PostId], [p].[AuthorId], [p].[BlogId], [p].[Content], [p].[Rating], [p].[Title], [b].[BlogId]
FROM [Blogs] AS [b]
INNER JOIN [Post] AS [p] ON [b].[BlogId] = [p].[BlogId]
ORDER BY [b].[BlogId]
Erik Philips
  • I still don't understand; it only returns 50 rows at most and only joins about 9 tables. Why should it return something like 156,250,000 rows as you said? And the OP's code does Skip/Take too, it never returns the whole table. – hakuna1811 May 04 '16 at 07:50
  • This answer is a bit off - a Cartesian product would be a cross join of all the related tables, but the associations will be loaded here using inner joins, which means the maximum number of rows returned is the set with the greatest number of rows after intersection with related sets. However, the resulting query will contain repeated columns and data for every joined entity, which in the above query is a MASSIVE amount of redundant data. It's best to load the associations as separate queries beyond 1-2 includes. – Sam May 12 '16 at 12:10
  • I've been looking for the best optimization for a while, but this example (query.Select) + explanation helped me those most by far – Daniël Camps Nov 03 '17 at 12:12
  • @Sam Actually a Cartesian Product has nothing to do with tables. *Cartesian product is a mathematical operation that returns a set (or product set or simply product) from multiple sets. That is, for sets A and B, the Cartesian product A × B.* In this case, my answer is spot on as the number of rows returned (Cartesian Product) is the number of rows returned from each table multiplied (A x B x C etc). – Erik Philips Nov 03 '17 at 19:20
  • With multiple `Include`s, EF builds a query with `UNION`s, so the number of records doesn't *multiply* but *add*, see [here](https://stackoverflow.com/a/34732579/861716). Still, `Include` should be used with care. – Gert Arnold Feb 22 '19 at 12:42
  • @GertArnold [in testing, I cannot get EF to use a UNION](https://dotnetfiddle.net/FL6EkF). – Erik Philips Feb 22 '19 at 17:00
  • It happens when there are two Includes of independent collections, not nested collections. For example, `.Include("Attachments").Include("Tags")` would be translated into one `UNION`. Nested collections don't cause a Cartesian product. – Gert Arnold Feb 22 '19 at 19:43
  • @GertArnold Ah yes you're right. The `.Include(....Select(..))` causes the Cartesian product. Will have to update the answer thanks. – Erik Philips Feb 22 '19 at 20:22
11

I don't see anything obviously wrong with your LINQ query (.AsQueryable() shouldn't be mandatory, but it won't change anything if you remove it). Of course, don't include unnecessary navigation properties (each one adds a SQL JOIN), but if everything is required, it should be OK.

Now, as the C# code looks OK, it's time to look at the generated SQL. As you already did, the first step is to retrieve the SQL query that is actually executed. There are .NET ways of doing it; for SQL Server, I personally always start a SQL Server profiling session.
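For example, a minimal sketch of the .NET-side option in EF6 (the Database.Log hook), assuming the ctx variable from the question:

// EF6: write every SQL statement the context generates to the debug output.
ctx.Database.Log = sql => System.Diagnostics.Debug.WriteLine(sql);

MiniProfiler, which the OP already installed, surfaces the same generated statements per request.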

Once you have the SQL query, try to execute it directly against your database, and don't forget to include the actual execution plan. This will show you exactly which part of the query takes the majority of the time. It will even tell you whether there are obvious missing indexes.

Now the question is: should you add all the indexes SQL Server tells you are missing? Not necessarily. See for example Don't just blindly create those missing indexes. You'll have to choose which indexes should be added and which shouldn't.

As the code-first approach created the indexes for you, I'm assuming those are indexes on the primary and foreign keys only. That's a good start, but it's not enough. I don't know the number of rows in your tables, but an obvious index that only you can add (no code-generation tool can do it, because it's related to your business queries) is, for example, an index on the CreatedDate column, since you're ordering your items by that value. If you don't, SQL Server will have to perform a table scan on 1M rows, which will of course be disastrous in terms of performance.
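As an illustration, a minimal code-first sketch of such an index using the EF 6.1+ IndexAttribute (the property names come from the query above; the index names are assumptions):

using System;
using System.ComponentModel.DataAnnotations.Schema;

public class Question
{
    public int Id { get; set; }

    // Non-clustered index to support ORDER BY CreatedDate DESC over 1M+ rows.
    [Index("IX_Questions_CreatedDate")]
    public DateTime CreatedDate { get; set; }

    // A similar index on ViewCount supports the "popular" sort.
    [Index("IX_Questions_ViewCount")]
    public int ViewCount { get; set; }
}

With code-first migrations, adding the attribute and creating a new migration produces the corresponding CreateIndex calls.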

So:

  • try to remove some Include if you can
  • look at the actual execution plan to see where the performance issue in your query is
  • add only the missing indexes that make sense, depending on how you're ordering/filtering the data you're getting from the DB
ken2k
  • I've done exactly what you said, and I've taken the extra step of creating a new view model to make the projection smaller, but that doesn't seem to help much. After looking at MiniProfiler, I've also implemented a simple System.Runtime.Caching.MemoryCache on other cacheable queries. As for the execution plan, I see several clustered index seeks (6%); are those normal or 'already optimized'? – Lee Gary Mar 04 '14 at 10:00
  • @LeeGary Does the SQL query require almost the same time the LINQ query does? If yes, then it's not the problem of C#/LINQ/EF, but really a problem with your database (indexes or schema). What you could try is to execute a simple `SELECT TOP 20 FROM XXX INNER JOIN...` with all the `JOIN` your query needs to see how much time it requires. – ken2k Mar 04 '14 at 10:14
  • @LeeGary Also, try to remove `.Include("Downvotes")` and `.Include("Upvotes")`, this shouldn't be required as you include a nested entity (`CreatedBy`). Hopefully EF is smart enough so it won't affect the resulting query, but it's worth trying. – ken2k Mar 04 '14 at 10:16
  • @LeeGary As for the execution plan, could you add it to the question? – ken2k Mar 04 '14 at 10:17
  • The execution plan is 3.5k lines long, any good suggestion where I can upload it? – Lee Gary Mar 04 '14 at 14:25
7

As you already know, the Include method generates monstrous SQL.

Disclaimer: I'm the owner of the project Entity Framework Plus (EF+)

The EF+ Query IncludeOptimized method allows optimizing the generated SQL, exactly like EF Core does.

Instead of generating one monstrous SQL statement, multiple SQL statements are generated (one per include). As a bonus, this feature also allows filtering the related entities (see the sketch after the example below).

Docs: EF+ Query IncludeOptimized

var query = ctx.Questions
               .AsNoTracking()
               .IncludeOptimized(x => x.Attachments)                                
               .IncludeOptimized(x => x.Location)
               .IncludeOptimized(x => x.CreatedBy) //IdentityUser
               .IncludeOptimized(x => x.Tags)
               .IncludeOptimized(x => x.Upvotes)
               .IncludeOptimized(x => x.Upvotes.Select(y => y.CreatedBy))
               .IncludeOptimized(x => x.Downvotes)
               .IncludeOptimized(x => x.Downvotes.Select(y => y.CreatedBy))
               .AsQueryable();
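For the filtering bonus mentioned above, a hedged sketch (the IsDeleted property is hypothetical, not part of the OP's model):

// Only load upvotes that are not soft-deleted, something a plain .Include() cannot do in EF6.
var filteredQuestions = ctx.Questions
                           .AsNoTracking()
                           .IncludeOptimized(x => x.Upvotes.Where(y => !y.IsDeleted))
                           .ToList();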
Jonathan Magnan
  • How does EF+ work with a model that supports lazy loading? As I've tested a query using IncludeOptimized, it still triggers lazy loading even if I've, theoretically, already eager-loaded the data. Any advice? To reproduce in your example, let's say that Upvotes and Downvotes are lazy-loaded (with virtual in the model). – Iúri dos Anjos Apr 14 '20 at 21:04
4

Take a look at section 8.2.2 of this document from Microsoft:

8.2.2 Performance concerns with multiple Includes

When we hear performance questions that involve server response time problems, the source of the issue is frequently queries with multiple Include statements. While including related entities in a query is powerful, it's important to understand what's happening under the covers.

It takes a relatively long time for a query with multiple Include statements in it to go through our internal plan compiler to produce the store command. The majority of this time is spent trying to optimize the resulting query. The generated store command will contain an Outer Join or Union for each Include, depending on your mapping. Queries like this will bring in large connected graphs from your database in a single payload, which will exacerbate any bandwidth issues, especially when there is a lot of redundancy in the payload (i.e. with multiple levels of Include to traverse associations in the one-to-many direction).

You can check for cases where your queries are returning excessively large payloads by accessing the underlying TSQL for the query by using ToTraceString and executing the store command in SQL Server Management Studio to see the payload size. In such cases you can try to reduce the number of Include statements in your query to just bring in the data you need. Or you may be able to break your query into a smaller sequence of subqueries, for example:

Before breaking the query:

using (NorthwindEntities context = new NorthwindEntities())
{
    var customers = from c in context.Customers.Include(c => c.Orders)
                    where c.LastName.StartsWith(lastNameParameter)
                    select c;

    foreach (Customer customer in customers)
    {
        ...
    }
}

After breaking the query:

using (NorthwindEntities context = new NorthwindEntities())
{
    var orders = from o in context.Orders
                 where o.Customer.LastName.StartsWith(lastNameParameter)
                 select o;

    orders.Load();

    var customers = from c in context.Customers
                    where c.LastName.StartsWith(lastNameParameter)
                    select c;

    foreach (Customer customer in customers)
    {
        ...
    }
}

This will work only on tracked queries, as we are making use of the ability the context has to perform identity resolution and association fixup automatically.

As with lazy loading, the tradeoff will be more queries for smaller payloads. You can also use projections of individual properties to explicitly select only the data you need from each entity, but you will not be loading entities in this case, and updates will not be supported.
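For illustration, a minimal sketch of that projection approach applied to the OP's query (QuestionSummary is a hypothetical read-only view model; property names are assumed, not from the OP's model):

// Project only the columns the question list page actually needs.
var page = ctx.Questions
              .AsNoTracking()
              .OrderByDescending(q => q.CreatedDate)
              .Select(q => new QuestionSummary
              {
                  Id = q.Id,
                  Title = q.Title,
                  ViewCount = q.ViewCount,
                  CreatedByName = q.CreatedBy.UserName,
                  TagNames = q.Tags.Select(t => t.Name)
              })
              .Skip(skipCount)
              .Take(pageSize)
              .ToList();

As the quote notes, these QuestionSummary instances are not tracked entities, so they cannot be updated through the context.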

adam0101
4

I disagree with Ken2k's answer and am surprised that it has as many upvotes as it does.

The code may be fine in the sense that it compiles, but having that many includes is definitely not OK if you care about your queries being performant. See 8.2.2 of MSFT's EF6 Performance Whitepaper:

When we hear performance questions that involve server response time problems, the source of the issue is frequently queries with multiple Include statements.

Taking a look at the TSQL that EF generates from eagerly loading that many navigation properties in one query (via the numerous .Include() statements) will make it obvious why this is no good. You're going to end up with way too many EF generated joins in one query.

Break up your query so that there are no more than 2 .Include() statements per table fetch. You can do a separate .Load() per dataset but you most likely don't need to go that far, YMMV.

var query = ctx.Questions.Where(...);
// Loads Questions, Attachments, Location tables
query.Include(q => q.Attachments)
     .Include(q => q.Location)
     .Load();

// Loads IdentityUsers Table
query.Select(q => q.CreatedBy).Load();
// Loads Tags
query.Select(q => q.Tags).Load();

// Loads Upvotes and Downvotes
query.Include(q => q.Upvotes)
     .Include(q => q.Downvotes)
     .Load();

// Assuming Upvotes.CreatedBy and Downvotes.CreatedBy are also an IdentityUser,
// then you don't need to do anything further as the IdentityUser table is loaded
// from query.Select(q => q.CreatedBy).Load(); and EF will make this association for you

Erik mentions that you can use .AsNoTracking(), and I'm not totally sure at what point he is recommending to use it, but if you need to consume the resulting entity set with populated navigation properties (for example, the query above) you cannot use .AsNoTracking(), as this invalidates the association between entities in EF's cache (once again, from 8.2.2 of MSFT's doc):

This [breaking up the EF query] will work only on tracked queries, as we are making use of the ability the context has to perform identity resolution and association fixup automatically.

For added performance, if your query is read-only (i.e. you are not updating values), you can set the following properties on your DbContext (assuming you eagerly load all required data):

Configuration.LazyLoadingEnabled = false;
Configuration.AutoDetectChangesEnabled = false;
Configuration.ProxyCreationEnabled = false;

Finally your DbContext should have a Per-Request lifetime/scope.
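For example, a minimal sketch with Autofac in MVC5 (an assumption about the container; any DI container with per-request scoping works, and MyDbContext is a placeholder for your context type):

using System.Web.Mvc;
using Autofac;
using Autofac.Integration.Mvc;

var builder = new ContainerBuilder();
builder.RegisterControllers(typeof(MvcApplication).Assembly);

// One DbContext instance per HTTP request, disposed when the request ends.
builder.RegisterType<MyDbContext>()
       .AsSelf()
       .InstancePerRequest();

DependencyResolver.SetResolver(new AutofacDependencyResolver(builder.Build()));

Creating the context in the controller and disposing it in Dispose() achieves the same per-request scope without a container.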

To Ken's point: certainly, if your database architecture is a mess, running the profiler and viewing the execution plan can help you tweak indexes and identify other problems. But before even thinking of opening the profiler, break up your query, limiting the number of .Include()s per .Load(), and you should see a tremendous speed improvement from that alone.

Clark
  • Good points! Note that disabling lazy loading (either by `LazyLoadingEnabled` or `ProxyCreationEnabled`) is always recommended in this scenario, because collections populated by relationship fixup will still trigger lazy loading when accessed, because they're not marked as loaded. – Gert Arnold Feb 23 '19 at 11:19
  • I tried breaking my query from multiple includes down to 2 includes, but I don't really see a difference. – Fakhar Ahmad Rasul Mar 07 '20 at 14:28