
UPDATE 3: According to this announcement, this has been addressed by the EF team in EF6 alpha 2.

UPDATE 2: I've created a suggestion to fix this problem. To vote for it, go here.

Consider a SQL database with one very simple table.

CREATE TABLE Main (Id INT PRIMARY KEY)

I populate the table with 10,000 records.

WITH Numbers AS
(
  SELECT 1 AS Id
  UNION ALL
  SELECT Id + 1 AS Id FROM Numbers WHERE Id <= 10000
)
INSERT Main (Id)
SELECT Id FROM Numbers
OPTION (MAXRECURSION 0)

I build an EF model for the table and run the following query in LINQPad (I am using "C# Statements" mode so LINQPad doesn't create a dump automatically).

var rows = 
  Main
  .ToArray();

Execution time is ~0.07 seconds. Now I add the Contains operator and re-run the query.

var ids = Main.Select(a => a.Id).ToArray();
var rows = 
  Main
  .Where (a => ids.Contains(a.Id))
  .ToArray();

Execution time for this case is 20.14 seconds (288 times slower)!

At first I suspected that the T-SQL emitted for the query was taking longer to execute, so I tried cutting and pasting it from LINQPad's SQL pane into SQL Server Management Studio.

SET NOCOUNT ON
SET STATISTICS TIME ON
SELECT 
[Extent1].[Id] AS [Id]
FROM [dbo].[Main] AS [Extent1]
WHERE [Extent1].[Id] IN (1,2,3,4,5,6,7,8,...

And the result was

SQL Server Execution Times:
  CPU time = 0 ms,  elapsed time = 88 ms.

Next I suspected LINQPad was causing the problem, but performance is the same whether I run it in LINQPad or in a console application.

So, it appears that the problem is somewhere within Entity Framework.

Am I doing something wrong here? This is a time-critical part of my code, so is there something I can do to speed up performance?

I am using Entity Framework 4.1 and SQL Server 2008 R2.

UPDATE 1:

In the discussion below there were some questions about whether the delay occurred while EF was building the initial query or while it was parsing the data it received back. To test this I ran the following code,

var ids = Main.Select(a => a.Id).ToArray();
var rows = 
  (ObjectQuery<MainRow>)
  Main
  .Where (a => ids.Contains(a.Id));
var sql = rows.ToTraceString();

which forces EF to generate the query without executing it against the database. The result was that this code required ~20 seconds to run, so it appears that almost all of the time is taken in building the initial query.

CompiledQuery to the rescue then? Not so fast ... CompiledQuery requires the parameters passed into the query to be fundamental types (int, string, float, and so on). It won't accept arrays or IEnumerable, so I can't use it for a list of Ids.
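
For illustration, the kind of compiled query I would need looks roughly like this (a sketch only: MyEntities stands in for the generated ObjectContext, and Main for its entity set). The delegate compiles, but invoking it fails because CompiledQuery supports only scalar parameters:

// Sketch: this compiles, but invoking the delegate throws, because
// CompiledQuery only supports scalar parameters (Int32, String, Guid, ...),
// not arrays or IEnumerable<T>.
static readonly Func<MyEntities, int[], IQueryable<MainRow>> selectByIds =
    CompiledQuery.Compile((MyEntities ctx, int[] ids) =>
        ctx.Main.Where(a => ids.Contains(a.Id)));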

Mike
  • Have you tried `var qry = Main.Where (a => ids.Contains(a.Id)); var rows = qry.ToArray();` to see which part of the query is taking the time? – Andrew Cooper Oct 26 '11 at 01:11
  • It is not EF that degrades your query; it is the actual query you are trying to run. Could you explain what you are trying to do? Perhaps there is a better approach to your needs. – Kris Ivanov Oct 26 '11 at 01:21
  • @AndrewCooper I just tried it, and due to deferred execution the first statement (without the ToArray) executes almost instantaneously. The query, including the Contains filtering, doesn't actually run until you execute the ToArray(). – Mike Oct 26 '11 at 04:05
  • @KrisIvanov I have a database with 500,000+ records and I need to select 1,000 of them at a time based on their ids alone. I could execute 1,000 individual queries, but some approach that pulls them all back in one round trip would probably be more efficient ... at least, that's what I thought before I encountered the problem above. :) – Mike Oct 26 '11 at 04:08
  • @Mike, are those 1,000 ids a random sequence, or do they follow some pattern, for example `10 <= ids <= 500`? – Kris Ivanov Oct 26 '11 at 11:59
  • Just an update on this: EF6 alpha 2 includes an improvement that accelerates the translation of Enumerable.Contains. See the announcement here: http://blogs.msdn.com/b/adonet/archive/2012/12/10/ef6-alpha-2-available-on-nuget.aspx. My own tests show that translating list.Contains(x) for a list with 100,000 int elements now takes well under a second, and the time grows approximately linearly with the number of elements in the list. Thanks for your feedback and helping us improve EF! – divega Dec 11 '12 at 17:19
  • Beware of this... queries with any IEnumerable parameter cannot be cached, which can cause pretty serious side effects when your query plans are complicated. If you have to run the operations many times (e.g. using Contains to get chunks of data) you can hit some pretty nasty query recompile times! Check the source for yourself and you can see that `parent._recompileRequired = () => true;` happens for all queries containing an IEnumerable parameter. Boo! – jocull Nov 15 '13 at 17:29

8 Answers


UPDATE: With the addition of InExpression in EF6, the performance of processing Enumerable.Contains improved dramatically. The approach described in this answer is no longer necessary.

You are right that most of the time is spent processing the translation of the query. EF's provider model doesn't currently include an expression that represents an IN clause, so ADO.NET providers can't support IN natively. Instead, the implementation of Enumerable.Contains translates it to a tree of OR expressions, i.e. for something that in C# looks like this:

new []{1, 2, 3, 4}.Contains(i)

... we will generate a DbExpression tree that could be represented like this:

((1 = @i) OR (2 = @i)) OR ((3 = @i) OR (4 = @i))

(The expression trees have to be balanced: if we put all the ORs on a single long spine, there would be a greater chance of the expression visitor hitting a stack overflow. Yes, we actually did hit that in our testing.)

We later send a tree like this to the ADO.NET provider, which may recognize this pattern and reduce it to an IN clause during SQL generation.

When we added support for Enumerable.Contains in EF4, we thought it was desirable to do it without introducing support for IN expressions in the provider model, and honestly, 10,000 is far more than the number of elements we anticipated customers would pass to Enumerable.Contains. That said, I understand that this is an annoyance and that the manipulation of expression trees makes things too expensive in your particular scenario.

I discussed this with one of our developers and we believe that in the future we could change the implementation by adding first-class support for IN. I will make sure this is added to our backlog, but I cannot promise when it will make it given there are many other improvements we would like to make.

To the workarounds already suggested in this thread, I would add the following:

Consider creating a method that balances the number of database roundtrips against the number of elements you pass to Contains. For instance, in my own testing I observed that computing and executing a query with 100 elements against a local instance of SQL Server takes 1/60 of a second. If you can write your query in such a way that executing 100 queries with 100 different sets of ids gives you a result equivalent to the query with 10,000 elements, then you can get the results in approximately 1.67 seconds instead of 18 seconds.

Different chunk sizes will work better depending on the query and the latency of the database connection. For certain queries, e.g. if the sequence passed has duplicates or if Enumerable.Contains is used in a nested condition, you may obtain duplicate elements in the results.

Here is a code snippet. (Sorry if the code used to slice the input into chunks looks a little too complex; there are simpler ways to achieve the same thing, but I was trying to come up with a pattern that preserves streaming for the sequence, and I couldn't find anything like it in LINQ, so I probably overdid that part.)

Usage:

var list = context.GetMainItems(ids).ToList();

Method for context or repository:

public partial class ContainsTestEntities
{
    public IEnumerable<Main> GetMainItems(IEnumerable<int> ids, int chunkSize = 100)
    {
        // Issue one database query per chunk of ids and stream the results.
        foreach (var chunk in ids.Chunk(chunkSize))
        {
            var q = this.MainItems.Where(a => chunk.Contains(a.Id));
            foreach (var item in q)
            {
                yield return item;
            }
        }
    }
}

Extension methods for slicing enumerable sequences:

public static class EnumerableSlicing
{

    private class Status
    {
        public bool EndOfSequence;
    }

    private static IEnumerable<T> TakeOnEnumerator<T>(IEnumerator<T> enumerator, int count, 
        Status status)
    {
        // Yield at most 'count' items; if the source runs out first, record it
        // in status.EndOfSequence via the assignment inside the condition.
        // Note '--count >= 0' rather than '--count > 0'; the latter would make
        // every chunk one element shorter than chunkSize.
        while (--count >= 0 && (enumerator.MoveNext() || !(status.EndOfSequence = true)))
        {
            yield return enumerator.Current;
        }
    }

    public static IEnumerable<IEnumerable<T>> Chunk<T>(this IEnumerable<T> items, int chunkSize)
    {
        if (chunkSize < 1)
        {
            throw new ArgumentException("Chunks should not be smaller than 1 element");
        }
        var status = new Status { EndOfSequence = false };
        using (var enumerator = items.GetEnumerator())
        {
            // Keep handing out lazy chunk views until the source is exhausted.
            while (!status.EndOfSequence)
            {
                yield return TakeOnEnumerator(enumerator, chunkSize, status);
            }
        }
    }
}

Hope this helps!

divega
  • To explain the `!(status.EndOfSequence = true)` in the TakeOnEnumerator method: the assignment's side effect is to mark `status.EndOfSequence` as `true` only when there are remaining items to be fetched but you've hit the end of the enumeration; the expression itself always evaluates to `!true`, so it never affects the overall condition. – arviman Dec 03 '14 at 07:03
  • Maybe the performance of processing `Enumerable.Contains` improved dramatically in EF 6 compared to the previous versions of EF. But, unfortunately, it's still far from satisfactory/production-ready in our use-cases. – Myk Jan 27 '19 at 18:13
  • @Nik If you are using EF Core you can try this: https://stackoverflow.com/a/70587979/2206145 – yv989c Jan 07 '22 at 00:00

If you hit a performance problem like this that blocks you, don't spend ages trying to solve it, because you will most probably not succeed: you would have to take it up with MS directly (if you have premium support), and that takes ages.

Use a workaround instead, and in the case of a performance issue with EF, a workaround means direct SQL. There is nothing bad about it. The idea that using EF means never using SQL again is a lie. You have SQL Server 2008 R2, so:

  • Create a stored procedure accepting a table-valued parameter to pass your ids
  • Let your stored procedure return multiple result sets to emulate Include logic in an optimal way
  • If you need some complex query building, use dynamic SQL inside the stored procedure
  • Use SqlDataReader to get the results and construct your entities
  • Attach them to the context and work with them as if they were loaded from EF

If performance is critical for you, you will not find a better solution. This procedure cannot be mapped and executed by EF, because the current version supports neither table-valued parameters nor multiple result sets.
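
To illustrate the shape of this approach, here is a minimal sketch (my names, not Ladislav's code: the table type dbo.IdList, the procedure dbo.GetMainByIds, the context type MyEntities, and its Main set are all assumptions). It needs using directives for System.Collections.Generic, System.Data, System.Data.SqlClient, and System.Linq:

// One-time SQL setup (assumed, not from this thread):
//   CREATE TYPE dbo.IdList AS TABLE (Id INT PRIMARY KEY);
//   CREATE PROCEDURE dbo.GetMainByIds @Ids dbo.IdList READONLY AS
//     SELECT m.Id FROM dbo.Main AS m JOIN @Ids AS i ON m.Id = i.Id;

public static List<Main> GetMainByIds(MyEntities context, string connectionString, IEnumerable<int> ids)
{
    // Pack the ids into a DataTable whose shape matches dbo.IdList.
    var table = new DataTable();
    table.Columns.Add("Id", typeof(int));
    foreach (var id in ids)
        table.Rows.Add(id);

    var results = new List<Main>();
    using (var connection = new SqlConnection(connectionString))
    using (var command = connection.CreateCommand())
    {
        command.CommandType = CommandType.StoredProcedure;
        command.CommandText = "dbo.GetMainByIds";
        var parameter = command.Parameters.AddWithValue("@Ids", table);
        parameter.SqlDbType = SqlDbType.Structured;
        parameter.TypeName = "dbo.IdList";
        connection.Open();
        using (var reader = command.ExecuteReader())
        {
            while (reader.Read())
            {
                // Construct the entity by hand, then attach it so the
                // context tracks it as if EF had loaded it.
                var entity = new Main { Id = reader.GetInt32(0) };
                context.Main.Attach(entity);
                results.Add(entity);
            }
        }
    }
    return results;
}

If the procedure returned multiple result sets (e.g. to emulate Include), you would advance between them with reader.NextResult().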

Ladislav Mrnka
  • @Ladislav Mrnka We encountered a similar performance issue due to list.Contains(). We are going to try creating procedures by passing ids. Should we expect any performance hit if we run this procedure through EF? – Kurubaran Jul 28 '15 at 19:37

We were able to solve the EF Contains problem by adding an intermediate table and joining on that table from the LINQ query that needed the Contains clause. We got amazing results with this approach. We have a large EF model, and since "Contains" is not allowed when pre-compiling EF queries, we were getting very poor performance for queries that used the "Contains" clause.

An overview:

  • Create a table in SQL Server - for example HelperForContainsOfIntType with a HelperID column of Guid data-type and a ReferenceID column of int data-type. Create additional tables with ReferenceID of other data-types as needed.

  • Create an Entity / EntitySet for HelperForContainsOfIntType and other such tables in the EF model. Create a separate Entity / EntitySet for each data-type as needed.

  • Create a helper method in .NET code which takes an IEnumerable<int> as input and returns a Guid. This method generates a new Guid and inserts the values from the IEnumerable<int> into HelperForContainsOfIntType along with the generated Guid. The method then returns this newly generated Guid to the caller. For fast insertion into the HelperForContainsOfIntType table, create a stored procedure which takes a list of values as input and does the insertion. See Table-Valued Parameters in SQL Server 2008 (ADO.NET). Create different helpers for different data-types, or create a generic helper method to handle each data-type. (A sketch of such a helper appears after this list.)

  • Create an EF compiled query similar to the one below:

    static Func<MyEntities, Guid, IEnumerable<Customer>> _selectCustomers =
        CompiledQuery.Compile(
            (MyEntities db, Guid containsHelperID) =>
                from cust in db.Customers
                join x in db.HelperForContainsOfIntType on cust.CustomerID equals x.ReferenceID
                where x.HelperID == containsHelperID
                select cust
        );
    
  • Call the helper method with values to be used in the Contains clause and get the Guid to use in the query. For example:

    var containsHelperID = dbHelper.InsertIntoHelperForContainsOfIntType(new int[] { 1, 2, 3 });
    var result = _selectCustomers(_dbContext, containsHelperID).ToList();
    
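A minimal sketch of such a helper, under stated assumptions (a stored procedure dbo.InsertHelperForContainsOfIntType accepting a table-valued parameter with HelperID and ReferenceID columns, and a _connectionString field; none of these names come from the original answer):

public Guid InsertIntoHelperForContainsOfIntType(IEnumerable<int> values)
{
    var helperID = Guid.NewGuid();

    // Shape the rows for the table-valued parameter.
    var table = new DataTable();
    table.Columns.Add("HelperID", typeof(Guid));
    table.Columns.Add("ReferenceID", typeof(int));
    foreach (var value in values)
        table.Rows.Add(helperID, value);

    using (var connection = new SqlConnection(_connectionString))
    using (var command = connection.CreateCommand())
    {
        command.CommandType = CommandType.StoredProcedure;
        command.CommandText = "dbo.InsertHelperForContainsOfIntType";
        var parameter = command.Parameters.AddWithValue("@Rows", table);
        parameter.SqlDbType = SqlDbType.Structured;
        connection.Open();
        command.ExecuteNonQuery();
    }

    return helperID;
}

Note that every call adds a new batch of rows under a fresh Guid, so you would also want some cleanup strategy for old helper rows.
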
Dhwanil Shah

I'm not familiar with Entity Framework, but is the perf better if you do the following?

Instead of this:

var ids = Main.Select(a => a.Id).ToArray();
var rows = Main.Where (a => ids.Contains(a.Id)).ToArray();

how about this (assuming the ID is an int):

var ids = new HashSet<int>(Main.Select(a => a.Id));
var rows = Main.Where (a => ids.Contains(a.Id)).ToArray();
Shiv
  • I don't know why or how, but it worked like a charm :) Thank you very much :) – Wahid Bitar Mar 10 '14 at 10:40
  • The explanation of why the performance is better is that the int[].Contains call in the first snippet is O(n) - potentially a full array scan - whereas the HashSet.Contains call is O(1). See https://stackoverflow.com/questions/9812020/what-is-the-lookup-time-complexity-of-hashsettiequalitycomparert for HashSet performance. – Shiv Mar 11 '14 at 04:23
  • @Shiv I don't believe that's correct. EF will take any collection and translate it into SQL. The type of collection should be a non-issue. – Rob Mar 05 '15 at 15:01
  • @Rob I'm skeptical; I'm at a loss to explain the performance difference if that is the case. Might have to analyse the binary to see what it has done. – Shiv Mar 05 '15 at 22:19
  • HashSet is not IEnumerable. IEnumerables calling .Contains in LINQ perform poorly (at least pre-EF6). – Jason Beck Nov 06 '17 at 19:34
  • @Rob is correct here I think. `new []{1, 2, 3, 4}.Contains(column)` will simply be translated to `... WHERE column IN (1, 2, 3, 4)` in SQL, so using a `HashSet` will not make any difference in performance. This optimization is only worth it if `Main` is an `IEnumerable`. – RoadRunner Apr 20 '20 at 10:17

Editing my original answer: there is a possible workaround, depending on the complexity of your entities. If you know the SQL that EF generates to populate your entities, you can execute it directly using DbContext.Database.SqlQuery. In EF 4, I think you could use ObjectContext.ExecuteStoreQuery, but I didn't try it.

For example, using the code from my original answer below to generate the SQL statement with a StringBuilder, I was able to do the following

var rows = db.Database.SqlQuery<Main>(sql).ToArray();

and the total time went from approximately 26 seconds to 0.5 seconds.

I will be the first to say it's ugly, and hopefully a better solution presents itself.

update

After a bit more thought, I realized that if you use a join to filter your results, EF doesn't have to build that long list of ids. This could be complex depending on the number of concurrent queries, but I believe you could use user ids or session ids to isolate them.

To test this, I created a Target table with the same schema as Main. I then used a StringBuilder to create INSERT commands to populate the Target table in batches of 1,000, since that's the most SQL Server will accept in a single INSERT. Directly executing the SQL statements was much faster than going through EF (approx 0.3 seconds vs. 2.5 seconds), and I believe that would be OK since the table schema shouldn't change.

Finally, selecting using a join resulted in a much simpler query and executed in less than 0.5 seconds.

ExecuteStoreCommand("DELETE Target");

var ids = Main.Select(a => a.Id).ToArray();
var sb = new StringBuilder();

for (int i = 0; i < 10; i++)
{
    sb.Append("INSERT INTO Target(Id) VALUES (");
    for (int j = 1; j <= 1000; j++)
    {
        if (j > 1)
        {
            sb.Append(",(");
        }
        sb.Append(i * 1000 + j);
        sb.Append(")");
    }
    ExecuteStoreCommand(sb.ToString());
    sb.Clear();
}

var rows = (from m in Main
            join t in Target on m.Id equals t.Id
            select m).ToArray();

rows.Length.Dump();

And the SQL generated by EF for the join:

SELECT 
[Extent1].[Id] AS [Id]
FROM  [dbo].[Main] AS [Extent1]
INNER JOIN [dbo].[Target] AS [Extent2] ON [Extent1].[Id] = [Extent2].[Id]

(original answer)

This is not an answer, but I wanted to share some additional information and it is far too long to fit in a comment. I was able to reproduce your results, and have a few other things to add:

SQL Profiler shows the delay is between the execution of the first query (Main.Select) and the second Main.Where query, so I suspected the problem was in generating and sending a query of that size (48,980 bytes).

However, building the same SQL statement dynamically in T-SQL takes less than 1 second; and taking the ids from your Main.Select statement, building the same SQL statement, and executing it using a SqlCommand took 0.112 seconds, including the time to write the contents to the console.

At this point, I suspect that EF is doing some analysis/processing for each of the 10,000 ids as it builds the query. Wish I could provide a definitive answer and solution :(.

Here's the code I tried in SSMS and LINQPad (please don't critique too harshly, I'm in a rush trying to leave work):

declare @sql nvarchar(max)

set @sql = 'SELECT 
[Extent1].[Id] AS [Id]
FROM [dbo].[Main] AS [Extent1]
WHERE [Extent1].[Id] IN ('

declare @count int = 0
while @count < 10000
begin
    if @count > 0 set @sql = @sql + ','
    set @count = @count + 1
    set @sql = @sql + cast(@count as nvarchar)
end
set @sql = @sql + ')'

exec(@sql)

var ids = Mains.Select(a => a.Id).ToArray();

var sb = new StringBuilder();
sb.Append("SELECT [Extent1].[Id] AS [Id] FROM [dbo].[Main] AS [Extent1] WHERE [Extent1].[Id] IN (");
for(int i = 0; i < ids.Length; i++)
{
    if (i > 0) 
        sb.Append(",");     
    sb.Append(ids[i].ToString());
}
sb.Append(")");

using (SqlConnection connection = new SqlConnection("server = localhost;database = Test;integrated security = true"))
using (SqlCommand command = connection.CreateCommand())
{
    command.CommandText = sb.ToString();
    connection.Open();
    using(SqlDataReader reader = command.ExecuteReader())
    {
        while(reader.Read())
        {
            Console.WriteLine(reader.GetInt32(0));
        }
    }
}
Jeff Ogata
  • Thank you for your work on this. Knowing you were able to reproduce it makes me feel better--at least I'm not crazy! Unfortunately your workaround doesn't really help in my case because, as you might guess, the example I gave here was simplified as much as possible to isolate the problem. My actual query involves a fairly complicated schema, .Include()'s on several other tables, and a few other LINQ operators as well. – Mike Oct 26 '11 at 04:24
  • @Mike, I added another idea which would work for complex entities. Hopefully it wouldn't be too difficult to implement if you have no other choice. – Jeff Ogata Oct 26 '11 at 06:34
  • I did some tests and I think you're correct that the delay is in creating the SQL before it's executed. I've updated my question with the details. – Mike Oct 26 '11 at 15:14
  • @Mike, were you able to try joining to the ids (see the update in my answer)? – Jeff Ogata Oct 26 '11 at 16:11
  • I wound up using a variation of your approach to solve the performance problem. It wound up being pretty ugly, but probably the best option until (and if) Microsoft resolves this issue. – Mike Nov 15 '11 at 18:06

A cacheable alternative to Contains?

This just bit me, so I've added my two pence to the Entity Framework Feature Suggestions link.

The issue is definitely in generating the SQL. I have a client on whose data the query generation took 4 seconds but the execution took 0.1 seconds.

I noticed that when using dynamic LINQ and ORs, the SQL generation took just as long, but it generated something that could be cached. So when executing it again, it was down to 0.2 seconds.

Note that an IN clause was still generated in the SQL.

Just something else to consider if you can stomach the initial hit, your array count does not change much, and you run the query a lot. (Tested in LINQPad.)
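
For reference, here is a sketch of what "dynamic LINQ and ORs" can look like with hand-built expression trees (my illustration, not Dave's code; it assumes the Main entity from the question and needs using System, System.Linq, and System.Linq.Expressions):

static Expression<Func<Main, bool>> BuildIdPredicate(int[] ids)
{
    if (ids.Length == 0)
        return a => false;

    var parameter = Expression.Parameter(typeof(Main), "a");
    var idProperty = Expression.Property(parameter, "Id");

    // Chain equality checks into a.Id == ids[0] || a.Id == ids[1] || ...
    // The ids are embedded as constants, so for a fixed set of ids the
    // generated SQL is stable and can be reused from the cache.
    Expression body = Expression.Equal(idProperty, Expression.Constant(ids[0]));
    foreach (var id in ids.Skip(1))
    {
        body = Expression.OrElse(body,
            Expression.Equal(idProperty, Expression.Constant(id)));
    }

    return Expression.Lambda<Func<Main, bool>>(body, parameter);
}

// Usage: var rows = context.Main.Where(BuildIdPredicate(ids)).ToArray();

Keep divega's caveat above in mind: this builds one long unbalanced OR spine, which can overflow EF's expression visitor for very large lists, so you may need to balance the tree.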

Dave

The issue is with Entity Framework's SQL generation. It cannot cache the query if one of the parameters is a list.

To get EF to cache your query, you can convert your list to a string and do a .Contains on the string.

So, for example, this code would run much faster, since EF can cache the query:

var ids = Main.Select(a => a.Id).ToArray();
var idsString = "|" + String.Join("|", ids) + "|";
var rows = Main
.Where (a => idsString.Contains("|" + a.Id + "|"))
.ToArray();

When this query is generated, it will likely use a LIKE instead of an IN, which speeds up the C# side but could potentially slow down your SQL. In my case I didn't notice any performance decrease in SQL execution, and the C# ran significantly faster.
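
The generated predicate should end up looking something like this (an approximation, not captured output; the parameter name is illustrative and the exact SQL depends on the EF version):

-- Approximate shape only. The Id column is buried inside the pattern
-- expression, which is why this predicate cannot seek an index on Id.
SELECT [Extent1].[Id] AS [Id]
FROM [dbo].[Main] AS [Extent1]
WHERE @idsString LIKE N'%|' + CAST([Extent1].[Id] AS nvarchar(max)) + N'|%'

That inability to use an index is what the first comment below points out.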

user2704238
  • Nice idea, but this will not make use of any index on the column in question. – spender Apr 27 '17 at 15:54
  • Yes, that's true, which is why I mentioned it could slow down the SQL execution. I guess this is just a potential alternative if you can't use a predicate builder and you are working with a small enough data set that you can afford not to use an index. I should also have mentioned that a predicate builder is the preferred option. – user2704238 May 23 '17 at 14:51
  • What an AMAZING solution. We managed to cut our production query runtime from ~12,600 milliseconds to just ~18 milliseconds. This is a HUGE improvement. Thank you very much!!! – Jacob Jun 08 '19 at 01:01
  • @spender This solution will honor your indexes: https://stackoverflow.com/a/70587979/2206145 – yv989c Jan 06 '22 at 23:55