
I have a List of Ids containing a large number of items - up to 10,000.

I am looking for the most efficient way to exclude these from an IQueryable/List.

Due to the complexity of the process involved in obtaining this list of Ids, it isn't possible to do this within a query.

The example below has extremely high overhead. Can anybody explain possible reasons for this, and whether there's a better way to achieve it?

results = from q1 in results 
  where excludedRecords.All(x => x != q1.ItemId)
  select q1;
Nick
  • How many items in results? How many in excludedRecords? What type is excludedRecords? – rene Dec 13 '13 at 18:24
  • LINQ eventually comes down to a for loop (in my understanding). I would guess there might be ways to optimize this by limiting the number of items you are looping over. By the way you say, "it isn't possible to do this within a query" -- Maybe not in a single query, but I have a feeling there must be a way to use SQL to help you. That in mind, maybe finding and testing a query approach might be worth it? – drew_w Dec 13 '13 at 18:25
  • @drew_w this comment is just wrong. LINQ does not "come down to a for loop" – Hogan Dec 13 '13 at 18:26
  • @Hogan According to [this stack](http://stackoverflow.com/questions/3894598/how-is-linq-compiled-into-the-cil) the LINQ lambda is compiled to a function, then that is expanded into the appropriate looping syntax (IEnumerable eventually resolves to a loop as well). Is my understanding of this incorrect? – drew_w Dec 13 '13 at 18:28
  • To my understanding the reason people like LINQ is for readability as well as because the compiled lambda isn't called until the variable is accessed. – drew_w Dec 13 '13 at 18:29
  • @drew_w - it is incorrect because of the optimizations put in (hashing on operands makes it much faster than standard operations) and the lazy evaluation. Hard to call lazy evaluation a "loop" in the traditional sense. – Hogan Dec 13 '13 at 18:31
  • @Hogan Technically then, you are saying this comes down to a loop in a moderately well written function. Don't get me wrong - I really like LINQ, I just think its worth remembering that it still has to loop sooner or later, so limiting the number of results that it has to process is absolutely going to affect the performance. That was the point of me talking about it being a loop. Hope that makes sense! – drew_w Dec 13 '13 at 18:33
  • This is exactly the kind of optimization that will work in this case -- the user wants a hash lookup for existence -- if he asks LINQ correctly he will probably see about 3 orders of magnitude speed increase over a traditional lookup table with a loop. – Hogan Dec 13 '13 at 18:34
  • @drew_w - No, technically it does not come down to a loop; it is a given we are talking technically, this is StackOverflow.com. Would you call a JOIN performed on an SQL Server a loop? No, you have to look at the execution plan and see what it did. LINQ is using the same technology: when it can, it does a hashed lookup and gets O(1) performance as opposed to the O(N) performance of a loop, so you WON'T see a big change in performance based on the number of elements. – Hogan Dec 13 '13 at 18:40
  • @Hogan You mention hashing but creating a hash map/table of the data takes time and a loop as well. I don't see how LINQ could possibly pull down data from the database and never iterate over it at least once. Can you provide me some documentation showing this? I'm far from an expert at LINQ so I'm really just honestly looking to figure out how this works. I have been and will continue to search google as well to see what I can find! – drew_w Dec 13 '13 at 18:43
  • In fact, maybe I will move this to a question instead of just having a conversation in the comments! : ) – drew_w Dec 13 '13 at 18:44
  • @drew_w - my email is in my profile, email me and I will send you some examples you can play with. – Hogan Dec 13 '13 at 18:45
  • @Hogan I couldn't find much via google but I want to share with others in the community so I went ahead and asked [a new question](http://stackoverflow.com/questions/20573950/what-is-linq-actually-compiled-to). If you are still interested in sharing, lets move the conversation there! : ) – drew_w Dec 13 '13 at 18:57
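The hashed-lookup point from the comment thread can be shown in plain LINQ to Objects. A minimal sketch (names are illustrative, not from the thread): wrapping the exclusion list in a HashSet&lt;int&gt; makes each Contains check O(1) on average, so the filter costs roughly O(N + M) instead of O(N × M):

```csharp
using System.Collections.Generic;
using System.Linq;

static class HashSetFilter
{
    // Build the exclusion set once; each Contains lookup is then O(1)
    // on average, so the whole filter is O(N + M) rather than O(N * M).
    public static List<int> ExcludeIds(IEnumerable<int> itemIds,
                                       IEnumerable<int> excludedRecords)
    {
        var excluded = new HashSet<int>(excludedRecords);
        return itemIds.Where(id => !excluded.Contains(id)).ToList();
    }
}
```

Note this only applies once the data is in memory; against an IQueryable the filter must still be translated to SQL, which is what the answers below discuss.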

2 Answers


From the shape of your query, I take excludedRecords to be a list of integers. Further, since you tag LINQ to Entities, I take results to be a DbSet in a DbContext.

This is the problem of combining a local list (excludedRecords) with an IQueryable that waits to be translated into SQL (results). For EF to translate the complete expression (your query) into SQL, it has to turn the local list into "something" that can be part of a SQL statement. With All(), as with many other set-based LINQ operators and with joins against the local list, EF does this by building a temp table (of sorts) out of UNIONed single-row tables. With only 5 elements in the local list, this looks like

SELECT ...
    FROM [dbo].[Table] AS [Extent1]
    WHERE  EXISTS (SELECT 
        1 AS [C1]
        FROM  (SELECT 
            1 AS [C1]
            FROM  ( SELECT 1 AS X ) AS [SingleRowTable1]
        UNION ALL
            SELECT 
            2 AS [C1]
            FROM  ( SELECT 1 AS X ) AS [SingleRowTable2]
        UNION ALL
            SELECT 
            3 AS [C1]
            FROM  ( SELECT 1 AS X ) AS [SingleRowTable3]
        UNION ALL
            SELECT 
            4 AS [C1]
            FROM  ( SELECT 1 AS X ) AS [SingleRowTable4]
        UNION ALL
            SELECT 
            5 AS [C1]
            FROM  ( SELECT 1 AS X ) AS [SingleRowTable5]) AS [UnionAll4]
        WHERE ([Extent1].[Id] = [UnionAll4].[C1]) OR (CASE WHEN ([Extent1].[Id] <> [UnionAll4].[C1]) THEN cast(1 as bit) WHEN ([Extent1].[Id] = [UnionAll4].[C1]) THEN cast(0 as bit) END IS NULL)
    )

Although this potentially generates huge SQL statements, it's still workable when the local list doesn't contain "too many" elements (let's say, up to 1000).

The only statement that allows EF to use the local list more efficiently is Contains. Contains can easily be translated into a SQL IN statement. If we rewrite your query to the equivalent with Contains, which is also the answer to your question, ...

results = from q1 in results 
          where !excludedRecords.Contains(q1.ItemId)
          select q1;

... the SQL query will look like

SELECT ...
    FROM [dbo].[Table] AS [Extent1]
    WHERE  NOT ([Extent1].[Id] IN (1, 2, 3, 4, 5))

The IN statement can handle more elements than this "temp table", although this number is still limited (maybe 3000).
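If the list outgrows what a single IN clause handles comfortably, one common workaround (a sketch under that assumption, not part of the original answer) is to split the ids into batches and chain one Where per batch: chaining Where ANDs the per-batch NOT IN conditions together, which is equivalent to NOT IN over the whole list. A minimal batching helper (.NET 6+ ships Enumerable.Chunk, which does the same thing):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class BatchingHelpers
{
    // Split a large id list into batches small enough for a SQL IN clause.
    public static IEnumerable<List<int>> InBatches(this IReadOnlyList<int> ids, int size)
    {
        for (int i = 0; i < ids.Count; i += size)
            yield return ids.Skip(i).Take(size).ToList();
    }
}
```

Usage would look roughly like: `foreach (var batch in excludedRecords.InBatches(1000)) { var local = batch; results = results.Where(r => !local.Contains(r.ItemId)); }` — copying the batch to a local variable avoids the captured-loop-variable pitfall in older C# versions.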

Gert Arnold

This is just a fragment of the code, but it looks like you have two lists: results and excludedRecords. For each element in results you iterate over all the elements in excludedRecords. That is why it is slow: it is O(N × M).

LINQ and SQL solve this with joins. If you join (or the equivalent) you should see much better performance, since that will be something like O(N log M).

It would look something like this (can't test it right now)

var results2 = from q1 in results
               join x in excludedRecords on q1.ItemId equals x into joined
               where !joined.Any()
               select q1;
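Against plain in-memory lists (LINQ to Objects) this left anti-join pattern does work, since GroupJoin hashes the inner list. A self-contained sketch with hypothetical sample data (the Item type below is illustrative, not from the question):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class Item
{
    public int ItemId;
    public string Name;
}

static class AntiJoinDemo
{
    // Left anti-join: group-join against the exclusion list and keep only
    // the rows that found no match. GroupJoin hashes the inner sequence,
    // so this runs in roughly O(N + M) rather than O(N * M).
    public static List<Item> Exclude(List<Item> results, List<int> excludedRecords)
    {
        var query = from q1 in results
                    join x in excludedRecords on q1.ItemId equals x into joined
                    where !joined.Any()
                    select q1;
        return query.ToList();
    }
}
```

As the comments below point out, this does not carry over directly to an IQueryable from Entity Framework.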
Hogan
  • Thanks for the comments Hogan. Running the above against the results throws a stack overflow. The results object is of type IQueryable – Nick Dec 13 '13 at 18:41
  • @Nick - I'd have to test myself to answer -- I probably have a typo, but you get the idea, do left join and only take the ones that are null like in SQL. I'm sure there are some examples if you google. Can't test right now, sadly. – Hogan Dec 13 '13 at 18:43
  • The problem is you can't join an in-memory list with an `IQueryable` coming from Entity Framework. – Gert Arnold Dec 13 '13 at 19:19