16

I'm looking for an elegant way to execute a Contains() statement in a scalable way. Please allow me to give some background before I come to the actual question.

The IN statement

In Entity Framework and LINQ to SQL the Contains statement is translated as a SQL IN statement. For instance, from this statement:

var ids = Enumerable.Range(1,10);
var courses = Courses.Where(c => ids.Contains(c.CourseID)).ToList();

Entity Framework will generate

SELECT 
    [Extent1].[CourseID] AS [CourseID], 
    [Extent1].[Title] AS [Title], 
    [Extent1].[Credits] AS [Credits], 
    [Extent1].[DepartmentID] AS [DepartmentID]
    FROM [dbo].[Course] AS [Extent1]
    WHERE [Extent1].[CourseID] IN (1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

Unfortunately, the In statement is not scalable. As per MSDN:

Including an extremely large number of values (many thousands) in an IN clause can consume resources and return errors 8623 or 8632

which has to do with running out of resources or exceeding expression limits.

But before these errors occur, the IN statement becomes increasingly slow with growing numbers of items. I can't find documentation about its growth rate, but it performs well up to a few thousands of items, but beyond that it gets dramatically slow. (Based on SQL Server experiences).

Scalable

We can't always avoid this statement. A JOIN with the source data in stead would generally perform much better, but that's only possible when the source data is in the same context. Here I'm dealing with data coming from a client in a disconnected scenario. So I have been looking for a scalable solution. A satisfactory approach turned out to be cutting the operation into chunks:

var courses = ids.ToChunks(1000)
                 .Select(chunk => Courses.Where(c => chunk.Contains(c.CourseID)))
                 .SelectMany(x => x).ToList();

(where ToChunks is this little extension method).

This executes the query in chunks of 1000 that all perform well enough. With e.g. 5000 items, 5 queries will run that together are likely to be faster than one query with 5000 items.

But not DRY

But of course I don't want to scatter this construct all over my code. I am looking for an extension method by which any IQueryable<T> can be transformed into a chunky executing statement. Ideally something like this:

var courses = Courses.Where(c => ids.Contains(c.CourseID))
              .AsChunky(1000)
              .ToList();

But maybe this

var courses = Courses.ChunkyContains(c => c.CourseID, ids, 1000)
              .ToList();

I've given the latter solution a first shot:

public static IEnumerable<TEntity> ChunkyContains<TEntity, TContains>(
    this IQueryable<TEntity> query, 
    Expression<Func<TEntity,TContains>> match, 
    IEnumerable<TContains> containList, 
    int chunkSize = 500)
{
    return containList.ToChunks(chunkSize)
               .Select (chunk => query.Where(x => chunk.Contains(match)))
               .SelectMany(x => x);
}

Obviously, the part x => chunk.Contains(match) doesn't compile. But I don't know how to manipulate the match expression into a Contains expression.

Maybe someone can help me make this solution work. And of course I'm open to other approaches to make this statement scalable.

Gert Arnold
  • 105,341
  • 31
  • 202
  • 291

5 Answers5

14

I’ve solved this problem with a little different approach a view month ago. Maybe it’s a good solution for you too.

I didn’t want my solution to change the query itself. So a ids.ChunkContains(p.Id) or a special WhereContains method was unfeasible. Also should the solution be able to combine a Contains with another filter as well as using the same collection multiple times.

db.TestEntities.Where(p => (ids.Contains(p.Id) || ids.Contains(p.ParentId)) && p.Name.StartsWith("Test"))

So I tried to encapsulate the logic in a special ToList method that could rewrite the Expression for a specified collection to be queried in chunks.

var ids = Enumerable.Range(1, 11);
var result = db.TestEntities.Where(p => Ids.Contains(p.Id) && p.Name.StartsWith ("Test"))
                                .ToChunkedList(ids,4);

To rewrite the expression tree I discovered all Contains Method calls from local collections in the query with a view helping classes.

private class ContainsExpression
{
    public ContainsExpression(MethodCallExpression methodCall)
    {
        this.MethodCall = methodCall;
    }

    public MethodCallExpression MethodCall { get; private set; }

    public object GetValue()
    {
        var parent = MethodCall.Object ?? MethodCall.Arguments.FirstOrDefault();
        return Expression.Lambda<Func<object>>(parent).Compile()();
    }

    public bool IsLocalList()
    {
        Expression parent = MethodCall.Object ?? MethodCall.Arguments.FirstOrDefault();
        while (parent != null) {
            if (parent is ConstantExpression)
                return true;
            var member = parent as MemberExpression;
            if (member != null) {
                parent = member.Expression;
            } else {
                parent = null;
            }
        }
        return false;
    }
}

private class FindExpressionVisitor<T> : ExpressionVisitor where T : Expression
{
    public List<T> FoundItems { get; private set; }

    public FindExpressionVisitor()
    {
        this.FoundItems = new List<T>();
    }

    public override Expression Visit(Expression node)
    {
        var found = node as T;
        if (found != null) {
            this.FoundItems.Add(found);
        }
        return base.Visit(node);
    }
}

public static List<T> ToChunkedList<T, TValue>(this IQueryable<T> query, IEnumerable<TValue> list, int chunkSize)
{
    var finder = new FindExpressionVisitor<MethodCallExpression>();
    finder.Visit(query.Expression);
    var methodCalls = finder.FoundItems.Where(p => p.Method.Name == "Contains").Select(p => new ContainsExpression(p)).Where(p => p.IsLocalList()).ToList();
    var localLists = methodCalls.Where(p => p.GetValue() == list).ToList();

If the local collection passed in the ToChunkedList method was found in the query expression, I replace the Contains call to the original list with a new call to a temporary list containing the ids for one batch.

if (localLists.Any()) {
    var result = new List<T>();
    var valueList = new List<TValue>();

    var containsMethod = typeof(Enumerable).GetMethods(BindingFlags.Static | BindingFlags.Public)
                        .Single(p => p.Name == "Contains" && p.GetParameters().Count() == 2)
                        .MakeGenericMethod(typeof(TValue));

    var queryExpression = query.Expression;

    foreach (var item in localLists) {
        var parameter = new List<Expression>();
        parameter.Add(Expression.Constant(valueList));
        if (item.MethodCall.Object == null) {
            parameter.AddRange(item.MethodCall.Arguments.Skip(1));
        } else {
            parameter.AddRange(item.MethodCall.Arguments);
        }

        var call = Expression.Call(containsMethod, parameter.ToArray());

        var replacer = new ExpressionReplacer(item.MethodCall,call);

        queryExpression = replacer.Visit(queryExpression);
    }

    var chunkQuery = query.Provider.CreateQuery<T>(queryExpression);


    for (int i = 0; i < Math.Ceiling((decimal)list.Count() / chunkSize); i++) {
        valueList.Clear();
        valueList.AddRange(list.Skip(i * chunkSize).Take(chunkSize));

        result.AddRange(chunkQuery.ToList());
    }
    return result;
}
// if the collection was not found return query.ToList()
return query.ToList();

Expression Replacer:

private class ExpressionReplacer : ExpressionVisitor {

    private Expression find, replace;

    public ExpressionReplacer(Expression find, Expression replace)
    {
        this.find = find;
        this.replace = replace;
    }

    public override Expression Visit(Expression node)
    {
        if (node == this.find)
            return this.replace;

        return base.Visit(node);
    }
}
codeworx
  • 2,715
  • 13
  • 12
  • This is a great piece of work! You should share it on Github or Codeplex or something. It's the closest to what I thought "ideal" so I marked it as the answer. The only part that feels a bit unnatural is having to pass the list again in the `ToChunkedList` method, but I don't see how this could be avoided. The ability to use the `Contains` multiple times is brilliant. – Gert Arnold Jul 03 '14 at 07:24
3

Please allow me to provide an alternative to the Chunky approach.

The technique involving Contains in your predicate works well for:

  • A constant list of values (no volatile).
  • A small list of values.

Contains will do great if your local data has those two characteristics because these small set of values will be hardcoded in the final SQL query.

The problem begins when your list of values has entropy (non-constant). As of this writing, Entity Framework (Classic and Core) do not try to parameterize these values in any way, this forces SQL Server to generate a query plan every time it sees a new combination of values in your query. This operation is expensive and gets aggravated by the overall complexity of your query (e.g. many tables, a lot of values in the list, etc.).

The Chunky approach still suffers from this SQL Server query plan cache pollution problem, because it does not parametrizes the query, it just moves the cost of creating a big execution plan into smaller ones that are more easy to compute (and discard) by SQL Server, furthermore, every chunk adds an additional round-trip to the database, which increases the time needed to resolve the query.

An Efficient Solution for EF Core

NEW! QueryableValues EF6 Edition has arrived! For EF Core keep reading below.

Wouldn't it be nice to have a way of composing local data in your query in a way that's SQL Server friendly? Enter QueryableValues.

I designed this library with these two main goals:

  • It MUST solve the SQL Server's query plan cache pollution problem ✅
  • It MUST be fast!

It has a flexible API that allows you to compose local data provided by an IEnumerable<T> and you get back an IQueryable<T>; just use it as if it were another entity of your DbContext (really), e.g.:

// Sample values.
IEnumerable<int> values = Enumerable.Range(1, 1000);

// Using a Join (query syntax).
var query1 = 
    from e in dbContext.MyEntities
    join v in dbContext.AsQueryableValues(values) on e.Id equals v 
    select new
    {
        e.Id,
        e.Name
    };

// Using Contains (method syntax)
var query2 = dbContext.MyEntities
    .Where(e => dbContext.AsQueryableValues(values).Contains(e.Id))
    .Select(e => new
    {
        e.Id,
        e.Name
    });

You can also compose complex types!

It goes without saying that the provided IEnumerable<T> is only enumerated at the time that your query is materialized (not before), preserving the same behavior of EF Core in this regard.

How Does It Works?

Internally QueryableValues creates a parameterized query and provides your values in a serialized format that is natively understood by SQL Server. This allows your query to be resolved with a single round-trip to the database and avoids creating a new query plan on subsequent executions due to the parameterized nature of it.

Useful Links

QueryableValues is distributed under the MIT license

yv989c
  • 1,453
  • 1
  • 14
  • 20
  • @Gert I wasn't sure about the format for my answer because this problem overlaps different use cases and I wanted to avoid duplication using the link. I'll take your advice and tailor a better answer here and in that other question. Thanks for the feedback. – yv989c Jan 20 '22 at 06:32
  • @GertArnold I made the changes to my answer. I hope it provides more value to the question now. – yv989c Jan 21 '22 at 06:27
  • Yes, that's better. Although it's not a solution for the original problem it may be helpful in preventing collateral damage. Worth investigating. – Gert Arnold Jan 21 '22 at 07:55
  • @GertArnold Would you mind to elaborate why it isn't a solution to the original problem? I believe it literally addresses these two points in particular: - "I'm looking for an elegant way to execute a `Contains()` statement in a scalable way." - "A JOIN with the source data in stead would generally perform much better, but that's only possible when the source data is in the same context." – yv989c Jan 21 '22 at 15:04
  • Oh, it's just that the question was written a non-core version of EF (EF-core didn't even exist back then), which is still used in this application. – Gert Arnold Jan 21 '22 at 15:18
  • @GertArnold Ah! fair enough. I'll come back later and update the question with an approach that will work under EF 6 classic for the specific use case in your question. It won't be as elegant due to limitations on that framework but it will have the same performance qualities as QueryableValues. – yv989c Jan 21 '22 at 15:23
  • Hi @GertArnold. I added the following [answer](https://stackoverflow.com/a/70819284/2206145) as you suggested. Your feedback is appreciated. – yv989c Jan 23 '22 at 05:49
  • Just wondering, could you use a table variable instead of this XML string? – Gert Arnold Jan 23 '22 at 10:23
  • Sure. The trade-off would be that now you must separately deploy the type for that table variable to the database. I will perform some benchmarks and if the table variable approach performs significantly well vs XML I can provide that as an option that you can specify at set-up time. Currently the library is self-contained (no external dependencies) so it's easy to use, which was one of my goals too. – yv989c Jan 23 '22 at 15:28
  • Just to clarify: We only need to deploy a single type for this TVP and that will cover ALL the use cases (simple and complex types). I already have a strategy to make that work. I need to see how ADO.net moves the data from that TVP to the SqlCommand... I remember exploring that in the past and it wasn't pretty... The benchmarks will tell! Thanks for the perspective! – yv989c Jan 23 '22 at 15:41
  • I mean something like `DECLARE @tableVar TABLE (Column1 INT, Column2 INT, ...` and use that in the subsequent SQL statement, much like your current xml variable `@px`, and a generated `INSERT` into the table variable, much like your current `SET @px = ...`. I.e. not a pre-defined table *type*, just a *variable*. – Gert Arnold Jan 23 '22 at 15:48
  • But I guess the problem is how to use it in the query as a parameter that EF will understand. – Gert Arnold Jan 23 '22 at 15:51
  • I see. Sadly I believe that path will take you to a dead end. We need to explore solutions that takes advantage of how adonet natively provides parameters data to `sp_executesql`. Anything that just modifies the SQL provided to that sproc will cause plan cache pollution. To properly do what you are proposing we need a way to hack into how `SqlCommand` does this.. not sure this is the way because this is an abstraction below EF. – yv989c Jan 23 '22 at 16:38
  • I posted the benchmarks results... then I realized that I swapped the name of the methods and there were some inconsistencies in the queries... – yv989c Jan 26 '22 at 04:49
  • @GertArnold It may be worth it but I'm not convinced yet. I did a benchmark between XML and TVP using Int32 (plain ADO.NET). On my machine at 1024 items is when the TVP approach starts having some advantage over XML. It seems that at that point the cost of deserializing the XML is greater than the latency required to build and send the TVP parameter data; the way of how ADO.NET builds the parameters for the TVP type is too chatty (my conclusion so far). I need to introduce network latency to see the impact of it. Here are the [preliminary results](https://pastebin.com/0pffpp8B). – yv989c Jan 26 '22 at 05:18
  • Good job. I have cases where the number of items greatly exceeds 1024 (hence the chunking). It may be worth giving developers the option to choose the best strategy. – Gert Arnold Jan 26 '22 at 07:48
  • UPDATE: I provided a solution for EF 6 (non-core) [here](https://stackoverflow.com/a/70587979/2206145). – yv989c Jul 21 '22 at 04:59
  • @GertArnold sir! Let me introduce you to [QueryableValues `EF6 Edition`!](https://github.com/yv989c/BlazarTech.QueryableValues.EF6) – yv989c Sep 04 '22 at 18:08
  • I saw it in one of your other answers. Also already upvoted, so running out of options to upvote your work! – Gert Arnold Sep 05 '22 at 14:12
2

Linqkit to the rescue! Might be a better way that does it directly, but this seems to work fine and makes it pretty clear what's being done. The addition being AsExpandable(), which lets you use the Invoke extension.

using LinqKit;

public static IEnumerable<TEntity> ChunkyContains<TEntity, TContains>(
    this IQueryable<TEntity> query, 
    Expression<Func<TEntity,TContains>> match, 
    IEnumerable<TContains> containList, 
    int chunkSize = 500)
{
    return containList
            .ToChunks(chunkSize)
            .Select (chunk => query.AsExpandable()
                                   .Where(x => chunk.Contains(match.Invoke(x))))
            .SelectMany(x => x);
}

You might also want to do this:

containsList.Distinct()
            .ToChunks(chunkSize)

...or something similar so you don't get duplicate results if something this occurs:

query.ChunkyContains(x => x.Id, new List<int> { 1, 1 }, 1);
Ocelot20
  • 10,510
  • 11
  • 55
  • 96
1

Another way would be to build the predicate this way (of course, some parts should be improved, just giving the idea).

public static Expression<Func<TEntity, bool>> ContainsPredicate<TEntity, TContains>(this IEnumerable<TContains> chunk, Expression<Func<TEntity, TContains>> match)
        {
            return Expression.Lambda<Func<TEntity, bool>>(Expression.Call(
                typeof (Enumerable),
                "Contains",
                new[]
                {
                    typeof (TContains)
                },
                Expression.Constant(chunk, typeof(IEnumerable<TContains>)), match.Body),
                match.Parameters);
        }

which you could call in your ChunkContains method

return containList.ToChunks(chunkSize)
               .Select(chunk => query.Where(ContainsPredicate(chunk, match)))
               .SelectMany(x => x);
Raphaël Althaus
  • 59,727
  • 6
  • 96
  • 122
0

Using a stored procedure with a table valued parameter could also work well. You in effect write a joint In the stored procedure between your table / view and the table valued parameter.

https://learn.microsoft.com/en-us/dotnet/framework/data/adonet/sql/table-valued-parameters

Willsy
  • 1
  • 1