0

I am using Entity Framework and frequently run into issue where I want to iterate through large numbers of records. My issue is that if I pull them all at once, I risk a time out; if I pull one at a time, literally every single record will be a separate query and it takes forever.

I want to implement a Linq extension that pulls the results in batches but can still be used as an IEnumerable. I would give it a set of keys (most likely the primary IDs of whatever records I'm pulling), a batch size (higher for simple objects, lower for complex objects), and a Func that defines how to apply a set of keys to a set of record type T. I would call it like this:

//get the list of items to pull--in this case, a set of order numbers
List<int> orderNumbers = GetOrderNumbers();

//set the batch size
int batchSize = 100;

//loop through the set using BatchedSelector extension. Note the selection
//function at the end which allows me to 
foreach (var order in dbContext.Orders.BatchedSelector(repairNumbers, batchSize, (o, k) => k.Contains(o.OrderNumber)))
{
    //do things
}

Here's my draft solution:

    /// <summary>
    /// A Linq extension that fetches IEnumerable results in batches, aggregating queries
    /// to improve EF performance. Operates transparently to application and acts like any
    /// other IEnumerable.
    /// </summary>
    /// <typeparam name="T">Header record type</typeparam>
    /// <param name="source">Full set of records</param>
    /// <param name="keys">The set of keys that represent specific records to pull</param>
    /// <param name="selector">Function that filters the result set to only those which match the key set</param>
    /// /// <param name="maxBatchSize">Maximum number of records to pull in one query</param>
    /// <returns></returns>
    public static IEnumerable<T> BatchedSelector<T>(this IEnumerable<T> source, IEnumerable<int> keys, Func<T, IEnumerable<int>, bool> selector, int maxBatchSize)
    {
        //the index of the next key (or set of keys) to process--we start at 0 of course
        int currentKeyIndex = 0;             

        //to provide some resiliance, we will allow the batch size to decrease if we encounter errors
        int currentBatchSize = maxBatchSize;
        int batchDecreaseAmount = Math.Max(1, maxBatchSize / 10); //10%, but at least 1

        //other starting variables; a list to hold results and the associated batch of keys
        List<T> resultList = null;
        IEnumerable<int> keyBatch = null;

        //while there are still keys remaining, grab the next set of keys
        while ((keyBatch = keys.Skip(currentKeyIndex).Take(currentBatchSize)).Count() > 0)
        {
            //try to fetch the results
            try
            {
                resultList = source.Where(o => selector(o, keyBatch)).ToList();  // <-- this is where errors occur
                currentKeyIndex += maxBatchSize;  //increment key index to mark these keys as processed
            }
            catch
            {
                //decrease the batch size for our retry
                currentBatchSize -= batchDecreaseAmount;

                //if we've run out of batch overhead, throw the error
                if (currentBatchSize <= 0) throw;

                //otherwise, restart the loop
                continue;
            }

            //since we've successfully gotten the set of keys, yield the results
            foreach (var match in resultList) yield return match;
        }

        //the loop is over; we're done
        yield break;
    }

For some reason, the "where" clause has no effect. I've validated that the correct keys are in keyBatch, but the expected WHERE OrderNumber IN (k1, k2, k3, kn) line is not there. It is as if I didn't have the where statement at all.

My best guess is that I need to build the expression and compile it, but I'm not sure if that's the problem and I'm not really sure how to go about fixing it. Would love any input. Thanks!

NinjaDeveloper
  • 1,620
  • 3
  • 19
  • 51
Daniel
  • 1,695
  • 15
  • 33

2 Answers2

1

Where, Skip, Take and all this kind of methods are extensions methods, not members of IEnumerable<T>. For all these methods are actually 2 versions, one for IEnumerable<> and one for IQueryable<>.

Enumerable extensions

  • Where(Func<TSource, bool> predicate)
  • Select(Func<TSource, TResult> selector)

Queryable extensions

  • Where(Expression<Func<TSource, bool>> predicate)
  • Select(Expression<Func<TSource, TResult>> predicate)

As you can see the difference is that Queryable extensions take an Expression<> instead of a direct delegate. These expression is what allows EF to transform your code to SQL.

Since you are declaring your variables/parameters in BatchedSelector() method as IEnumerable<> your are using the extensions in Enumerable class, and this extensions are executed in memory.

A common mistake is think that due to polymorphism, a DbSet (IQueryable<>) no matter if you use it as IEnumerable<> the queries will be translated to SQL, this is true only for proper members, but not for the extension methods.

Your code can be fixed changing your IEnumerable<> variables/parameters to IQueryable<>.

You can read more about the differences between IEnumerable and IQueryable here.

Community
  • 1
  • 1
Arturo Menchaca
  • 15,783
  • 1
  • 29
  • 53
0

Firstly, thank you Arturo. You set me on the right track for this solution. I went in figuring it was a Linq->Entity issue, but these problems are still far from intuitive for me to resolve.

Second, I borrowed heavily from Shimmy's answer to this question. Thanks Shimmy!

First, I updated the method to support key types other than integers, because why not. So the method signature is now (note the change to IQueryable source):

public static IEnumerable<T> BatchedSelector<T, TKey>(this IQueryable<T> source, Expression<Func<T, TKey>> selector, IEnumerable<TKey> keys, int maxBatchSize)

The method stayed substantially the same other than the line that was producing errors, which is now replaced with:

resultList = source.WhereIn(selector, keyBatch).ToList();

WhereIn is a Linq extension mostly borrowed from Shimmy:

    public static IQueryable<T> WhereIn<T, TKey>(this IQueryable<T> source, Expression<Func<T, TKey>> selector, IEnumerable<TKey> keyCollection)
    {
        if (selector == null) throw new ArgumentNullException("Null selector");
        if (keyCollection == null) throw new ArgumentNullException("Null collection");

        //if no items in collection, no results
        if (!keyCollection.Any()) return source.Where(t => false);

        //assemble expression
        var p = selector.Parameters.Single();
        var equals = keyCollection.Select(value => (Expression)Expression.Equal(selector.Body, Expression.Constant(value, typeof(TKey))));
        var body = equals.Aggregate((accumulate, equal) => Expression.Or(accumulate, equal));

        //return expression
        return source.Where(Expression.Lambda<Func<T, bool>>(body, p));
    }

This taught me something pretty cool: if you feed a where clause of a bunch of constant comparisons, it will be converted to a SQL In statement! Neat!

With those changes, the method produces results quickly and easily.

Community
  • 1
  • 1
Daniel
  • 1,695
  • 15
  • 33