107

I am developing a C# program which has an "IEnumerable users" that stores the IDs of 4 million users. I need to loop through the IEnumerable and extract a batch of 1000 IDs each time to perform some operations in another method.

How do I extract 1000 IDs at a time from the start of the IEnumerable, do something else, then fetch the next batch of 1000, and so on?

Is this possible?

Kevin Nacios
user1526912
  • Before rushing into reading the answers, please note that .NET 6 has a new `.Chunk` LINQ method. See duplicate questions for more info – Alex from Jitbit Apr 07 '22 at 14:06
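For later readers, the `.Chunk` method mentioned in that comment (added in .NET 6) does exactly this out of the box. A minimal sketch, using a small stand-in sequence for the 4 million IDs:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// .NET 6+: Enumerable.Chunk splits a sequence into arrays of at most the given size.
IEnumerable<int> userIds = Enumerable.Range(1, 4500); // stand-in for the real ids

List<int[]> batches = userIds.Chunk(1000).ToList();

foreach (int[] batch in batches)
{
    Console.WriteLine(batch.Length); // four batches of 1000, then a final one of 500
}
```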

9 Answers

170

You can use MoreLINQ's Batch operator (available from NuGet):

foreach (IEnumerable<User> batch in users.Batch(1000))
{
    // use batch
}

If taking a dependency on the library is not an option, you can reuse its implementation:

public static IEnumerable<IEnumerable<T>> Batch<T>(
        this IEnumerable<T> source, int size)
{
    T[] bucket = null;
    var count = 0;

    foreach (var item in source)
    {
        if (bucket == null)
            bucket = new T[size];

        bucket[count++] = item;

        if (count != size)
            continue;

        yield return bucket.Select(x => x);

        bucket = null;
        count = 0;
    }

    // Return the last bucket with all remaining elements
    if (bucket != null && count > 0)
    {
        Array.Resize(ref bucket, count);
        yield return bucket.Select(x => x);
    }
}

BTW, for performance you can simply return `bucket` without calling `Select(x => x)`. `Select` is optimized for arrays, but the selector delegate would still be invoked on each item. So, in your case it's better to use

yield return bucket;
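A quick demonstration of the trade-off: the identity `Select` wraps the array in an iterator, so the caller can no longer cast the returned `IEnumerable<T>` back to `T[]` and mutate the bucket (the point Servy raises in the comments below):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int[] bucket = { 1, 2, 3 };

// Returning the array directly exposes it: the caller can cast it back.
IEnumerable<int> direct = bucket;
bool directIsArray = direct is int[];   // true

// The identity Select hides the underlying array behind an iterator.
IEnumerable<int> wrapped = bucket.Select(x => x);
bool wrappedIsArray = wrapped is int[]; // false
```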
Sergey Berezovskiy
  • I was about to answer this. Don't reinvent the wheel, use MoreLinq's Batch method ;-) – Meta-Knight Mar 14 '13 at 16:11
  • @Meta-Knight yep, I use their MaxBy, DistinctBy, Batch and other methods. It's better to invent a new car than to reinvent wheels :) – Sergey Berezovskiy Mar 14 '13 at 16:13
  • It's worth noting that the `Select` is there specifically to obscure the underlying array from the caller (not that they can do much with it at that point). Most people won't have a need for it, but for a library method it helps them avoid breaking changes if they change the implementation in the future. – Servy Mar 14 '13 at 20:43
  • MoreLinq is fast and stable, use this – Erik Bergstedt Nov 13 '15 at 12:18
  • Instead of `bucket.Select(x => x)` to make it so the caller can't cast the IEnumerable back to the original T[] array, would `bucket.Skip(0)` be any better on performance? Passing 0 means it returns every item as an immutable IEnumerable sequence (it can't be cast back to T[] by the caller). – David Gunderson Mar 19 '16 at 07:35
  • This implementation actually has a small issue. In cases where `size` is greater than the number of objects in the collection, you'll get some nulls in the resulting batch. Might not be a big deal, but it would be good to address. – nb.alexiev Jan 31 '18 at 17:35
  • @nb.alexiev probably you did `bucket.Take(size)` instead of `bucket.Take(count)` on the last line – Sergey Berezovskiy Feb 02 '18 at 08:47
  • I'm missing something here: doesn't `foreach` materialize the collection, which is exactly what we'd like to avoid? – OfirD Oct 22 '18 at 10:45
  • @HeyJude nope, when you use `foreach` with an iterator method (a method which returns `IEnumerable` via yielding results), a state machine is created for the enumeration, but it does not materialize the collection. The state machine consumes the source collection step by step on each `foreach` iteration. – Sergey Berezovskiy Oct 30 '18 at 16:30
  • Added `ToArray()` there: `yield return bucket.Take(count).ToArray();` – Sergey Nudnov Nov 13 '19 at 20:08
62

Sounds like you need to use the Skip and Take methods of your object. Example:

users.Skip(1000).Take(1000)

This would skip the first 1000 and take the next 1000. You'd just need to increase the amount skipped with each call.

You could use an integer variable as the parameter for Skip, so you can adjust how much is skipped, and call it from a method:

public IEnumerable<user> GetBatch(int pageNumber)
{
    return users.Skip(pageNumber * 1000).Take(1000);
}
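A usage sketch for the method above, paging until a batch comes back empty (`users` here is a small stand-in sequence for the real IDs):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<int> users = Enumerable.Range(1, 4500); // stand-in for the real user ids

IEnumerable<int> GetBatch(int pageNumber) =>
    users.Skip(pageNumber * 1000).Take(1000);

int page = 0;
int processed = 0;
while (true)
{
    var batch = GetBatch(page).ToList();
    if (batch.Count == 0)
        break;                // no more users

    processed += batch.Count; // "do something else" with the batch here
    page++;
}
```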
Bill
  • Thanks Bill. I didn't quite get what the `pageNumber *` in `(pageNumber * 1000)` does. I understood `users.Skip(1000).Take(1000)` perfectly, and I would just need to increment skip by 1000...2000...3000 and so on – user1526912 Mar 14 '13 at 17:32
  • The pageNumber parameter would be incremented each time you need to retrieve a new batch of users. So for example, when you want to get the 3001 to 4000 batch, you'd call the method: `var usersBatch = GetBatch(3);` – Bill Mar 15 '13 at 10:15
  • It's not very efficient. You're always starting back at the beginning and walking through the previous results. More efficient would be to retain the IEnumerable where you left off. – Edward Brey Sep 14 '15 at 21:03
  • Another problem with this approach: if the source data is a consuming enumerable, you will be skipping records because they won't be returned by `users` for the second page. – ProgrammingLlama Feb 21 '23 at 02:01
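As those comments point out, `Skip`/`Take` re-walks the sequence from the start on every page. A sketch of the alternative they suggest, pulling batches from a single enumerator so the source is read exactly once (`BatchOnce` is a name made up for this example):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Pull batches from one enumerator instead of re-skipping for every page.
static IEnumerable<List<T>> BatchOnce<T>(IEnumerable<T> source, int size)
{
    using (var e = source.GetEnumerator())
    {
        while (true)
        {
            var batch = new List<T>(size);
            while (batch.Count < size && e.MoveNext())
                batch.Add(e.Current);

            if (batch.Count == 0)
                yield break;    // source exhausted

            yield return batch; // a full batch, or the final partial one
        }
    }
}

var sizes = BatchOnce(Enumerable.Range(1, 4500), 1000).Select(b => b.Count).ToList();
```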
35

The easiest way to do this is probably just to use the GroupBy method in LINQ:

var batches = myEnumerable
    .Select((x, i) => new { x, i })
    .GroupBy(p => p.i / 1000, p => p.x);
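To make the grouping concrete: the integer division `p.i / 1000` assigns items 0-999 to group 0, items 1000-1999 to group 1, and so on. A self-contained sketch with stand-in data:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var myEnumerable = Enumerable.Range(1, 4500); // stand-in data

var batches = myEnumerable
    .Select((x, i) => new { x, i })     // pair each item with its index
    .GroupBy(p => p.i / 1000, p => p.x) // key = batch number, element = original item
    .ToList();
```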

But for a more sophisticated solution, see this blog post on how to create your own extension method to do this. Duplicated here for posterity:

public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> collection, int batchSize)
{
    List<T> nextbatch = new List<T>(batchSize);
    foreach (T item in collection)
    {
        nextbatch.Add(item);
        if (nextbatch.Count == batchSize)
        {
            yield return nextbatch;
            nextbatch = new List<T>(); 
            // or nextbatch.Clear(); but see Servy's comment below
        }
    }

    if (nextbatch.Count > 0)
        yield return nextbatch;
}
Alex
p.s.w.g
  • Wouldn't that create 4 million anonymous objects (not my downvote)? Also, internally it will create a lookup for all those objects. – Sergey Berezovskiy Mar 14 '13 at 16:18
  • @lazyberezovsky yes, I suppose it would (I did say it was the easiest way, not necessarily the best). I added an alternative solution as well. – p.s.w.g Mar 14 '13 at 16:23
  • Yep, good edit. I recommend you also take a look at the morelinq implementation with an array as the bucket. I think it will be a little faster – Sergey Berezovskiy Mar 14 '13 at 16:26
  • Since the `List` is never resized in this example, isn't it the same thing? – p.s.w.g Mar 14 '13 at 16:28
  • `List` checks capacity before adding an item to its inner array, and it also increments its version after adding an item. With a simple array you don't have that overhead. – Sergey Berezovskiy Mar 14 '13 at 16:37
  • @lazyberezovsky good point. I'll upvote for morelinq :) – p.s.w.g Mar 14 '13 at 16:38
  • @p-s-w-g well, your answer also works, and I really can't tell what the performance difference would be for 4 million users without tests. Maybe 100 ms :) +1 also – Sergey Berezovskiy Mar 14 '13 at 16:43
  • Note that by not creating a new list each time and instead clearing it you force the caller to enumerate each batch before *requesting* the second. If he doesn't, when he thinks he's enumerating the first batch he'll end up with the values of the second. Also note that you save very little in terms of GC work, since clearing the list will still deallocate the internal buffer, and that's the expensive part. While it may, possibly, save you a tiny bit in a few cases, the significant loss in functionality is usually not worth it. – Servy Mar 14 '13 at 20:39
  • @Servy I hadn't considered that. I suppose that's somewhat related to the *morelinq* code's use of `bucket.Select(x => x)` – p.s.w.g Mar 14 '13 at 22:04
  • @p.s.w.g Nope, it's not. The purpose of the identity select is to prevent the user from casting the `IEnumerable` they were given to an array and then mutating it, or otherwise relying on the implementation detail that it is an array, allowing for future changes to the implementation without causing breaking changes. For anyone not publishing a public API, it's not really an issue and can be excluded, as that's just not something you need to be concerned with. – Servy Mar 15 '13 at 13:59
  • Once `GroupBy` starts enumeration, doesn't it have to fully enumerate its source? This loses lazy evaluation of the source and thus, in some cases, all of the benefit of batching! – ErikE Oct 27 '15 at 18:37
  • @ErikE Correct. That's why I offered the alternative solution which would be much better if proper lazy evaluation is a requirement. – p.s.w.g Oct 27 '15 at 19:10
22

How about:

int batchSize = 5;
List<string> collection = new List<string> { "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", "11", "12" };
for (int x = 0; x < Math.Ceiling((decimal)collection.Count / batchSize); x++)
{
    var batch = collection.Skip(x * batchSize).Take(batchSize);
}
Kabindas
18

Try using this:

public static IEnumerable<IEnumerable<TSource>> Batch<TSource>(
    this IEnumerable<TSource> source,
    int batchSize)
{
    var batch = new List<TSource>();
    foreach (var item in source)
    {
        batch.Add(item);
        if (batch.Count == batchSize)
        {
            yield return batch;
            batch = new List<TSource>();
        }
    }

    if (batch.Any()) yield return batch;
}

And to use the above function:

foreach (var list in Users.Batch(1000))
{
    // use list
}
Zaki
  • You can get a minor efficiency bump by providing the batch size to the list constructor. – Servy Mar 14 '13 at 20:42
  • This is good. Fully populating each List in this method is required in order for skipping enumeration of that list in the caller to make the next (and so on) batch have the right items in it. – ErikE Dec 30 '15 at 00:59
5

Something like this would work:

List<MyClass> batch = new List<MyClass>();
foreach (MyClass item in items)
{
    batch.Add(item);

    if (batch.Count == 1000)
    {
        // Perform operation on batch
        batch.Clear();
    }
}

// Process last batch
if (batch.Any())
{
    // Perform operation on batch
}

And you could generalize this into a generic method, like this:

static void PerformBatchedOperation<T>(IEnumerable<T> items, 
                                       Action<IEnumerable<T>> operation, 
                                       int batchSize)
{
    List<T> batch = new List<T>();
    foreach (T item in items)
    {
        batch.Add(item);

        if (batch.Count == batchSize)
        {
            operation(batch);
            batch.Clear();
        }
    }

    // Process last batch
    if (batch.Any())
    {
        operation(batch);
    }
}
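A usage sketch for the generic method (the method is repeated here verbatim so the example compiles on its own; the small stand-in sequence replaces the real 4 million IDs):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The method from the answer above, repeated so this sketch is self-contained.
static void PerformBatchedOperation<T>(IEnumerable<T> items,
                                       Action<IEnumerable<T>> operation,
                                       int batchSize)
{
    List<T> batch = new List<T>();
    foreach (T item in items)
    {
        batch.Add(item);
        if (batch.Count == batchSize)
        {
            operation(batch);
            batch.Clear();
        }
    }
    if (batch.Any())
        operation(batch);
}

var batchSizes = new List<int>();
PerformBatchedOperation(Enumerable.Range(1, 4500),
                        batch => batchSizes.Add(batch.Count()),
                        1000);
```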
JLRishe
4

You can achieve that using the Take and Skip Enumerable extension methods. For more information on usage, check out LINQ 101.
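A sketch of the Skip/Take loop this describes, using the `Skip(i*1000).Take(1000)` pattern with i = 0, 1, 2, ... and stopping when a page comes back empty (the sequence here is a stand-in for the real IDs):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var users = Enumerable.Range(1, 4500); // stand-in for the real ids
const int batchSize = 1000;
var pageSizes = new List<int>();

for (int i = 0; ; i++)
{
    var page = users.Skip(i * batchSize).Take(batchSize).ToList();
    if (page.Count == 0)
        break;                 // past the end of the sequence

    pageSizes.Add(page.Count); // process the page here
}
```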

  • Yes, but it might not be very efficient... – H H Mar 14 '13 at 16:10
  • I understand that Take can be used, but when I take 1000 records how do I take another 1000 from the last point? – user1526912 Mar 14 '13 at 16:11
  • @user1526912 - `Skip(i*1000).Take(1000)` with i = 0,1,2,... – H H Mar 14 '13 at 16:11
  • @user1526912 If you are using .NET Framework 4.0 or higher you can use [PLINQ](http://msdn.microsoft.com/en-us/library/dd460688.aspx) to take advantage of multicore processors to process it a little faster – Krishnaswamy Subramanian Mar 14 '13 at 16:18
  • This is great for LINQ-to-somequeryprovider in which Skip can be implemented efficiently. For LINQ-to-Objects this will both iterate the source sequence once for each batch, and spend a *lot* of time in `Skip`, as it is not efficient for arbitrary sequences. – Servy Mar 14 '13 at 20:41
0

You can use LINQ's Take operator.

Link: http://msdn.microsoft.com/fr-fr/library/vstudio/bb503062.aspx

Aghilas Yakoub
-1

In a streaming context, where the enumerator might block in the middle of a batch simply because a value has not been produced (yielded) yet, it is useful to have a timeout so that the last batch is produced after a given time. I used this, for example, for tailing a cursor in MongoDB. It's a little bit complicated, because the enumeration has to be done on another thread.

public static IEnumerable<List<T>> TimedBatch<T>(this IEnumerable<T> collection, double timeoutMilliseconds, long maxItems)
{
    object _lock = new object();
    List<T> batch = new List<T>();
    AutoResetEvent yieldEventTriggered = new AutoResetEvent(false);
    AutoResetEvent yieldEventFinished = new AutoResetEvent(false);
    bool yieldEventTriggering = false;

    var task = Task.Run(delegate
    {
        foreach (T item in collection)
        {
            lock (_lock)
            {
                batch.Add(item);

                if (batch.Count == maxItems)
                {
                    yieldEventTriggering = true;
                    yieldEventTriggered.Set();
                }
            }

            if (yieldEventTriggering)
            {
                yieldEventFinished.WaitOne(); // wait for the yield to finish, and the batch to be cleared
                yieldEventTriggering = false;
            }
        }
    });

    while (!task.IsCompleted)
    {
        // Wait for the event to be triggered, or the timeout to elapse
        yieldEventTriggered.WaitOne(TimeSpan.FromMilliseconds(timeoutMilliseconds));
        lock (_lock)
        {
            if (batch.Count > 0) // yield return only if the batch accumulated something
            {
                yield return batch;
                batch.Clear();
                yieldEventFinished.Set();
            }
        }
    }
    task.Wait();
}