As other answers have already pointed out, from .NET 6 onwards there is an Enumerable.Chunk extension method. Unfortunately (in my opinion), this method returns an IEnumerable<T[]>, which undermines the memory-conservation benefit of processing an IEnumerable<T> one element at a time:
public IEnumerable<HugeObject> CreateHugeObjects(int count) {
    for (var i = 0; i < count; ++i) {
        yield return new HugeObject(i);
    }
}

public static int AggregateSomehow(IEnumerable<HugeObject> chunk) {
    return 0;
}

public void Consume() {
    var source = CreateHugeObjects(1000);
    var chunks = source.Chunk(100);
    var result = chunks.Select(AggregateSomehow);
}
In this example, the underlying type of chunk in AggregateSomehow will be a HugeObject[] of length 100, meaning that 100 instances of HugeObject have to be loaded into memory simultaneously to perform the method call.
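The materialization is easy to observe with small values; here is a self-contained sketch using ints in place of the hypothetical HugeObject:

```csharp
using System;
using System.Linq;

class ChunkDemo
{
    static void Main()
    {
        // Chunk yields fully materialized arrays: all elements of a
        // chunk exist in memory at once before the consumer sees it.
        foreach (var chunk in Enumerable.Range(1, 10).Chunk(4))
        {
            Console.WriteLine($"{chunk.GetType().Name}: [{string.Join(", ", chunk)}]");
        }
        // Int32[]: [1, 2, 3, 4]
        // Int32[]: [5, 6, 7, 8]
        // Int32[]: [9, 10]
    }
}
```

In exchange for the memory cost, Chunk does guarantee source order and that only the final chunk can be shorter than the requested size.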
Before the availability of Enumerable.Chunk, I used to write my own extension named Partition like so:
public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> source, int size) {
    if (source == null) {
        throw new ArgumentNullException(nameof(source));
    }
    if (size < 1) {
        throw new ArgumentOutOfRangeException(nameof(size));
    }
    using var e = source.GetEnumerator();
    while (e.MoveNext()) {
        yield return new Partitioner<T>(e, size);
    }
}
private class Partitioner<T> : IEnumerable<T>
{
    private class PartitionerEnumerator : IEnumerator<T>
    {
        private readonly IEnumerator<T> m_Source;
        private readonly int m_Size;
        private int m_Index = -1;
        private bool m_Disposed = false;

        public PartitionerEnumerator(IEnumerator<T> source, int size) {
            m_Source = source;
            m_Size = size;
        }

        public T Current => m_Source.Current;
        object IEnumerator.Current => Current;

        public void Dispose() {
            if (!m_Disposed) {
                m_Disposed = true;
                // Forward the shared enumerator past any elements of this
                // partition that were not consumed.
                while (++m_Index < m_Size && m_Source.MoveNext()) { }
            }
        }

        public bool MoveNext() {
            if (m_Index == -1) {
                // The outer loop in Partition already called MoveNext(),
                // so the first element is ready to be read.
                ++m_Index;
                return true;
            } else {
                return ++m_Index < m_Size && m_Source.MoveNext();
            }
        }

        public void Reset() => throw new NotSupportedException();
    }

    private readonly PartitionerEnumerator m_Enumerator;

    public Partitioner(IEnumerator<T> source, int size) {
        m_Enumerator = new PartitionerEnumerator(source, size);
    }

    // The same enumerator instance is returned on every call,
    // so each partition can only be enumerated once.
    public IEnumerator<T> GetEnumerator() => m_Enumerator;
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
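With that in place, usage mirrors the Chunk example (a sketch assuming the Partition extension above and the hypothetical CreateHugeObjects/AggregateSomehow from the first snippet). One caveat: the caller must actually enumerate (or dispose) each partition so that the shared enumerator advances.

```csharp
public void Consume() {
    var source = CreateHugeObjects(1000);
    foreach (var partition in source.Partition(100)) {
        // partition is a lazy sequence backed by the shared enumerator;
        // elements are created one at a time, never 100 at once.
        var value = AggregateSomehow(partition);
    }
}
```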
This approach takes into account three considerations:
- The original source is only enumerated once (which is often missed by Skip/Take implementations).
- Within normal nested/chained LINQ expressions, only one element at a time has to be in memory (which the built-in Chunk implementation gives up by materializing each chunk as an array).
- When a partition is only partly enumerated and then disposed prematurely, the PartitionerEnumerator.Dispose method makes sure that the underlying enumerator is still forwarded past the remaining elements of that partition (which is often missed by nested-loop approaches):
public static IEnumerable<IEnumerable<T>> PartitionWrong<T>(this IEnumerable<T> source, int size) {
    if (source == null) {
        throw new ArgumentNullException(nameof(source));
    }
    if (size < 1) {
        throw new ArgumentOutOfRangeException(nameof(size));
    }
    static IEnumerable<T> EnumeratePartition(IEnumerator<T> e, int size) {
        var i = 0;
        do {
            yield return e.Current;
        } while (++i < size && e.MoveNext());
    }
    using (var e = source.GetEnumerator()) {
        while (e.MoveNext()) {
            yield return EnumeratePartition(e, size);
        }
    }
}
This approach will work if all sub-sequences are fully enumerated, e.g. by calling Count or Sum on them, but it fails for partial enumeration such as calling First on them:
var source = new int[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
source.PartitionWrong(3).Select(c => c.Count()); // 3, 3, 3
source.PartitionWrong(3).Select(c => c.Sum()); // 6, 15, 24
source.PartitionWrong(3).Select(c => c.First()); // 1, 2, 3, 4, 5, 6, 7, 8, 9 but should be 1, 4, 7
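For contrast, the streaming Partition above handles the partial-enumeration case as well, because Dispose forwards the shared enumerator past each partition's unread elements (assuming the extension from earlier is in scope and the results are enumerated in order):

```csharp
source.Partition(3).Select(c => c.Count()); // 3, 3, 3
source.Partition(3).Select(c => c.Sum());   // 6, 15, 24
source.Partition(3).Select(c => c.First()); // 1, 4, 7
```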
My implementation will work for all of the above, but it still has several shortcomings. These weren't relevant for my applications, but the first two are probably the reason why the .NET team chose "the easy way out" and used an array that gets filled immediately:
- If you save all Partition objects first and then process them round-robin, one element at a time, the order of the original IEnumerable is not preserved across the partitions, i.e. the first partition is not guaranteed to contain the first three elements. As a side effect, if the number of elements does not divide evenly by the partition size, it is "random" which partition is shorter than size. It is not even guaranteed that only one partition is shorter.
- Using this in a parallelized context suffers from the same issues as (1), but TBH I never even looked into the thread safety of my code.
- Aborting enumeration early (like calling Any or All on the sub-sequences) will not prevent the remaining elements of the currently enumerated Partition from being created (although this is obviously true for Chunk as well, where all elements of a chunk are created upon entering it).
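Shortcoming (1) shows up even without any parallelism: materializing the outer sequence first advances the shared enumerator once per yielded partition before any partition is read (a sketch assuming the Partition extension above):

```csharp
var parts = Enumerable.Range(1, 9).Partition(3).ToList();
// The outer while loop has now called MoveNext() nine times, so parts
// holds nine partitions sharing an already-advanced enumerator -
// not the three partitions [1,2,3], [4,5,6], [7,8,9] one might expect.
Console.WriteLine(parts.Count); // 9
```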
So in a nutshell: if you are not planning on using parallelization, aren't dependent on ordered processing, and run into a memory problem when using .NET 6's Chunk, my old code might be helpful to you.