As other answers have already pointed out, from .NET 6 onwards there is an Enumerable.Chunk extension method. Unfortunately (in my opinion), this method returns an IEnumerable<T[]>, which undermines the memory-conservation benefit of processing an IEnumerable<T> one element at a time:
public IEnumerable<HugeObject> CreateHugeObjects(int count) {
    for (var i = 0; i < count; ++i) {
        yield return new HugeObject(i);
    }
}

public static int AggregateSomehow(IEnumerable<HugeObject> chunk) {
    return 0;
}

public void Consume() {
    var source = CreateHugeObjects(1000);
    var chunks = source.Chunk(100);
    var result = chunks.Select(AggregateSomehow);
}
In this example, the underlying type of chunk in AggregateSomehow will be a HugeObject[] of length 100, meaning that 100 instances of HugeObject have to be loaded into memory simultaneously to perform the method call.
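The materialization is easy to observe with small values; here is a self-contained sketch using ints in place of the hypothetical HugeObject:

```csharp
using System;
using System.Linq;

class ChunkDemo
{
    static void Main()
    {
        // Chunk yields fully materialized arrays: all elements of a
        // chunk exist in memory at once before the consumer sees it.
        foreach (var chunk in Enumerable.Range(1, 10).Chunk(4))
        {
            Console.WriteLine($"{chunk.GetType().Name}: [{string.Join(", ", chunk)}]");
        }
        // Int32[]: [1, 2, 3, 4]
        // Int32[]: [5, 6, 7, 8]
        // Int32[]: [9, 10]
    }
}
```

In exchange for the memory cost, Chunk does guarantee source order and that only the final chunk can be shorter than the requested size.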
Before the availability of Enumerable.Chunk, I used to write my own extension named Partition like so:
public static IEnumerable<IEnumerable<T>> Partition<T>(this IEnumerable<T> source, int size) {
    if (source == null) {
        throw new ArgumentNullException(nameof(source));
    }
    if (size < 1) {
        throw new ArgumentOutOfRangeException(nameof(size));
    }
    using var e = source.GetEnumerator();
    while (e.MoveNext()) {
        yield return new Partitioner<T>(e, size);
    }
}
private class Partitioner<T> : IEnumerable<T>
{
    private class PartitionerEnumerator : IEnumerator<T>
    {
        private readonly IEnumerator<T> m_Source;
        private readonly int m_Size;
        private int m_Index = -1;
        private bool m_Disposed = false;

        public PartitionerEnumerator(IEnumerator<T> source, int size) {
            m_Source = source;
            m_Size = size;
        }

        public T Current => m_Source.Current;
        object IEnumerator.Current => Current;

        public void Dispose() {
            if (!m_Disposed) {
                m_Disposed = true;
                // Forward the shared enumerator past any elements of this
                // partition that were not consumed.
                while (++m_Index < m_Size && m_Source.MoveNext()) { }
            }
        }

        public bool MoveNext() {
            if (m_Index == -1) {
                // The outer loop in Partition already called MoveNext(),
                // so the first element is ready to be read.
                ++m_Index;
                return true;
            } else {
                return ++m_Index < m_Size && m_Source.MoveNext();
            }
        }

        public void Reset() => throw new NotSupportedException();
    }

    private readonly PartitionerEnumerator m_Enumerator;

    public Partitioner(IEnumerator<T> source, int size) {
        m_Enumerator = new PartitionerEnumerator(source, size);
    }

    // The same enumerator instance is returned on every call,
    // so each partition can only be enumerated once.
    public IEnumerator<T> GetEnumerator() => m_Enumerator;
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
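With that in place, usage mirrors the Chunk example (a sketch assuming the Partition extension above and the hypothetical CreateHugeObjects/AggregateSomehow from the first snippet). One caveat: the caller must actually enumerate (or dispose) each partition so that the shared enumerator advances.

```csharp
public void Consume() {
    var source = CreateHugeObjects(1000);
    foreach (var partition in source.Partition(100)) {
        // partition is a lazy sequence backed by the shared enumerator;
        // elements are created one at a time, never 100 at once.
        var value = AggregateSomehow(partition);
    }
}
```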
This approach takes into account three considerations:
- The original source is only enumerated once (which is often missed by Skip/Take implementations).
- Within normal nested/chained LINQ expressions, only one element at a time has to be in memory (which the built-in Chunk implementation gives up by materializing each chunk as an array).
- When a partition is only partly enumerated and then disposed prematurely, the PartitionerEnumerator.Dispose method makes sure that the underlying enumerator is still forwarded past the remaining elements of that partition (which is often missed by nested-loop approaches):
public static IEnumerable<IEnumerable<T>> PartitionWrong<T>(this IEnumerable<T> source, int size) {
    if (source == null) {
        throw new ArgumentNullException(nameof(source));
    }
    if (size < 1) {
        throw new ArgumentOutOfRangeException(nameof(size));
    }
    static IEnumerable<T> EnumeratePartition(IEnumerator<T> e, int size) {
        var i = 0;
        do {
            yield return e.Current;
        } while (++i < size && e.MoveNext());
    }
    using (var e = source.GetEnumerator()) {
        while (e.MoveNext()) {
            yield return EnumeratePartition(e, size);
        }
    }
}
This approach will work if all sub-sequences are fully enumerated, e.g. by calling Count or Sum on them, but it fails for partial enumeration such as calling First on them:
var source = new int[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 };
source.PartitionWrong(3).Select(c => c.Count()); // 3, 3, 3
source.PartitionWrong(3).Select(c => c.Sum()); // 6, 15, 24
source.PartitionWrong(3).Select(c => c.First()); // 1, 2, 3, 4, 5, 6, 7, 8, 9 but should be 1, 4, 7
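For contrast, the streaming Partition above handles the partial-enumeration case as well, because Dispose forwards the shared enumerator past each partition's unread elements (assuming the extension from earlier is in scope and the results are enumerated in order):

```csharp
source.Partition(3).Select(c => c.Count()); // 3, 3, 3
source.Partition(3).Select(c => c.Sum());   // 6, 15, 24
source.Partition(3).Select(c => c.First()); // 1, 4, 7
```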
My implementation will work for all of the above, but it still has several shortcomings. These weren't relevant for my applications, but the first two are probably the reason why the .NET team chose "the easy way out" and used an array that gets filled immediately:
- If you save all Partition objects first and then process them round-robin, one element at a time, the order of the original IEnumerable is not preserved across the partitions, i.e. the first partition is not guaranteed to contain the first three elements. As a side effect, if the number of elements does not divide evenly by the partition size, it is "random" which partition is shorter than size. It is not even guaranteed that only one partition is shorter.
- Using this in a parallelized context suffers from the same issues as (1), but TBH I never even looked into the thread safety of my code.
- Aborting enumeration early (like calling Any or All on the sub-sequences) will not prevent the remaining elements of the currently enumerated Partition from being created (although this is obviously true for Chunk as well, where all elements of a chunk are created upon entering it).
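Shortcoming (1) shows up even without any parallelism: materializing the outer sequence first advances the shared enumerator once per yielded partition before any partition is read (a sketch assuming the Partition extension above):

```csharp
var parts = Enumerable.Range(1, 9).Partition(3).ToList();
// The outer while loop has now called MoveNext() nine times, so parts
// holds nine partitions sharing an already-advanced enumerator -
// not the three partitions [1,2,3], [4,5,6], [7,8,9] one might expect.
Console.WriteLine(parts.Count); // 9
```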
So in a nutshell: if you are not planning on using parallelization, aren't dependent on ordered processing, and run into a memory problem when using .NET 6's Chunk, my old code might be helpful to you.