1

I tried to implement a custom LINQ Chunk function and found this code example. The function should split an IEnumerable into an IEnumerable of batches of a fixed size:

public static class EnumerableExtentions
{
    public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
    {
        using (var enumerator = source.GetEnumerator())
        {
            while (enumerator.MoveNext())
            {
                int i = 0;
                IEnumerable<T> Batch()
                {
                    do yield return enumerator.Current;
                    while (++i < size && enumerator.MoveNext());
                }
                yield return Batch();
            }
        }
    }
}

So I have a question: why are the results incorrect when I run a LINQ operation on the output? For example:

IEnumerable<int> list = Enumerable.Range(0, 10);
Console.WriteLine(list.Batch(2).Count()); // 10 instead of 5

I have an assumption that it happens because the inner IEnumerable Batch() is only triggered when Count() is called, and something goes wrong there, but I don't know what exactly.

Kosmonik
  • The outer loop shouldn't be calling MoveNext(). – ScottyD0nt Oct 31 '22 at 17:30
  • The problem is you didn't skip elements in your loop. – Rivo R. Oct 31 '22 at 17:40
  • Side note: what you are trying to do (have two or more iterators actively pointing to different positions in the original sequence) is simply impossible. So whatever you try will fail one way or another. You have to iterate the inner sequences non-lazily. – Alexei Levenkov Oct 31 '22 at 17:51

3 Answers

2

I have an assumption that it happens because the inner IEnumerable Batch() is only triggered when Count() is called

It's the opposite. The inner IEnumerable is not consumed when you call Count. Count only consumes the outer IEnumerable, which is this one:

while (enumerator.MoveNext())
{
    int i = 0;
    IEnumerable<T> Batch()
    {
        // the below is not executed by Count!
        // do yield return enumerator.Current;
        // while (++i < size && enumerator.MoveNext());
    }
    yield return Batch();
}

So all Count does is move the enumerator to the end and count how many times it moved it, which is 10.
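The underlying behavior is deferred execution: calling an iterator method only creates the sequence object, and its body runs when something enumerates it. A minimal sketch of that, where the iterator Numbers and the log list are hypothetical names introduced only for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var log = new List<string>();

// Hypothetical iterator that records when its body actually starts running.
IEnumerable<int> Numbers()
{
    log.Add("body started");
    yield return 1;
    yield return 2;
}

var sequence = Numbers();      // only creates the iterator object
int before = log.Count;        // the body has not run yet
Console.WriteLine(before);     // 0

int sum = sequence.Sum();      // enumerating the sequence runs the body
Console.WriteLine(log.Count);  // 1
Console.WriteLine(sum);        // 3
```

In the question's code, Count() enumerates the outer sequence but never enumerates the inner Batch() results, so their bodies never run.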

Compare that to how the author of this code likely intended it to be used:

foreach (var batch in someEnumerable.Batch(2)) {
    foreach(var thing in batch) {
        // ...
    }
}

Here I'm also consuming the inner IEnumerables with an inner loop, so the code inside the inner Batch runs. It yields the current element, then moves the source enumerator forward. It then yields the new current element, after which the ++i < size check fails. The outer loop moves the enumerator forward again for the next iteration. That is how you get a "batch" of two elements.

Notice that the "enumerator" (which came from someEnumerable) in the previous paragraph is shared between the inner and outer IEnumerables. Consuming either the inner or the outer IEnumerable moves the enumerator. The sequence of events in the previous paragraph only happens when you consume both in a very specific way: each inner IEnumerable must be fully consumed before the outer one advances. Only then do you get batches.
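To make the "very specific way" concrete, here is a rough sketch that consumes the same lazy batching logic in two different ways. LazyBatch is a hypothetical name: it is the question's Batch method rewritten as a local function (called as LazyBatch(numbers, 3) rather than as an extension method) so the snippet is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Same logic as the Batch method in the question (hypothetical name LazyBatch).
static IEnumerable<IEnumerable<T>> LazyBatch<T>(IEnumerable<T> source, int size)
{
    using (var enumerator = source.GetEnumerator())
    {
        while (enumerator.MoveNext())
        {
            int i = 0;
            IEnumerable<T> Inner()
            {
                do yield return enumerator.Current;
                while (++i < size && enumerator.MoveNext());
            }
            yield return Inner();
        }
    }
}

var numbers = Enumerable.Range(0, 10);

// Fully consuming each batch before asking for the next one works.
var full = LazyBatch(numbers, 3).Select(b => b.ToList()).ToList();
Console.WriteLine(full.Count); // 4

// Reading only the first element of each batch leaves the rest unread,
// so the outer loop starts a new "batch" at every element.
var firsts = LazyBatch(numbers, 3).Select(b => b.First()).ToList();
Console.WriteLine(string.Join(",", firsts)); // 0,1,2,3,4,5,6,7,8,9
```

The second call might be expected to print 0,3,6,9, but because First() stops after one element, the shared enumerator is never advanced past the rest of the batch.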

In your case, you can consume the inner IEnumerables by calling ToList:

Console.WriteLine(list.Batch(2).Select(x => x.ToList()).Count()); // 5

While sharing the enumerator here allows the batches to be lazily consumed, it limits the client code to only consume it in very specific ways. In the .NET 6 implementation of Chunk, the batches (chunks) are eagerly computed as arrays:

public static IEnumerable<TSource[]> Chunk<TSource>(this IEnumerable<TSource> source, int size)

You can do a similar thing in your Batch by calling ToArray() here:

yield return Batch().ToArray();

so that the inner IEnumerables are always consumed.
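Put together, the change might look like the sketch below. EagerBatch is a hypothetical name for the question's method with the ToArray() call added, written as a local function so the snippet runs on its own; it is not the .NET implementation of Chunk.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The question's Batch logic with the suggested ToArray() call (hypothetical name).
static IEnumerable<IEnumerable<T>> EagerBatch<T>(IEnumerable<T> source, int size)
{
    using (var enumerator = source.GetEnumerator())
    {
        while (enumerator.MoveNext())
        {
            int i = 0;
            IEnumerable<T> Inner()
            {
                do yield return enumerator.Current;
                while (++i < size && enumerator.MoveNext());
            }
            // Materialize the batch now, so the shared enumerator has already
            // moved past it before the outer loop continues.
            yield return Inner().ToArray();
        }
    }
}

var numbers = Enumerable.Range(0, 10);

int pairCount = EagerBatch(numbers, 2).Count();
Console.WriteLine(pairCount); // 5

var triples = EagerBatch(numbers, 3).ToList();
Console.WriteLine(triples.Count);         // 4
Console.WriteLine(triples.Last().Count()); // 1
```

With this change Count() gives the expected result even though the caller never enumerates the inner sequences, at the cost of holding one full batch in memory at a time.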

Sweeper
1

You have created an iterator inside an iterator, but only the outer iterator is executed by Count(). If you want the inner one to execute, you need to enumerate it, for example:

var batches = list.Batch(3);
foreach(var batch in batches) // the outer is executed
{
    int count = batch.Count(); // the inner iterator is executed now
}

Well, I would suggest a different approach for the Chunk method, like this:

public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> source, int size)
{
    T[]? bucket = null;
    var count = 0;

    foreach (var item in source)
    {
        bucket ??= new T[size];
        bucket[count++] = item;

        if (count != size)
            continue;

        yield return bucket;

        bucket = null;
        count = 0;
    }

    if (count > 0)
    {
        Array.Resize(ref bucket, count);
        yield return bucket;
    }
}
Tim Schmelter
  • I think it would be better to skip the alternative implementation in favor of linking to https://stackoverflow.com/questions/419019/split-list-into-sublists-with-linq and spend more time explaining why OP's code does not work (as they probably don't understand lazy evaluation) as well as explaining that two "pointers" into the same enumerable are not possible (I don't have a nice explanation, otherwise I'd post it myself)... – Alexei Levenkov Oct 31 '22 at 17:56
  • I know about your approach and have already implemented it. But I was interested in the behavior of IEnumerable and yield. So now I understand, thanks a lot – Kosmonik Oct 31 '22 at 18:38
-2

Try it this way:

public static IEnumerable<IEnumerable<T>> Batch<T>(this IEnumerable<T> arr, int size)
{
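  // Ceiling division: avoids yielding an extra empty batch when Count() is a multiple of size.
  for (var i = 0; i < (arr.Count() + size - 1) / size; i++)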
  {
    yield return arr.Skip(i * size).Take(size);
  }
}
Rivo R.
  • 1) This in no way answers the question asked. 2) This is a horrifically inefficient implementation of this method, given how much it re-iterates the sequence from the start over and over. 3) This iterates the source many times, which is particularly problematic if the sequence has side effects or does any expensive computation (most commonly DB or other IO operations to get the data); additionally, the sequence may not produce the same number of items each time it is iterated, so the multiple enumeration affects both performance and correctness. – Servy Oct 31 '22 at 17:47
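The multiple-enumeration concern in the comment above can be observed with a source that counts how many items it hands out. This is only a sketch: Source, pulled, and SkipTakeBatch are hypothetical names, and SkipTakeBatch is a Skip/Take batcher in the style of the answer (using ceiling division for the batch count), not the answer's exact code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int pulled = 0;

// Hypothetical source with a side effect: counts every item it produces.
IEnumerable<int> Source()
{
    for (int n = 0; n < 10; n++)
    {
        pulled++;
        yield return n;
    }
}

// Skip/Take batching in the style of the answer (hypothetical name).
static IEnumerable<IEnumerable<T>> SkipTakeBatch<T>(IEnumerable<T> arr, int size)
{
    for (var i = 0; i < (arr.Count() + size - 1) / size; i++)
    {
        yield return arr.Skip(i * size).Take(size);
    }
}

var batches = SkipTakeBatch(Source(), 2).Select(b => b.ToList()).ToList();

Console.WriteLine(batches.Count); // 5
// The source holds only 10 items, but every Count() and every Skip()
// restarts it from the beginning, so far more than 10 items are pulled.
Console.WriteLine(pulled);
```

If Source() read from a database or a network stream, each of those restarts would repeat the underlying work.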