0

I'm trying to write an algorithm that yields Ts in "batches" of fixed size. (The source could be infinitely long).

Ex.

    int[] arr = { 5, 1, 8, 10, 50, 4, 37, 8 };
    int size = 2;
    foreach(var batch in arr.Batches(size))
    {
        Console.WriteLine(string.Join(",", batch)); 
    }

----->

5,1
8,10
50,4
37,8

Of course I try something like

public static class Extensions
{
    public static IEnumerable<IEnumerable<T>> Batches<T>(this IEnumerable<T> source, int batchSize)
    {
        for(var mover = source.GetEnumerator(); ; )
        {
            IEnumerable<T> batch = LimitMoves(mover, batchSize);
            if(!batch.Any())
            {
                yield break;
            }
            yield return batch;
        }
    }

    private static IEnumerable<T> LimitMoves<T>(IEnumerator<T> mover, int limit)
    {
        for(int i = 0; i < limit && mover.MoveNext(); ++i)
        {
            yield return mover.Current;
        }       
    }
}

and get

1,8
50,4
8
user7127000
  • 3,143
  • 6
  • 24
  • 41

4 Answers4

2

Sergey's is fine, except I hate infinite loops with alternate means of breaking. Why not use the language structs are they are designed:

public static IEnumerable<IEnumerable<T>> Batches<T>(this IEnumerable<T> source, 
                                                     int batchSize)
{
    var mover = source.GetEnumerator();
    while(mover.MoveNext()) 
            yield return LimitMoves(mover, batchSize);
}
James Curran
  • 101,701
  • 37
  • 181
  • 258
1

Simply move to next item before you enter LimitMoves method and inside that method yield current item without additional MoveNext() call (see notes below to understand why your current code is not working and what other issues your code has):

public static IEnumerable<IEnumerable<T>> Batches<T>(
   this IEnumerable<T> source, int batchSize)
{
    for (var mover = source.GetEnumerator(); ;)
    {
        if (!mover.MoveNext()) // there is no items for next batch
            yield break;
        else
            yield return LimitMoves(mover, batchSize);
    }
}

private static IEnumerable<T> LimitMoves<T>(IEnumerator<T> mover, int limit)
{
    // if you are here then there is an item which you can yield
    do
    {
        yield return mover.Current;
    }
    while (--limit > 0 && mover.MoveNext());
}

Output:

5,1
8,10
50,4
37,8

Note 1: your problem was in batch.Any() call which moved 'cursor' to the next item in the source sequence before you entered LimitMoves method. Then in the for loop condition, you moved once again at limit && mover.MoveNext() verification. Thus item which was current when you entered LimitMoves was not yielded.

Note 2: You should always dispose enumerator, and use loops appropriately - don't use a for loop for variable initialization and iterating without any condition - it makes your code hard to understand and maintain. Loop condition should be explicit and easy to see. E.g.

public static IEnumerable<IEnumerable<T>> Batches<T>(
   this IEnumerable<T> source, int batchSize)
{
    using(var mover = source.GetEnumerator())
    {
        while (mover.MoveNext())
           yield return LimitMoves(mover, batchSize);
    }
}

Note 3: as @Rene stated, you should understand, that this approach requires full consuming of each batch when you are enumerating batches. Similar solution, as well as alternatives can be found here: Create batches in LINQ

Sergey Berezovskiy
  • 232,247
  • 41
  • 429
  • 459
  • 1
    @TimSchmelter No, it's not. The OP's code skips an item for every batch, because it was written incorrectly by iterating a sequence that causes side effects multiple times. Granted, the question doesn't properly explain the problem that it's having (making it a poor question). – Servy Jul 18 '17 at 14:31
  • @Servythe OP properly explained the goal and symptoms. He's not expected to know the root cause of the problem. – James Curran Jul 18 '17 at 14:33
  • 1
    @JamesCurran The OP explained what the entire project is. They didn't explain the problem that they had with their solution. Obviously they aren't expected to know the root cause, but they *are* expected to explain what isn't working with their solution. In this specific case, the OP should be saying that the first item in each batch isn't being shown, even if they don't know why. – Servy Jul 18 '17 at 14:35
  • It's *really* bothering me that you've got a loop with no loop condition, and then a loop condition inside of the loop that breaks it. Could you just put the logical loop condition into the condition of the loop. – Servy Jul 18 '17 at 14:39
  • @Servy He gave what he wanted, what he had written, and the (wrong) values he was getting. And as you were writing your comment, I was thinking the same thing and your second comment, and writting my own solution to it. – James Curran Jul 18 '17 at 14:42
  • Your second solution of course now has me realizing that your first doesn't dispose of the enumerator. – Servy Jul 18 '17 at 14:42
  • @Servy in the first solution I just fixed issue with 'peeking' next items which OP was asking for. But I also added a sample with correct loop usage and disposing enumerator – Sergey Berezovskiy Jul 18 '17 at 14:43
  • @JamesCurran And yet doesn't even so much as say that the values are wrong, just give a list of values, with no question, or problem statement at all. – Servy Jul 18 '17 at 14:43
  • @SergeyBerezovskiy Yes, the first solution is peeking the next item appropriately, but it's not disposing of the enumerator when it's done. That's important. – Servy Jul 18 '17 at 14:44
  • If you're going to have a note explaining why your first solution is wrong, why are you proposing your first solution? Just replace that one with your second one, which doesn't have the problems you've mentioned. – Servy Jul 18 '17 at 14:51
  • @Servy sorry, but there is no single correct way of answer composition. I prefer to fix problem which OP asks to fix (and my first solution does that), then or before I explain why code of OP didn't work, and later I suggest improvements which are not related to original question - disposing enumerators, correct naming, etc – Sergey Berezovskiy Jul 18 '17 at 14:54
  • @RenéVogt thanks, yes, I have noted that as well :) It's a known issue of returning enumerators - same as [here](https://stackoverflow.com/a/17598878/470005) – Sergey Berezovskiy Jul 18 '17 at 14:58
  • I'm still puzzled, it seems to work with a single call to `string.Join`, but `ToArray()` already raises an `InvalidOperationException` because the "enumeration has been finished" (don't know if that's the correct english wording, I have a german exception message). – René Vogt Jul 18 '17 at 15:01
  • @RenéVogt Given the way the method is implemented, you need to iterate each inner sequence before you ask for the next sequence. – Servy Jul 18 '17 at 15:19
  • @RenéVogt that's because in order to convert batches to array you should enumerate each batch. Then you put enumerators to array. But each enumerator is at the last position. And you see an error. This way of batching is good if you need to consume batches all and once. Otherwise you should put each batch into collection instead of returning enumerators. See the linked question from note 3 – Sergey Berezovskiy Jul 18 '17 at 15:19
0

The .Any() method causes the underlying enumerator to advance which was causing a step to be skipped. Here is a version with the code corrected.

public static class Extensions
{
    public static IEnumerable<IEnumerable<T>> Batches<T>(this IEnumerable<T> source, 
                                                              int batchSize)
    {
        using (var enumerator = source.GetEnumerator())
            while (enumerator.MoveNext())
                yield return enumerator.GetPage(batchSize);
    }

    private static IEnumerable<T> GetPage<T>(this IEnumerator<T> source, 
                                                  int batchSize)
    {
        for (var i = 0; i < batchSize; i++)
            if (i == 0 || source.MoveNext())
                yield return source.Current;
            else
                yield break; // not really needed but works as an early exit
    }
}

Here is an example use of the above code...

static void Main()
{
    var set = new[] { 5, 1, 8, 10, 50, 4, 37, 8, 1 };
    var batches = set.Batches(2);
    var result = string.Join("\r\n", 
                             batches.Select(batch => string.Join(",", batch)));
    Console.WriteLine(result);
}

... and the results

5,1
8,10
50,4
37,8
1
Matthew Whited
  • 22,160
  • 4
  • 52
  • 69
0

Answers like the one of Sergey or James look nice. But their problem is that the user of these extension has to know exactly how to use the result.

yield tells the compiler to generate a state machine, a type implementing IEnumerable<TSource> with a new MoveNext method that will only be called when you iterate through the sequence. So only the outer MoveNext calls are execute when you call Batches. But the inner MoveNext calls are executed deferred.

So calling things like ToArray() or Count() on the resulting sequence will have strange results. You'll have to iterate through the batches one by one explicitly, even if you only need the 1000th batch.

So as long as there is no memory problem, I prefer a solution like this:

public static IEnumerable<IEnumerable<TSource>> Batches<TSource>(this IEnumerable<TSource> source, int size)
{
    if (source == null) throw new ArgumentNullException(nameof(source));
    if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size), size, "Value must be greater than zero!");

    return BatchIterator(source, size);
}

private static IEnumerable<IEnumerable<TSource>> BatchIterator<TSource>(IEnumerable<TSource> source, int size)
{
    using (IEnumerator<TSource> enumerator = source.GetEnumerator())
    {
        while (enumerator.MoveNext())
        {
            TSource[] array = new TSource[size];
            array[0] = enumerator.Current;
            for (int current = 1; current < size; current++)
            {
                if (!enumerator.MoveNext())
                {
                    yield return array.Take(current);
                    yield break;
                }

                array[current] = enumerator.Current;
            }

            yield return array;
        }
    }
}

This probably will perform worse in terms of memory usage and speed, as James and Servy mentioned. If that's a problem, you can use the "yield-only" solution, but every consumer of your extension then has to be aware of how to correctly iterate the returned batches (see the comments at Sergey's answer and this answer).

René Vogt
  • 43,056
  • 14
  • 77
  • 99
  • 1
    But this requires eager consumption of the batch, and forces an entire batch to be in memory. When you have large batches, that have a large memory footprint, that's a problem. Neither solution is strictly better, they just each have different flaws. – Servy Jul 18 '17 at 15:21
  • Also, preventing the caller from having an underlying `IEnumerable` that's an array is actually harmful, rather than helpful. You're never using the array again, so if the caller *does* cast to it, or even change it, it doesn't really matter; the code still works, and regardless, you don't really need to stop someone from shooting themselves in the foot (after all, they could use reflection to get the array out of the `Select` enumerator too). Having the underlying sequence be an array is valuable though, as many LINQ operations are more effective if called on a collection (i.e. Count). – Servy Jul 18 '17 at 15:24
  • 1
    @Servy agree, but to me it feels strange to have a method that returns a sequence of sequence where the caller has to know exactly how to iterate them correctly. – René Vogt Jul 18 '17 at 15:25
  • @Servy (second comment) interesting point, I have to think about that, wrote this method some months ago and was sure it's a good idea to hide the array, but I maybe totally wrong with that and remove this part from the answer. – René Vogt Jul 18 '17 at 15:27
  • Sure, although the "correct" way to iterate it is the way that you should typically be iterating most sequences, and the way most people actually do iterate most sequences. It's also the *only* way to do it with the smallest possible memory footprint. There is no way of avoiding that particular problem without creating a larger memory footprint (although you do have a few choices as to what you bring in and how). – Servy Jul 18 '17 at 15:28
  • The idea of not making it possible for a caller to cast a returned value to something they shouldn't have is a general practice that I find to be not productive. You're not actually stopping a malicious user from doing something bad in (you're just making them spend an extra few minutes writing the code to do it), and in general, a malicious *developer* writing code is almost always a lost cause already; they already have the tools to hose the system. So then it comes down to whether you think someone could cast the value *by accident*. I personally don't see that as an actual risk. – Servy Jul 18 '17 at 15:32
  • 1
    @Servy whatever is a "correct" way to iterate...I dislike the fact that an `IEnumerable<>` is returned that throws an exception if I simply call `ToArray()` and worse, no matter what I give as `size`, calling `Count()` on the result always returns `8`. – René Vogt Jul 18 '17 at 15:33
  • And I'm sure that you'd dislike it even more if the code threw an Out Of Memory exception every single time you did anything with the sequence at all, due to a single batch not fitting in memory. Obviously if a larger memory footprint isn't a concern, you can get additional features that make the sequence much more usable. If you *can't* afford the additional memory footprint, then you have no choice, and your solution *simply doesn't work*, regardless of what features it's intended to support. – Servy Jul 18 '17 at 15:34
  • 1
    @Servy as I said, I agree that both solutions have pros and cons...just wanted to make clear why I prefer the array-solution (as long as memory usage is not a problem). – René Vogt Jul 18 '17 at 15:36