
I wrote a method to subdivide a list of items into multiple lists using System.Linq. When I run this method on 50,000 simple integers, it takes about 59.862 seconds.

Stopwatch watchresult0 = new Stopwatch();
watchresult0.Start();
var result0 = SubDivideListLinq(Enumerable.Range(0, 50000), 100).ToList();
watchresult0.Stop();
long elapsedresult0 = watchresult0.ElapsedMilliseconds;

So I tried to speed it up and rewrote it with a simple loop that iterates over each item in my list; this version needs only 4 milliseconds:

Stopwatch watchresult1 = new Stopwatch();
watchresult1.Start();
var result1 = SubDivideList(Enumerable.Range(0, 50000), 100).ToList();
watchresult1.Stop();
long elapsedresult1 = watchresult1.ElapsedMilliseconds;

This is my subdivide method using LINQ:

private static IEnumerable<List<T>> SubDivideListLinq<T>(IEnumerable<T> enumerable, int count)
{
    while (enumerable.Any())
    {
        yield return enumerable.Take(count).ToList();
        enumerable = enumerable.Skip(count);
    }
}

And this is my subdivide method with a foreach loop over each item:

private static IEnumerable<List<T>> SubDivideList<T>(IEnumerable<T> enumerable, int count)
{
    List<T> allItems = enumerable.ToList();

    List<T> items = new List<T>(count);
    foreach (T item in allItems)
    {
        items.Add(item);

        if (items.Count != count) continue;
        yield return items;
        items = new List<T>(count);
    }

    if (items.Any())
        yield return items;
}

Do you have any idea why my own implementation is so much faster than dividing with LINQ? Or am I doing something wrong?

And: as you can see, I know how to split lists, so this is not a duplicate of the related question. I wanted to know about the performance difference between LINQ and my implementation, not how to split lists.

Matthias Burger
  • The LINQ version will execute the query twice per iteration. It also has empty passes that still have to walk to the last position, whereas your optimized method can keep processing. Your method also initializes the list with the correct size, whereas the LINQ version has to resize its internal array every time. – Tim Schmelter Dec 19 '16 at 11:45
  • 1
    also don't use `while (enumerable.Any())`. with some iterators you may miss values. you must either safely get all values like using `foreach` or else use `MoveNext` and `Current` – M.kazem Akhgary Dec 19 '16 at 11:48
  • But `enumerable.Any()` just returns true/false when `enumerable` is assigned the `Skip`ped items. Why would I miss values here? – Matthias Burger Dec 19 '16 at 11:50
  • Some iterators don't reset their state; that is, `Any` will advance the position of the iterator. Let's say you get the value `1`... but the next time you use the iterator you get `2`. This is not the case for most iterators, but you should be careful. – M.kazem Akhgary Dec 19 '16 at 11:51
  • @nvoigt WT*? Where is that a duplicate? I know how to split lists, I just wanted to know about the performance. Some people really should start reading the questions. Thanks @M.kazemAkhgary, I'll remove the question and just use my implementation. – Matthias Burger Dec 19 '16 at 11:54
  • 1
    Yeah, I don't think this is a duplicate because the "duplicate" post does not cover the performance. Actually the poor performance LINQ code is coming from a (for some unknown reason) highly upvoted answer. – Ivan Stoev Dec 19 '16 at 11:58
  • You obviously don't know how to split lists *correctly*. Have a look at the duplicate on how to improve your code and your performance problem will be gone. – nvoigt Dec 19 '16 at 11:58
  • @nvoigt So all questions can be marked as duplicates because people don't know how to program _correctly_? Thanks for the discussion! – Matthias Burger Dec 19 '16 at 12:00
  • @MatthiasBurger Just don't use the LINQ approach. The problem is that every iteration executes all the previous queries, so at some point you'll have 500 chained `MoveNext`. And `Skip` is not optimized. (A sketch after these comments illustrates this.) – Ivan Stoev Dec 19 '16 at 12:04
  • 1
    If you don't want to learn from a duplicate, that's your problem. I don't see value in copy/pasting half the thread over to this thread. *Performance* is something you worry over after you did it *right*. As long as you do it wrong, there is little value in performance. I once wrote a packing algorithm that was perfect. It was super fast and super compressing. Wasn't able to uncompress it though. Oops. Get it correct first, fast later, because fast is worth nothing as long as there are mistakes. – nvoigt Dec 19 '16 at 12:04
  • And just for the record, enumerating a sequence more than once is something I would consider a mistake on this one because you never know where it comes from and evaluating the `IEnumerable` might as well take longer than your whole method. – nvoigt Dec 19 '16 at 12:06
  • @nvoigt Not all SO users are experts, just remember that. Both methods return correct values. As you can see, the linked question has an answer with the same LINQ method that is upvoted over 200 times. – Matthias Burger Dec 19 '16 at 12:11
  • 1
    Thanks for all the answeres, I think I got it know :) – Matthias Burger Dec 19 '16 at 12:14
  • 2
    fyi, [MoreLinq](https://github.com/morelinq/MoreLINQ) has [`Batch`](https://github.com/morelinq/MoreLINQ/blob/master/MoreLinq/Batch.cs) –  Dec 19 '16 at 12:19
  • 1
    @AndreasNiedermair wow that's nice - and would do nearly the same like my method. Thanks I will take a look at the library and what it offers. :) – Matthias Burger Dec 19 '16 at 12:22
  • Also, you might want to consider using a profiler to see what your code is causing LINQ to do under the hood. – Chris O Dec 19 '16 at 13:49
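
To make the repeated enumeration described in the comments visible, here is a small self-contained sketch (my own illustration, not part of the original post; the source size and chunk size are arbitrary) that counts how many elements the Skip-chaining version pulls from its source:

using System;
using System.Collections.Generic;
using System.Linq;

class SkipChainDemo
{
    static int itemsPulled;

    // A lazy source that counts every element it hands out.
    static IEnumerable<int> CountingSource(int n)
    {
        for (int i = 0; i < n; i++)
        {
            itemsPulled++;
            yield return i;
        }
    }

    // Same structure as the LINQ version from the question.
    static IEnumerable<List<T>> SubDivideListLinq<T>(IEnumerable<T> enumerable, int count)
    {
        while (enumerable.Any())
        {
            yield return enumerable.Take(count).ToList();
            enumerable = enumerable.Skip(count); // each Skip wraps the previous chain
        }
    }

    static void Main()
    {
        SubDivideListLinq(CountingSource(1000), 100).ToList();
        // Prints far more than the 1,000 source elements (roughly 11,000 with the
        // classic LINQ-to-Objects iterators), because every Any() and Take().ToList()
        // call re-enumerates the ever-growing Skip chain from the start of the source.
        Console.WriteLine(itemsPulled);
    }
}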

2 Answers


If someone comes here with the same question:

So finally I did some more research and found that the repeated enumeration with System.Linq is the cause of the poor performance:

When I enumerate it into an array first, to avoid the repeated enumeration, the performance gets much better (14 ms for 50k items):

T[] allItems = enumerable as T[] ?? enumerable.ToArray();
while (allItems.Any())
{
    yield return allItems.Take(count);
    allItems = allItems.Skip(count).ToArray();
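    // Note: Skip(count).ToArray() copies all remaining items into a new array on every
    // iteration, so the total amount of copying still grows quadratically with the input size.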
}

Still, I won't use the LINQ approach, since it's slower. Instead I wrote an extension method to subdivide my lists; it takes 3 ms for 50k items:

public static class EnumerableExtensions
{
    public static IEnumerable<List<T>> Subdivide<T>(this IEnumerable<T> enumerable, int count)
    {

        List<T> items = new List<T>(count);
        int index = 0;
        foreach (T item in enumerable)
        {
            items.Add(item);
            index++;
            if (index != count) continue;
            yield return items;
            items = new List<T>(count);
            index = 0;
        }
        if (index != 0 && items.Any())
            yield return items;
    }
}
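
For completeness, a minimal usage sketch of this extension method (my own example, not part of the original answer; it assumes `using System.Linq;` and that the `EnumerableExtensions` class above is in scope):

// Splits 50,000 integers into 500 lists of 100 items each.
List<List<int>> chunks = Enumerable.Range(0, 50000).Subdivide(100).ToList();
Console.WriteLine(chunks.Count);    // 500
Console.WriteLine(chunks[0].Count); // 100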

As @AndreasNiedermair already wrote, this functionality is also available in the MoreLinq library as `Batch`. (But I won't add the library just for this one method.)
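
For reference, a short usage sketch of MoreLinq's `Batch` (assuming the MoreLINQ NuGet package is referenced; this example is mine, not part of the original answer):

using MoreLinq; // from the MoreLINQ NuGet package

// Batch yields the source in chunks of up to 100 items each.
IEnumerable<IEnumerable<int>> batches = Enumerable.Range(0, 50000).Batch(100);
List<List<int>> lists = batches.Select(b => b.ToList()).ToList();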

Matthias Burger

If you are after readability as well as performance, you may want to use this algorithm instead. In terms of speed it is really close to your non-LINQ version, and at the same time it's much more readable: it groups consecutive items by `index / count` (integer division), so every block of `count` items ends up in the same group.

private static IEnumerable<List<T>> SubDivideListLinq<T>(IEnumerable<T> enumerable, int count)
{
    int index = 0;
    return enumerable.GroupBy(l => index++/count).Select(l => l.ToList());
}

And its query-syntax alternative:

private static IEnumerable<List<T>> SubDivideListLinq<T>(IEnumerable<T> enumerable, int count)
{
    int index = 0;
    return from l in enumerable
           group l by index++ / count into g
           select g.ToList();
}

Another alternative:

private static IEnumerable<List<T>> SubDivideListLinq<T>(IEnumerable<T> enumerable, int count)
{
    int index = 0;
    return enumerable.GroupBy(l => index++/count, 
                             item => item, 
                             (key,result) => result.ToList());
}

On my computer I get 0.006 s for the LINQ version versus 0.002 s for the non-LINQ version, which makes it completely fair and acceptable to use LINQ.

As a piece of advice: don't torture yourself micro-optimizing code. Clearly no one is going to feel a difference of a few milliseconds, so write code that you and others can easily understand later.
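
For illustration, here is a minimal, self-contained usage sketch of the GroupBy-based version above (my own example; the method body is copied from the first variant):

using System;
using System.Collections.Generic;
using System.Linq;

class GroupByChunkDemo
{
    private static IEnumerable<List<T>> SubDivideListLinq<T>(IEnumerable<T> enumerable, int count)
    {
        int index = 0;
        // Integer division assigns the same key to every block of `count` consecutive items.
        return enumerable.GroupBy(l => index++ / count).Select(g => g.ToList());
    }

    static void Main()
    {
        var chunks = SubDivideListLinq(Enumerable.Range(0, 50000), 100).ToList();
        Console.WriteLine(chunks.Count);    // 500
        Console.WriteLine(chunks[0].Count); // 100
    }
}

One thing to keep in mind: `GroupBy` buffers the entire source before the first group is returned, whereas the foreach-based version yields each chunk as soon as it is full.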

M.kazem Akhgary
  • Yeah, you are completely right, and I love to use LINQ. A difference of 0.004 seconds is okay; a difference of ~1 minute is not, and that's why I asked the question :) This looks much more readable and pretty nice. With your comments above and your answer now, I'm going to accept your answer and use it. – Matthias Burger Dec 19 '16 at 14:15
  • Now you both ruined the original post. I participated in reopening it as a non-duplicate because it was about the performance of a specific implementation. This answer and the accept push it back toward the duplicate, which contains a lot of answers about **how** to perform the split, including ones similar to (or exactly the same as) this. If you believe this answer adds something to the subject, go and post it under the question that was identified as the duplicate. It definitely does not answer **this post's** question. – Ivan Stoev Dec 19 '16 at 14:38