
I have a list, let's say it contains 1000 items. I want to end up with 10 lists of 100 items each, using something like:

myList.Select(x => x.y).Take(100) (until list is empty)

So I want Take(100) to run ten times, since the list contains 1000 items, and end up with a list containing 10 lists of 100 items each.

user3115696
  • Look at this question http://stackoverflow.com/questions/419019/split-list-into-sublists-with-linq – Paweł Reszka Nov 05 '14 at 13:03
  • @PawełReszka that solution simply splits the entire list which would pull the full 1000 items - the OP is looking for a *paginated* approach where 100 items are queried each time. – James Nov 05 '14 at 13:05
  • @James That's something that I was going to ask for clarification on. He says he has a list with 1000 items. It doesn't say `EF` or `SQL` anywhere in the question yet -- so I assumed myList is already a List – hometoast Nov 05 '14 at 13:06
  • @James. You could be right but he says "end up with list containing 10 lists which each contains 100 items." – Tom Blodget Nov 05 '14 at 13:14
  • @TomBlodget hmm, perhaps I'm the one who has misunderstood the requirement. However, `Skip(x).Take(x)` will do the job either way but if that is the requirement then the linked solution is a neater approach. The OP will have to clarify exactly what they want... – James Nov 05 '14 at 13:16

3 Answers


You need to Skip the number of records you have already taken; you can keep track of this number and use it when you query:

var alreadyTaken = 0;
while (alreadyTaken < 1000) {
    var pagedList = myList.Select(x => x.y).Skip(alreadyTaken).Take(100);
    ...
    alreadyTaken += 100;
}
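
If the goal is specifically to end up with a list of 10 lists, a minimal sketch of the same idea that collects each page could look like this (assuming myList is an in-memory list and x.y is an int; adjust the element type as needed):

var pages = new List<List<int>>();   // assumes x.y is an int
var alreadyTaken = 0;
while (alreadyTaken < myList.Count) {
    pages.Add(myList.Select(x => x.y).Skip(alreadyTaken).Take(100).ToList());
    alreadyTaken += 100;
}
// for a 1000-item source, pages now holds 10 lists of 100 items each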
James

This can be achieved with a simple paging extension method.

public static List<T> GetPage<T>(this List<T> dataSource, int pageIndex, int pageSize = 100)
{
    return dataSource.Skip(pageIndex * pageSize)
        .Take(pageSize)
        .ToList();
}

Of course, you can extend it to accept and/or return any kind of IEnumerable<T>.
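
For example, a possible IEnumerable<T> overload (just a sketch of that suggestion, not part of the original extension) might look like this:

public static IEnumerable<T> GetPage<T>(this IEnumerable<T> dataSource, int pageIndex, int pageSize = 100)
{
    // Defers execution; the page is only materialized when it is enumerated.
    return dataSource.Skip(pageIndex * pageSize).Take(pageSize);
}

// Hypothetical usage: fetch the third page (pageIndex is zero-based here)
var thirdPage = myList.Select(x => x.y).GetPage(2).ToList();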

RePierre

As already posted, you can use a loop and Skip some elements and Take some elements. This way you create a new query on every iteration. But a problem arises if you also want to iterate each of those queries, because this is very inefficient. Let's assume you have just 50 entries and you want to go through your list ten elements at a time. You will have 5 iterations doing:

  1. .Skip(0).Take(10)
  2. .Skip(10).Take(10)
  3. .Skip(20).Take(10)
  4. .Skip(30).Take(10)
  5. .Skip(40).Take(10)

Two problems arise here.

  1. Skipping elements still requires computation. In your first query you only compute the needed 10 elements, but in your second iteration you compute 20 elements and throw 10 away, and so on. If you sum all 5 iterations together you have already computed 10 + 20 + 30 + 40 + 50 = 150 elements, even though you only have 50 elements. This results in O(n^2) performance.
  2. Not every IEnumerable behaves this way. Some sources, such as a database, can optimize a Skip, for example by translating it to an OFFSET clause in the SQL query (MySQL). But that still doesn't solve the problem. You still create 5 different queries and execute all 5 of them, and those five round trips now take most of the time, because even a simple query to a database is a lot slower than skipping some in-memory elements or computations.

Because of all these problems it makes sense not to use a loop with multiple .Skip(x).Take(y) calls if you also want to evaluate every query. Instead your algorithm should go through your IEnumerable only once, executing the query once, returning the first 10 elements on the first iteration, the next 10 on the next iteration, and so on, until it runs out of elements.

The following Extension Method does exactly this.

public static IEnumerable<IReadOnlyList<T>> Combine<T>(this IEnumerable<T> source, int amount) {
    var combined = new List<T>();
    var counter  = 0;
    foreach ( var entry in source ) {
        combined.Add(entry);
        if ( ++counter >= amount ) {
            // A full chunk was collected; yield it and start a new one.
            yield return combined;
            combined = new List<T>();
            counter  = 0;
        }
    }

    // Yield the last, possibly smaller, chunk.
    if ( combined.Count > 0 )
        yield return combined;
}

With this you can just do

someEnumerable.Combine(100)

and you get a new IEnumerable<IReadOnlyList<T>> that goes through your enumeration only once, slicing everything into chunks of at most 100 elements.
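
To make the usage concrete, here is a small consumption sketch (ProcessPage is a hypothetical stand-in for whatever you do with each chunk):

foreach (var page in someEnumerable.Combine(100)) {
    // page is an IReadOnlyList<T> with at most 100 elements
    ProcessPage(page);
}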

Just to show how big the performance difference can be:

var numberCount  = 100000;
var combineCount = 100;

var nums  = Enumerable.Range(1, numberCount);
var count = 0;

// Benchmark with the Combine() extension
var swCombine  = Stopwatch.StartNew();
var sumCombine = 0L;
var pages      = nums.Combine(combineCount);
foreach ( var page in pages ) {
    sumCombine += page.Sum();
    count++;
}
swCombine.Stop();
Console.WriteLine("Count: {0} Sum: {1} Time Combine: {2}", count, sumCombine, swCombine.Elapsed);

// Doing it with .Skip(x).Take(y)
var swTakes = Stopwatch.StartNew();
count = 0;
var sumTaken = 0L;
var alreadyTaken = 0;
while ( alreadyTaken < numberCount ) {
    sumTaken += nums.Skip(alreadyTaken).Take(combineCount).Sum();
    alreadyTaken += combineCount;
    count++;
}
swTakes.Stop();
Console.WriteLine("Count: {0} Sum: {1} Time Takes: {2}", count, sumTaken, swTakes.Elapsed);

The version using the Combine() extension method runs in 3 milliseconds on my computer (i5 @ 4 GHz), while the loop needs 178 milliseconds.

If you have many more elements or the slices are smaller, it gets even worse. For example, if combineCount is set to 10 instead of 100, the runtimes change to 4 milliseconds and 1800 milliseconds (1.8 seconds).

Now you might say that you don't have that many elements or that your slices never get that small. But remember, in this example I just generated a sequence of numbers with nearly zero computation time. The whole overhead from 3 milliseconds to 178 milliseconds is caused purely by the re-evaluation and skipping of values. If you have something more complex going on behind the scenes, the skipping creates most of the overhead, and even if an IEnumerable can optimize Skip, like a database as explained above, the example still gets worse, because the biggest overhead becomes the execution of the queries themselves.

And the number of queries grows quickly. With 100,000 elements and a chunk size of 100 you will already execute 1,000 queries. The Combine extension provided above, on the other hand, always executes your query once and never suffers from any of the problems described above.


All of that doesn't mean that Skip and Take should be avoided. They have their place. But if you really plan to go through every element you should avoid using Skip and Take to get your slicing done.

If all you want is to slice everything into pages of 100 elements and you only want to fetch, say, the third page, then you just need to calculate how many elements to Skip.

var pageSize = 100;
var pageNumberToGet = 3;
var thirdPage = yourEnumerable.Skip(pageSize * (pageNumberToGet - 1)).Take(pageSize);

This way you get elements 201 to 300 in a single query. An IEnumerable backed by a database can also optimize that, so you end up with just one query. So, if you only want a specific range of elements from your IEnumerable, then you should use Skip and Take as shown above instead of the Combine extension method I provided.

David Raab
  • "*Every time you use .Skip(x).Take(y) the whole `IEnumerable` expression get re-evaluated*" - this is slightly misleading, if the query is coming from a DB context then `Skip` / `Take` certainly wouldn't return the entire result set, it would be optimized to only return the relevant rows. Also, if `myList` was already a `TList` then there would be nothing to enumerate. Your points are valid, however, the answer suggests that this is the case for *any* `IEnumerable`, which isn't the case. – James Nov 06 '14 at 10:51
  • @James Yes, that's true; actually for a normal `IEnumerable`, `Skip` and `Take` also don't execute the query immediately, only when you start getting values from the query afterwards. In the `for` example only `Sum()` starts the execution. But the problem is still that the `for` loop leads to a lot of execution. Even if a database optimizes `Skip` and `Take`, you still execute a query on every loop iteration, and that is even more of a problem for a database than throwing some values away. I'll look into rewriting the answer to make the problem clearer. – David Raab Nov 06 '14 at 11:01
  • Yeah I understand that `Skip` / `Take` don't materialize the query, I was simply trying to explain the point that even when they do, they wouldn't pull down the entire record set (as your answer *kind of* suggests). I am not sure what you mean by "*a problem for a database as to throw some values away*" - the DB shouldn't be throwing *anything* away, however, if the intention is to provide a *paginated* query then you wouldn't naturally query using a `for` anyway, if you need the entire list then `Skip` / `Take` don't really make sense, it's more efficient to perform 1 query. – James Nov 06 '14 at 11:13
  • Yeah, I knew they don't pull down the entire record set. I never said they would; a `.Skip(100).Take(100)` will only pull 200 entries at most. A database can still optimize it to pull only 100 entries. But with 100,000 entries and a page size of 100 you will still issue 1,000 queries to the database, and that is the real problem with a database, not whether each of those 1,000 queries is optimized to return just 100 elements. _deal with the rest in memory_ => that is exactly what my example `for` code above does, and it is slow using a `.Skip().Take()` approach. – David Raab Nov 06 '14 at 11:28
  • I never said you did, I simply said your answer *suggests* that it would by saying "*Every time you use `.Skip(x).Take(y)` the **whole IEnumerable expression get re-evaluated**.*". I agree, if the scenario the OP wants is just a list of items to be split all in one go then I 100% agree that `Skip` / `Take` isn't the right approach. If, however, they want a *paginated* query i.e. I want to pull 100 and then maybe a few seconds later pull another 100 etc. then `Skip` / `Take` is the right way to go about it. Unfortunately the OP hasn't clarified *exactly* what they are after. – James Nov 06 '14 at 11:32
  • **I simply said your answer suggests that it would by saying** => Yes, and I said your critique is correct and I will correct my answer!? **If, however, they want a paginated query** => If he only wants pagination, you don't even need a `for` loop. He didn't just want to fetch some data in between, he wanted to slice _everything_ into multiple chunks. So the assumption is that he will also evaluate every query. And doing that in a `for` loop with multiple `Skip` `Take` is just slow. If he just needs a specific page he could do `.Skip(pageNumber * pageSize).Take(pageSize)` without a `for` loop – David Raab Nov 06 '14 at 11:50
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/64390/discussion-between-sid-burn-and-james). – David Raab Nov 06 '14 at 11:56