2

Afternoon,

I need a hand splitting over 1,200 results into "chunks" of 10 so I can process them with the Amazon MWS API. Can anyone provide any guidance on how I would go about doing this, please?

 List<string> prodASIN = dc.aboProducts.Select(a => a.asin).Take(10).ToList();

I currently have this, which works, but I have 1,200+ results and need to loop through them 10 at a time so I can process each batch and pass it over to the Amazon MWS API.

thatuxguy
5 Answers

2

I know the question is answered, but I can't withhold from you this little extension method I once made; it has served me well ever since.
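
The extension body itself is missing above, so here is a minimal sketch of what ToChunks could look like (the yield-based batching is an assumed reconstruction, not the original code):

using System.Collections.Generic;

public static class ChunkExtensions
{
    // Lazily splits a sequence into consecutive chunks of at most `size` items.
    public static IEnumerable<List<T>> ToChunks<T>(this IEnumerable<T> source, int size)
    {
        var chunk = new List<T>(size);
        foreach (var item in source)
        {
            chunk.Add(item);
            if (chunk.Count == size)
            {
                yield return chunk;
                chunk = new List<T>(size);
            }
        }
        if (chunk.Count > 0)
            yield return chunk; // the final, possibly smaller, chunk
    }
}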

You can do:

foreach(var list in prodASINs.ToChunks(10))
{
    // send list
}
Gert Arnold
1

Why not try something like:

//Load all the database entries into memory
List<string> prodASINs = dc.aboProducts.Select(a => a.asin).ToList();
var count = prodASINs.Count;
//Loop through, passing 10 at a time to AWS
for (var i = 0; i < count; i += 10)
{
    var prodASINToSend = prodASINs.Skip(i).Take(10).ToList();
    //Send to AWS
}

Or, if you don't want to load them all into memory:

var count = dc.aboProducts.Count();
for (var i = 0; i < count; i += 10)
{
    List<string> prodASIN = dc.aboProducts.OrderBy(a => a.Id).Select(a => a.asin).Skip(i).Take(10).ToList();
    //Send to AWS
}
DaveShaw
  • @thatuxguy You will probably need to Order the results if using the second option. – DaveShaw Jul 18 '12 at 13:38
  • 1st way is nicer :) Now to pass it to Amazon and see if it all works _facepalm_ I hate this API! – thatuxguy Jul 18 '12 at 13:42
  • The first solution could easily use GetRange(), avoiding the Skip/Take/ToList calls, and the second method still runs a separate query against the database for every chunk of 10 records (plus one for the count). But I agree the code is "nicer", looks pretty and uses fancy method calls like Skip/Take. Oh wait. I need to go, I see something shiny... – Chris Gessler Jul 23 '12 at 12:18
1

Sorry, this isn't LINQ-specific, but perhaps it will help...

One of the things I have done when working with MWS and ERP software is adding a control column to the database, something like "addedASIN". In the database I define the control column as a boolean value (or TINYINT(1) in MySQL), default the flag to 0 for all new entries, and set it to 1 once the entry has been added.

If you are able to do that, then you can do something like:

SELECT asin FROM datasource WHERE addedASIN = 0 LIMIT 10;

Then, once MWS returns success for the additions, update the flag using:

UPDATE datasource SET addedASIN = 1 WHERE asin = 'asinnumber';
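
Putting the two statements together, a minimal sketch of the batching loop this enables (assuming MySQL Connector/NET; SendToMws is a hypothetical placeholder for the actual MWS call):

using System.Collections.Generic;
using MySql.Data.MySqlClient; // assumes the MySQL Connector/NET package

static class AsinUploader
{
    // Hypothetical placeholder: submit one batch of up to 10 ASINs to MWS.
    static void SendToMws(List<string> batch) { /* MWS call goes here */ }

    public static void ProcessPending(string connectionString)
    {
        using (var conn = new MySqlConnection(connectionString))
        {
            conn.Open();
            while (true)
            {
                // Grab the next 10 unprocessed rows.
                var batch = new List<string>();
                using (var select = new MySqlCommand(
                    "SELECT asin FROM datasource WHERE addedASIN = 0 LIMIT 10", conn))
                using (var reader = select.ExecuteReader())
                {
                    while (reader.Read())
                        batch.Add(reader.GetString(0));
                }
                if (batch.Count == 0)
                    break; // nothing left to send

                SendToMws(batch);

                // Flag rows only after MWS reports success, so a dropped
                // connection repeats at most one batch.
                foreach (var asin in batch)
                {
                    using (var update = new MySqlCommand(
                        "UPDATE datasource SET addedASIN = 1 WHERE asin = @asin", conn))
                    {
                        update.Parameters.AddWithValue("@asin", asin);
                        update.ExecuteNonQuery();
                    }
                }
            }
        }
    }
}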

The benefit I have found with this is that processing can stop and restart with minimal repetition of data. For instance (and this is what prompted the control column in my case), our network connection can be flaky, so during order imports I would lose connectivity, resulting in lost orders or orders being uploaded to our system twice.

This solution has mitigated that: at most one order is added twice as a result of a connectivity loss, and for that to happen, connectivity needs to be lost between sending the data to our ERP system, the ERP system acknowledging the addition, and the database being updated, a round trip that takes approximately 30 seconds.

Robert H
  • Thanks for your feedback. Most of my DB has 'bit' fields so it knows when it updates something. Really wish the API was a bit simpler to use lol – thatuxguy Jul 18 '12 at 13:34
  • ha, I know what you mean. What about using the bit field to control the data you take, so you can limit your results to 10 at a time vs. returning your entire dataset in one go? It may take some additional processing, but I've found that sometimes the quickest way with MWS is the longest route.. – Robert H Jul 18 '12 at 13:36
  • I was going to use a datetime field, so if the datetime = today, ignore ;) that way the low prices should only get updated daily :) – thatuxguy Jul 18 '12 at 13:39
0

Slice Extension (for Arrays):

public static T[] Slice<T>(this T[] source, int index, int length)
{
    // Clamp the final slice so a trailing partial chunk doesn't throw.
    if (index + length > source.Length)
        length = source.Length - index;

    T[] slice = new T[length];
    Array.Copy(source, index, slice, 0, length);
    return slice;
}

Array.Copy is extremely fast, a lot faster than the Select/Skip/Take pattern. Although this method is not the fastest I've found, recent tests show that it's nearly 400 times faster than the Skip/Take pattern used to split Lists and Arrays.

To use it as is:

const int arraySize = 10;
List<string> listSource = whatever;
string[] source = listSource.ToArray();

for (int i = 0; i < source.Length; i += arraySize)
{
    List<string> buffer = source.Slice(i, arraySize).ToList();
    DoSomething(buffer);
}
Chris Gessler
-1

List<T> has a built-in method called GetRange(), which was made specifically for what you're trying to do. It's extremely fast and doesn't need Linq, casting, etc...

List<string> prodASINs = dc.aboProducts.Select(a => a.asin).ToList();

for (int i = 0; i < prodASINs.Count; i += 10)
{
    // Math.Min guards the final batch when the count isn't a multiple of 10.
    List<string> buffer = prodASINs.GetRange(i, Math.Min(10, prodASINs.Count - i));
    // do something with buffer
}

That's it. Very simple.


Test results: GetRange vs. Slice vs. Linq, with 5000 strings in a List<string>. As you can clearly see, the Skip/Take approach using Linq is over 383 times slower than Slice<T>() and roughly 4,736 times slower than GetRange().

==================================================================================

GetRange took on average 168 ticks
Slice took on average 2073 ticks
Linq took on average 795643 ticks

Test method used (try it yourself):

private static void GetRangeVsSliceVsLinq()
{
    List<string> stringList = new List<string>();
    for (int i = 0; i < 5000; i++)
    {
        stringList.Add("This is a test string " + i.ToString());
    }

    Stopwatch sw = new Stopwatch();

    long m1 = 0, m2 = 0, m3 = 0;


    for (int x = 0; x < 10; x++)
    {
        Console.WriteLine("Iteration {0}", x + 1);
        Console.WriteLine();

        sw.Reset();
        sw.Start();

        for (int i = 0; i < stringList.Count; i += 10)
        {
            List<string> buffer = stringList.GetRange(i, 10);
        }
        sw.Stop();
        Console.WriteLine("GetRange took {0} msecs", sw.ElapsedMilliseconds);
        Console.WriteLine("GetRange took {0} ticks", sw.ElapsedTicks);
        m1 += sw.ElapsedTicks;

        sw.Reset();
        sw.Start();

        string[] sliceArray = stringList.ToArray();
        for (int i = 0; i < sliceArray.Length; i += 10)
        {
            List<string> buffer = sliceArray.Slice(i, 10).ToList();
        }
        sw.Stop();
        Console.WriteLine("Slice took {0} msecs", sw.ElapsedMilliseconds);
        Console.WriteLine("Slice took {0} ticks", sw.ElapsedTicks);
        m2 += sw.ElapsedTicks;

        sw.Reset();
        sw.Start();

        var count = stringList.Count();
        // Note: this reproduces the Skip/Take pattern as posted above,
        // iterating once per element (5,000 passes) rather than once per chunk.
        for (var i = 0; i < count; i++)
        {
            var buffer = stringList.Skip(i * 10).Take(10).ToList();
        }

        sw.Stop();
        Console.WriteLine("Skip/Take took {0} msecs", sw.ElapsedMilliseconds);
        Console.WriteLine("Skip/Take took {0} ticks", sw.ElapsedTicks);
        m3 += sw.ElapsedTicks;

        Console.WriteLine();
    }

    Console.WriteLine();
    Console.WriteLine("GetRange took on average {0} ticks", m1 / 10);
    Console.WriteLine("Slice took on average {0} ticks", m2 / 10);
    Console.WriteLine("Linq took on average {0} ticks", m3 / 10);

}
Chris Gessler
  • "It's extremely fast and doesn't need Linq, casting, etc..." - which is great... if you're working with lists. However, considering that this linq may, for example, be working with a DB driver with fast 'skip' and the act of accessing data might be the bottleneck - in which case you example may perform poorly. I note this because I was down-voted without reason at the same time you decided to post. – NPSF3000 Jul 22 '12 at 14:20
  • "which is great... if you're working with lists" - `List prodASIN` is a list, which is what this answer addresses. If the bottleneck is in the database layer, any attempt to retrieve the data will exibit the same performance issues and the only way around it would be to retrieve the data asynchronously, for example with PLINQ. With Parallelism the blocks of data could be retrieved simultaneously and processed immediately as each block is returned (provided that the blocks can be processed out of order) or wait until all blocks are returned, which would still be faster. – Chris Gessler Jul 22 '12 at 15:26
  • "List prodASIN is a list," prodASIN is the *RESULT* of the query. For example: Let us suppose that one element is 100MB in size [all we know is that it contains a string] and stored in a db or similar. Your version would first attempt to load all 1200 results [120GB] first, then split them. Others, such as mine, would request the data in say 1GB blocks. That's not to say your solution is bad, but that it has it's limitations - ones that don't appear in the down-voted answers. – NPSF3000 Jul 22 '12 at 15:38
  • prodASIN is `List` which makes it a List. The OP could return something else, and since GetRange is limited to `List`, my solution would not work. Considering the OP is loading all results, I'm assuming each record is not 100GB or the question would be different and require a different approach to memory management. From what I gather, the requirement allows only 10 records at a time to be uploaded. Your answer addresses an issue which isn't the issue, using a method that's proven to be slower than other known solutions (i.e. [GroupBy](http://stackoverflow.com/a/6849364/555518)). – Chris Gessler Jul 22 '12 at 16:19
  • "prodASIN is List which makes it a List. " - prodASIN is the result. The question clearly shows that the data source being queried is "dc.aboProducts". While the GroupBy is an interesting solution - I'm pretty sure that it wouldn't work some some Linq drivers - e.g. Mongodb. – NPSF3000 Jul 22 '12 at 16:21
  • prodASIN is not the result, the result is an IEnumerable (unnamed var). prodASIN is the **RETURN** value of `ToList()` – Chris Gessler Jul 22 '12 at 16:42
  • "prodASIN is not the result, the result is an IEnumerable (unnamed var). prodASIN is the RETURN value of ToList() " Exactly, and the split occurs in the query, BEFORE THE TOLIST. Never-mind, obviously we are not going to come to an agreement. – NPSF3000 Jul 23 '12 at 03:47
  • Dude, I fully understand what your solution is doing, and the Skip/Take is last, which means 1000 results are coming back from the DB, you Take 100 of those, then Skip/Take 10 and convert it to an Array. Why not just convert the 1000 items coming back to a List, then use GetRange? Test it yourself if you want; my money would be on GetRange. – Chris Gessler Jul 23 '12 at 10:20
  • "Dude, I fully undersetand what your solution is doing and the Skip/Take is last, which means 1000 results are coming back from the DB". Nope - in my example only 100 items would be returned - and in 10 item increments. Why? Because *skip* and *take* are part of the query that gets send to the db! This flexibility is the whole point of LINQ! – NPSF3000 Jul 23 '12 at 10:39
  • @NPSF3000 - I tested your query in Linqpad and discovered that adding .Slice(10) executed the same query 21 times (select top 100 ...) to retrieve 100 records and split it. I think it would be more expensive to execute the same query 21 times than simply taking the result of the first top 100 and using GetRange. – Chris Gessler Jul 23 '12 at 12:01
  • " I think it would be more expensive to execute the same query 21 times than simply taking the result of the first top 100 and using GetRange" And that's where you're wrong - you've made an assumption that I clearly stated my version did not make. You *assume* that the query is expensive, but that the storing of objects is cheap. Go learn up on the basic concepts then get back to me. – NPSF3000 Jul 23 '12 at 12:14
  • Executing the same query 21 times to retrieve a different set within the same results is never efficient. It would be better to use Skip/Take against the first result (top 100), but at that point, and considering the 'chunks' need to be a List, it would be better to call ToList and use GetRange. – Chris Gessler Jul 23 '12 at 12:26
  • "Executing the same query 21 times to retrieve a different set within the same results is never efficient." Wrong. – NPSF3000 Jul 23 '12 at 12:59
  • Really? Running "select top 100 from table" 21 times is more efficient than running it once?? Ever heard of network latency and bandwidth? Maybe you need to rethink your career path. – Chris Gessler Jul 23 '12 at 13:04
  • Running say "select 10" 10 times is more efficient than, say, taking all 100 records that a) may not physically fit into memory, b) may not be needed for the computation - so on and so forth. There are many considerations, you've picked on one or two and have completely forgotten the others. – NPSF3000 Jul 23 '12 at 13:18
  • @NPSF3000 - `Running say "select 10" 10 times is more efficient than, say, taking all 100 records`. This is not entirely true taking into account overhead from network latency, SQL server, etc... but in some cases, it might work to your advantage, except your LINQ statement returns the TOP 100 21 times, not the TOP 10 10 times. Download Linqpad and test it yourself against a database (as I did), then click the "SQL" tab and it will show you the 21 queries it ran. – Chris Gessler Jul 23 '12 at 16:26
  • Then, if that's the limitation, spend two seconds to re-write the query from 'Take(100).Split(10)' to '.Split(10).Take(10)' or if that doesn't work create a 'limit' operator '.Split(10).Limit(10)' etc. I'm happy you tried it in linqpad... but I do remind you that there are many different drivers for many different DB's - the ones I work with don't even use SQL. – NPSF3000 Jul 24 '12 at 00:27
  • YOU should spend the two seconds to re-write YOUR query. And it's not a 'limitation', you simply wrote it wrong, or don't understand how Linq works. My solution doesn't use the Skip/Take approach. If I decided to do this at the DB level, I would not use dynamic queries, instead I would write a stored proc and page the results. – Chris Gessler Jul 24 '12 at 09:43
  • My DB doesn't use stored proc's - what I'd do is skip(x).take(y)... just like my code! – NPSF3000 Jul 24 '12 at 10:30
  • Then you need to learn how to keep the Linq expression from executing multiple times in your extension, because that's what is happening. Using Skip/Take in the Linq expression is fine if that's what you have to do, but doing it in an extension like you did causes the linq expression to execute over and over which is really bad. – Chris Gessler Jul 24 '12 at 10:34
  • Ahh, you and your assumptions about how things work. Good luck! – NPSF3000 Jul 24 '12 at 14:10
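
For readers following the thread above, a short sketch of the point being argued, using the question's identifiers. With an IQueryable provider (LINQ to SQL here), Skip and Take compose into the SQL sent to the server rather than running in memory; page and pageSize are illustrative values.

int page = 2, pageSize = 10; // illustrative values

List<string> chunk = dc.aboProducts
    .OrderBy(a => a.Id)     // paging needs a deterministic order
    .Skip(page * pageSize)  // composed into the generated SQL
    .Take(pageSize)         // e.g. TOP / ROW_NUMBER on SQL Server
    .Select(a => a.asin)
    .ToList();              // only these 10 rows are materialized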