
I'm trying to improve the runtime of some data processing I'm doing. The data starts out as various collections (Dictionary mostly, but a few other IEnumerable types), and the end result of processing should be a Dictionary<DataType, List<DataPoint>>.

I have all this working just fine... except it takes close to an hour to run, and it needs to run in under 20 minutes. No item in a collection depends on any other item from the same collection, although items frequently cross-reference other collections, so I figured I should parallelize it.

The main structure of the processing has two levels of loops with some processing in between:

// Custom class, 0.01%
var primaryData = GETPRIMARY().ToDictionary(x => x.ID, x => x);

// Custom class, 11.30%
var data1 = GETDATAONE().GroupBy(x => x.Category)
                        .ToDictionary(x => x.Key, x => x);  

// DataRows, 8.19%
var data2 = GETDATATWO().GroupBy(x => x.Type)
                        .ToDictionary(x => x.Key, x => x.OrderBy(y => y.ID));

foreach (var key in listOfKeys)
{
   // 0.01%
   var subData1 = data1[key].ToDictionary(x => x.ID, x => x);

   // 1.99%
   var subData2 = data2.GroupBy(x => x.ID)
                       .Where(x => primaryData.ContainsKey(x.Key))
                       .ToDictionary(x => x.Key, x => ProcessDataTwo(x, primaryData[x.Key]));

   // 0.70%
   var grouped = primaryData.Select(x => new { ID = x.Key, 
                                               Data1 = subData1[x.Key],
                                               Data2 = subData2[x.Key] }).ToList();
   foreach (var item in grouped)
   {
       // 62.12%
       item.Data1.Results = new Results(item.ID, item.Data2);
       // 12.37%
       item.Data1.Status = new Status(item.ID, item.Data2);
   }
   results.Add(key, grouped);
}
return results;

listOfKeys is very small, but each grouped will have several thousand items. How can I structure this so that the work inside the inner foreach (the Results and Status constructions, which I've been abstracting as item.Data1.Process(item.Data2)) can get queued up and executed in parallel?

According to my profiler, all the ToDictionary() calls together take up about 21% of the time, the ToList() takes 0.7%, and the two statements inside the inner foreach together take up 74%, which is why I'm focusing my optimization there.

I don't know if I should use Parallel.ForEach() to replace the outer foreach, the inner one, both, or if there's some other structure I should use. I'm also not sure if there's anything I can do to the data (or the structures holding it) to improve parallel access to it.

(Note that I'm stuck on .NET 4, so I don't have access to async or await.)

Bobson
  • Also, if you are starting out doing parallel programming I recommend reading the free e-book from Microsoft, "[Patterns of Parallel Programming](http://www.microsoft.com/en-us/download/details.aspx?id=19222)"; it goes into a good bit of detail on common pitfalls like doing too small or too large a unit of work (pages 26-28). – Scott Chamberlain Jan 08 '14 at 22:05
  • @ScottChamberlain - I have run a profiler, and the biggest block, by far, is `item.Data1.Process()`. Inside there, there's little I can optimize - I've already trimmed/cached what I can. There's actually two steps inside the inner `foreach` in the real code, and together they take up 74% of the time this takes to run. – Bobson Jan 08 '14 at 22:14
  • That being said, I don't know *how many* times it gets called, so the high numbers are at least partly a function of the sheer number of calls to it. I'll definitely give the Ebook a read, too. Thanks! – Bobson Jan 08 '14 at 22:20
  • Well it would seem Parallel.foreach on the outer loop has got to be worth a try then. – Tony Hopkinson Jan 08 '14 at 22:21
  • @TonyHopkinson - That's actually been running as I wrote the question. But it's been 25 minutes so far, which is actually *longer* than the single-threaded code took, so I'm questioning that result. I'll rerun the tests and add results. – Bobson Jan 08 '14 at 22:24
  • It would be really handy to know what the data types are of the various containers. – spender Jan 08 '14 at 22:32
  • @spender - `Dictionary`, `Dictionary>`, and `Dictionary>`. `listOfKeys` is an `IEnumerable`. – Bobson Jan 08 '14 at 22:41
  • I said it was worth a try, I didn't say it would work. :) Too hard to do this blind. The line between paralysing and parallelising code is thin, very thin. :( For instance, you mentioned they cross-reference other collections frequently; if that's each parallel thread referencing the same collections, instant bottleneck. You might be better off inverting the process and asking what data structures you would need to successfully parallelise this. – Tony Hopkinson Jan 08 '14 at 22:41
  • @TonyHopkinson - Read access is blocking? That's something I didn't realize. I'd happily accept an answer based on alternate data structures. I can even delve into how the classes relate if necessary; I just figured that was a higher-level problem than this. – Bobson Jan 08 '14 at 22:44
  • In the profiler, what percentage of the time does the `subData1.Select(...)` take? After I know that I can formulate a full answer for you; there are two approaches to take depending on whether that step is slow or not. EDIT: You also said you are stuck on .NET 4 - do you have access to Visual Studio 2012, or are you on 2010? If you are on 2012 you can add async/await to .NET 4.0 via [Microsoft.Bcl.Async](https://www.nuget.org/packages/Microsoft.Bcl.Async). – Scott Chamberlain Jan 08 '14 at 22:55
  • @ScottChamberlain - There's actually a `.ToList()` after it in the code, which I missed when outlining it here. That shows up as 0.70%. There's also a `.ToDictionary()` in there which feeds data into it, which clocks in at about 10%. *EDIT:* We're on VS2010, but that package might enable me to talk my boss into upgrading... – Bobson Jan 08 '14 at 23:02
  • Looks like all the data structures you are using are safe, as long as you don't do anything that would add to, remove from, or reorder them - the common ones, that is. Any you create internally will be okay to do what you like with. One of the classic parallelisation failures is x threads all competing for the same resource at the same time, so constantly locking each other out. – Tony Hopkinson Jan 08 '14 at 23:02
  • @ScottChamberlain - Yep. – Bobson Jan 08 '14 at 23:04
  • @ScottChamberlain - Done. I've brought it as close to the actual code as I can meaningfully, and added the percentages. – Bobson Jan 08 '14 at 23:25
  • Hmm. So actually, the bulk of your time is spent in the inner loop. Unless you can figure out a way of reducing the number of items you process or optimize the `Results` constructor, I can't see a saving to be gained here. After all, the end-result requires a certain number of processed items in a dictionary, and the cost is constructing a Results object for each of them. – spender Jan 08 '14 at 23:31
  • @spender - I expect to spend the same amount of *computational* time, but by threading it I should be able to make it happen in terms of real-world time. Hopefully. – Bobson Jan 08 '14 at 23:35

2 Answers


EDIT

Given the time measurements provided after I wrote this answer, it appears that this approach was looking for savings in the wrong places. I'll leave my answer as a warning against optimization without measurement!!!


So, because of the nesting in your approach, you are causing some unnecessary over-iteration of some of your collections, leading to rather nasty big-O characteristics.

This can be mitigated by using the ILookup interface to pre-group collections by a key, and then using these lookups instead of repeated and expensive Where clauses.
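
To illustrate the difference (with a hypothetical Item type, since the real element types aren't shown in the question): a Where clause re-scans the whole collection for every key you query, whereas a lookup is built once and then indexed cheaply, returning an empty sequence for missing keys instead of throwing:

// O(n) scan of allItems, repeated for every key you query
var viaWhere = allItems.Where(x => x.Type == key);

// Built once in O(n); each byType[key] afterwards is effectively O(1),
// and a missing key yields an empty sequence rather than an exception
ILookup<string, Item> byType = allItems.ToLookup(x => x.Type);
var viaLookup = byType[key];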

I've had a stab at re-imagining your code to reduce complexity (but it is somewhat abstract):

var data2Lookup = data2.ToLookup(x => x.Type);
var tmp1 = 
    listOfKeys
        .Select(key => 
            new {
                key, 
                subData1 = data1[key], 
                subData2 = data2Lookup[key].GroupBy(x=>x.Category)
            })
        .Select(x => 
            new{
                x.key, 
                x.subData1, 
                x.subData2, 
                subData2Lookup = x.subData2.ToLookup(y => y.Key)});
var tmp2 = 
    tmp1
        .Select(x => 
            new{
                x.key, 
                grouped = x.subData1
                            .Select(sd1 => 
                                new{
                                    Data1 = sd1, 
                                    Data2 = x.subData2Lookup[sd1.ID]
                                })
            });
var result =
    tmp2
        .ToDictionary(x => x.key, x => x.grouped);

It seems to me that the processing is somewhat arbitrarily placed midway through the building of results, but moving it to the end shouldn't affect anything, right?

So once results is built, let's process it...

var items = result.SelectMany(kvp => kvp.Value);
foreach (var item in items)
{
    item.Data1.Process(item.Data2);
}

EDIT

I've deliberately avoided going parallel for the time being, so if you can get this working, there might be further speedup to be had by adding a bit of parallel magic.
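
For what it's worth, a minimal sketch of that parallel magic over the final loop, assuming each Process call is independent and CPU-bound (which the question suggests, since items in a collection don't depend on each other):

var items = result.SelectMany(kvp => kvp.Value);
Parallel.ForEach(items, item =>
{
    // Safe only if Process touches no shared mutable state across items
    item.Data1.Process(item.Data2);
});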

spender
  • I've added the profiler percentages. Took me a bit to figure out which `ToDictionary` call was which. Does that affect this at all? – Bobson Jan 08 '14 at 23:28
  • I wrote a comment to your actual question. – spender Jan 08 '14 at 23:32
  • +1 anyway, because it's good advice, and it certainly *could* have been the solution to my problem, since I hadn't originally mentioned I'd profiled it. – Bobson Jan 09 '14 at 01:02

Based on the percentages you posted, and given that you said grouped will be very large, you would definitely benefit from parallelizing only the inner loop.

Doing so is fairly simple:

   var grouped = primaryData.Select(x => new { ID = x.Key, 
                                               Data1 = subData1[x.Key],
                                               Data2 = subData2[x.Key] }).ToList();
   Parallel.ForEach(grouped, (item) => 
   {
       item.Data1.Results = new Results(item.ID, item.Data2);
       item.Data1.Status = new Status(item.ID, item.Data2);
   });

   results.Add(key, grouped);

This assumes that new Results(item.ID, item.Data2) and new Status(item.ID, item.Data2) are safe to run concurrently (the only concern I would have is if they access non-thread-safe static resources internally, and even then a non-thread-safe constructor is a really bad design flaw).
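
To make that concrete, here is a hypothetical sketch (not your actual Results class) of the kind of hidden static state that would make a constructor unsafe, along with the thread-safe alternative that .NET 4 already provides:

public class Results
{
    // BAD: Dictionary<,> is not safe for concurrent writes; under
    // Parallel.ForEach the check-then-add below is a race condition
    private static readonly Dictionary<int, decimal> cache =
        new Dictionary<int, decimal>();

    // BETTER: ConcurrentDictionary (in System.Collections.Concurrent,
    // available in .NET 4) makes the same caching pattern safe
    private static readonly ConcurrentDictionary<int, decimal> safeCache =
        new ConcurrentDictionary<int, decimal>();

    public Results(int id, IEnumerable<DataPoint> data)
    {
        // Unsafe: two threads can both pass ContainsKey and both call Add
        if (!cache.ContainsKey(id))
            cache.Add(id, Compute(data));

        // Safe: atomic lookup-or-insert
        var value = safeCache.GetOrAdd(id, _ => Compute(data));
    }

    private static decimal Compute(IEnumerable<DataPoint> data)
    {
        return 0; // placeholder for the real calculation
    }
}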


There is one big caveat: this will only help if you are CPU bound. If Results or Status is IO bound (for example, waiting on a database call or a file on the hard drive), doing this will hurt your performance instead of helping it. If you are IO bound instead of CPU bound, the only options are to buy faster hardware, attempt to optimize those two methods more, or use caching in memory if possible so you don't need to do the slow IO.
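
If you do go parallel and run into memory pressure or thread oversubscription, note that Parallel.ForEach also accepts a ParallelOptions that caps the number of concurrent workers - a sketch of the same loop with that knob applied:

   var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };
   Parallel.ForEach(grouped, options, (item) =>
   {
       item.Data1.Results = new Results(item.ID, item.Data2);
       item.Data1.Status = new Status(item.ID, item.Data2);
   });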

Scott Chamberlain
  • Hmm. I tried this, and I got out-of-memory errors from the threads. Which is interesting, since I don't get them when it's single-threaded. – Bobson Jan 09 '14 at 00:47
  • Ah ha! I recommend looking at [this old question of mine](http://stackoverflow.com/questions/6977218/parallel-foreach-can-cause-a-out-of-memory-exception-if-working-with-a-enumera) where I ran into the same issue with `Parallel.ForEach`. Try setting `MaxDegreeOfParallelism = Environment.ProcessorCount`. If you are still having problems, try writing a custom partitioner to take smaller partitions (there is a sample one linked in the second answer of my linked question). – Scott Chamberlain Jan 09 '14 at 00:48
  • Interesting. And this is why I came to SO to post the question rather than just trial-and-error. I'll add that and give it a shot - I *thought* it was CPU-limited, but it certainly may not be. I'll let you know! – Bobson Jan 09 '14 at 00:52
  • Definitely an improvement. Only ~25 minutes to run the whole thing now, instead of 50. Still room for improvement, but I don't think more parallelization will help. Back to the profiler! – Bobson Jan 09 '14 at 18:27