I'm trying to improve the runtime of some data processing I'm doing. The data starts out as various collections (Dictionary
mostly, but a few other IEnumerable
types), and the end result of processing should be a Dictionary<DataType, List<DataPoint>>
.
I have all this working just fine... except it takes close to an hour to run, and it needs to run in under 20 minutes. None of the data has any connection to any other from the same collection, although they cross-reference other collections frequently, so I figured I should parallelize it.
The main structure of the processing has two levels of loops with some processing in between:
// Custom class, 0.01%
var primaryData= GETPRIMARY().ToDictionary(x => x.ID, x => x);
// Custom class, 11.30%
var data1 = GETDATAONE().GroupBy(x => x.Category)
.ToDictionary(x => x.Key, x => x);
// DataRows, 8.19%
var data2 = GETDATATWO().GroupBy(x => x.Type)
.ToDictionary(x => x.Key, x => x.OrderBy(y => y.ID));
foreach (var key in listOfKeys)
{
// 0.01%
var subData1 = data1[key].ToDictionary(x => x.ID, x => x);
// 1.99%
var subData2 = data2.GroupBy(x => x.ID)
.Where(x => primaryData.ContainsKey(x.Type))
.ToDictionary(x => x.Key, x => ProcessDataTwo(x, primaryData[x.Key]));
// 0.70%
var grouped = primaryData.Select(x => new { ID = x.Key,
Data1 = subData1[x.Key],
Data2 = subData2[x.Key] }).ToList();
foreach (var item in grouped)
{
// 62.12%
item.Data1.Results = new Results(item.ID, item.Data2);
// 12.37%
item.Data1.Status = new Status(item.ID, item.Data2);
}
results.Add(key, grouped);
}
return results;
listOfKeys
is very small, but each grouped
will have several thousand items. How can I structure this so that each call to item.Data1.Process(item.Data2)
can get queued up and executed in parallel?
According to my profiler, all the ToDictionary()
calls together take up about 21% of the time, the ToList()
takes up 0.7%, and the two items inside the inner foreach
together take up 74%. Hence why I'm focusing my optimization there.
I don't know if I should use Parallel.ForEach()
to replace the outer foreach
, the inner one, both, or if there's some other structure I should use. I'm also not sure if there's anything I can do to the data (or the structures holding it) to improve parallel access to it.
(Note that I'm stuck on .NET4, so don't have access to async
or await
)