I need to process a List<T> of thousands of elements.
First I group the elements by year and type, so I obtain a List<List<T>>. Then, for each inner List<T>, I add objects of type T until the maximum package size for that List<T> is reached; at that point I create a new package and continue in the same way.
I want to use a Parallel.ForEach loop.
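The grouping step itself is not the problem; it is essentially this (a sketch with plain lists just to show the shape, while my real code below uses concurrent collections and Dump.DumpDocuments is my source collection):

using System.Collections.Generic;
using System.Linq;

// Group the documents by Type and Year into a List<List<DumpDocument>>
List<List<DumpDocument>> groups = Dump.DumpDocuments
    .GroupBy(d => new { d.Type, d.Year })
    .Select(g => g.ToList())
    .ToList();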
My current implementation works well if I run it sequentially, but the logic is not thread safe and I want to change that.
I think the problem is in the inner Parallel.ForEach loop, at the point where the max size for the List<T> is reached and I assign a new List<T> to the same captured variable.
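To show the kind of race I mean: currentPackageSize is a captured variable that every parallel iteration reads and updates, and an unsynchronized read-modify-write like that loses updates; the check-and-reset of documentGroup is not atomic either. A minimal standalone sketch of the same pattern (the count is only for illustration):

using System;
using System.Threading.Tasks;

class RaceSketch
{
    static void Main()
    {
        long total = 0;

        // Unsynchronized read-modify-write on a shared variable from many threads:
        // updates get lost, so the result is usually well below 100000.
        Parallel.For(0, 100000, i =>
        {
            total += 1;
        });

        Console.WriteLine(total);
    }
}

Here is my current implementation: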
private ConcurrentBag<ConcurrentBag<DumpDocument>> InitializePackages()
{
    // Group by Type and Year
    ConcurrentBag<ConcurrentBag<DumpDocument>> groups = new ConcurrentBag<ConcurrentBag<DumpDocument>>(Dump.DumpDocuments
        .GroupBy(d => new { d.Type, d.Year })
        .Select(g => new ConcurrentBag<DumpDocument>(g.ToList()))
        .ToList());

    // Documents lists with max package dimension
    ConcurrentBag<ConcurrentBag<DumpDocument>> documentGroups = new ConcurrentBag<ConcurrentBag<DumpDocument>>();

    foreach (ConcurrentBag<DumpDocument> group in groups)
    {
        long currentPackageSize = 0;
        ConcurrentBag<DumpDocument> documentGroup = new ConcurrentBag<DumpDocument>();

        ParallelOptions options = new ParallelOptions { MaxDegreeOfParallelism = Parameters.MaxDegreeOfParallelism };

        Parallel.ForEach(group, options, new Action<DumpDocument, ParallelLoopState>((DumpDocument document, ParallelLoopState state) =>
        {
            long currentDocumentSize = new FileInfo(document.FilePath).Length;

            // If MaxPackageSize = 0 then no splitting to apply and the process works well
            if (Parameters.MaxPackageSize > 0 && currentPackageSize + currentDocumentSize > Parameters.MaxPackageSize)
            {
                documentGroups.Add(documentGroup);
                // Here's the problem!
                documentGroup = new ConcurrentBag<DumpDocument>();
                currentPackageSize = 0;
            }

            documentGroup.Add(document);
            currentPackageSize += currentDocumentSize;
        }));

        if (documentGroup.Count > 0)
            documentGroups.Add(documentGroup);
    }

    return documentGroups;
}
public class DumpDocument
{
    public string Id { get; set; }
    public long Type { get; set; }
    public string MimeType { get; set; }
    public int Year { get; set; }
    public string FilePath { get; set; }
}
My operation is actually quite simple: inside the parallel loop I only need to get the file size using:
long currentDocumentSize = new FileInfo(document.FilePath).Length;
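One idea I am considering is to read the file sizes in parallel first and then do the packing sequentially on the precomputed sizes. This is only a sketch (it would sit inside the outer foreach over the groups, and it assumes Parameters.MaxDegreeOfParallelism is a positive value accepted by WithDegreeOfParallelism):

using System.Collections.Generic;
using System.IO;
using System.Linq;

// Parallelize only the expensive part (reading the file size),
// then build the size-limited packages sequentially.
var sizedDocuments = group
    .AsParallel()
    .WithDegreeOfParallelism(Parameters.MaxDegreeOfParallelism)
    .Select(document => new
    {
        Document = document,
        Size = new FileInfo(document.FilePath).Length
    })
    .ToList();

var packages = new List<List<DumpDocument>>();
var currentPackage = new List<DumpDocument>();
long currentPackageSize = 0;

foreach (var item in sizedDocuments)
{
    // Close the current package when adding this document would exceed the limit
    if (Parameters.MaxPackageSize > 0 && currentPackageSize + item.Size > Parameters.MaxPackageSize)
    {
        packages.Add(currentPackage);
        currentPackage = new List<DumpDocument>();
        currentPackageSize = 0;
    }

    currentPackage.Add(item.Document);
    currentPackageSize += item.Size;
}

if (currentPackage.Count > 0)
    packages.Add(currentPackage);

But I'm not sure whether moving the packing out of the parallel loop like this is the right approach, or whether there is a better way to keep the whole thing parallel.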
I have also read that I could use a Partitioner, but I've never used one and in any case it's not my priority at the moment.
I have already read this question, which is similar but doesn't solve my problem with the inner loop.
UPDATE 28/12/2016
I updated the code to meet verification requirements.