I have been reading quite a lot about the parallel programming features in .NET 4, and I have to say that I am a bit confused about when to use them.
This is my common scenario: I have been given a task to migrate lots of XML files to a database. Typically I have to:
- Read the XML files (100,000 and more) and order them numerically (each file is named 1.xml, 2.xml, etc.).
- Save to a database.
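The numeric ordering I mean is roughly this (just a sketch of the idea; it assumes every file name really is "&lt;number&gt;.xml"):

```csharp
// Order 1.xml, 2.xml, ..., 100000.xml numerically rather than
// lexically (lexical ordering would give 1, 10, 100, 1000, ...).
FileInfo[] files = myDirectory.EnumerateFiles("*.xml")
    .OrderBy(f => int.Parse(Path.GetFileNameWithoutExtension(f.Name)))
    .ToArray();
```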
I thought the above was a perfect candidate for parallel programming. Conceptually I would like to process many files at a time.
I am currently doing this:
private ResultEventArgs progressResults = new ResultEventArgs();

public void ExecuteInParallelTest()
{
    var sw = new Stopwatch();
    sw.Start();
    int index = 0;
    cancelToken = new CancellationTokenSource();
    var parOpts = new ParallelOptions();
    parOpts.CancellationToken = cancelToken.Token;
    parOpts.MaxDegreeOfParallelism = Environment.ProcessorCount; // Is this correct?
    FileInfo[] files = myDirectory.EnumerateFiles("*.xml").ToArray(); // Is this faster?
    TotalFiles = files.Length;
    try
    {
        Task t1 = Task.Factory.StartNew(() =>
        {
            try
            {
                Parallel.ForEach(files, parOpts, (file, loopState) =>
                {
                    // Throws OperationCanceledException if cancellation was requested.
                    cancelToken.Token.ThrowIfCancellationRequested();

                    // Interlocked.Increment already updates 'index' atomically and
                    // returns the new value; assigning it back would be a race.
                    int current = Interlocked.Increment(ref index);
                    ProcessFile(file, current);
                    progressResults.Status = InProgress;
                    OnItemProcessed(TotalFiles, current /*, etc.. */);
                });
            }
            catch (OperationCanceledException)
            {
                OnOperationCancelled(new ResultEventArgs
                {
                    Status = InProgress,
                    TotalCount = TotalFiles,
                    FileProcessed = index
                    //etc..
                });
            }
            // ContinueWith is used to sync with the UI when the task completes.
        }, cancelToken.Token).ContinueWith(result => OnOperationCompleted(new ProcessResultEventArgs
        {
            Status = InProgress,
            TotalCount = TotalFiles,
            FileProcessed = index
            //etc..
        }), CancellationToken.None, TaskContinuationOptions.None, TaskScheduler.FromCurrentSynchronizationContext());
    }
    catch (AggregateException ae)
    {
        //TODO:
    }
}
My questions (I am using .NET 4.0):
- Is using Parallel the best/simplest way to speed up the processing of these files? Is the above pseudo-code good enough, or am I missing vital stuff, locking, etc.?
- The most important question: forgetting about ProcessFile, which I cannot optimize as I have no control over it, is there room for optimisation?
- Should I partition the files into chunks, e.g. 1-1000, 1001-2000, 2001-3000? Would that improve performance, and how do you do that?
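For the chunking idea, this is roughly what I had in mind (just a sketch, using Partitioner.Create to hand out index ranges over the same files array and parOpts as above; I have not tried it):

```csharp
// Sketch only: partition the files array into index ranges so each
// worker processes a whole chunk instead of taking one file at a time.
var rangePartitioner = Partitioner.Create(0, files.Length, 1000); // chunks of 1000

Parallel.ForEach(rangePartitioner, parOpts, range =>
{
    // range is a Tuple<int, int>: [Item1, Item2)
    for (int i = range.Item1; i < range.Item2; i++)
    {
        ProcessFile(files[i], i + 1);
    }
});
```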
Many thanks for any replies or links/code snippets that can help me understand better how I can improve the above code.