
I have been reading quite a lot about the .NET 4 parallel libraries, and I have to say that I am a bit confused about when to use them.

This is my common scenario: I have been given a task to migrate lots of XML files to a database.

Typically I have to:

  1. Read XML files (100,000 and more) and order them numerically (each file is named 1.xml, 2.xml, etc.).
  2. Save them to a database.
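A minimal sketch of step 1, assuming (as the question implies) that every file name is just a number followed by `.xml`; `myDirectory` is the `DirectoryInfo` used in the code further down:

```csharp
// Order numerically, not alphabetically: alphabetical order would give
// 1.xml, 10.xml, 100.xml, 2.xml, ...
FileInfo[] ordered = myDirectory
    .EnumerateFiles("*.xml")
    .OrderBy(f => int.Parse(Path.GetFileNameWithoutExtension(f.Name)))
    .ToArray();
```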

I thought the above was a perfect candidate for parallel programming.

Conceptually I would like to process many files at a time.

I am currently doing this:

private ResultEventArgs progressResults = new ResultEventArgs();

public void ExecuteInParallelTest()
{
    var sw = new Stopwatch();
    sw.Start();
    int index = 0;
    cancelToken = new CancellationTokenSource();
    var parOpts = new ParallelOptions();
    parOpts.CancellationToken = cancelToken.Token;
    parOpts.MaxDegreeOfParallelism = Environment.ProcessorCount;  // Is this correct?

    FileInfo[] files = myDirectory.EnumerateFiles("*.xml").ToArray(); // Is this faster?
    TotalFiles = files.Length; // Length on the array; Count() would enumerate again
    try
    {
        Task t1 = Task.Factory.StartNew(() =>
        {
            try
            {
                Parallel.ForEach(files, parOpts, (file, loopState) =>
                {
                    cancelToken.Token.ThrowIfCancellationRequested();

                    // Interlocked.Increment already returns the new value;
                    // assigning it back to index would be a race.
                    int current = Interlocked.Increment(ref index);

                    ProcessFile(file, current);

                    progressResults.Status = Status.InProgress;
                    OnItemProcessed(TotalFiles, current /*, etc. */);
                });
            }
            catch (OperationCanceledException)
            {
                OnOperationCancelled(new ResultEventArgs
                {
                    Status = Status.Cancelled,
                    TotalCount = TotalFiles,
                    FileProcessed = index
                    // etc.
                });
            }
            // ContinueWith is used to sync with the UI when the task has completed.
        }, cancelToken.Token).ContinueWith(result => OnOperationCompleted(new ProcessResultEventArgs
        {
            Status = Status.Completed,
            TotalCount = TotalFiles,
            FileProcessed = index
            // etc.
        }), CancellationToken.None, TaskContinuationOptions.None, TaskScheduler.FromCurrentSynchronizationContext());
    }
    catch (AggregateException ae)
    {
        // TODO: note that exceptions thrown inside the task are NOT caught here;
        // they only surface on t1.Wait()/t1.Result or in a continuation.
    }
}

My questions: I am using .NET 4.0. Is using Parallel the best/simplest way to speed up the processing of these files? Is the above pseudocode good enough, or am I missing vital stuff, locking, etc.?

The most important question is: forgetting about "ProcessFile", which I cannot optimize as I have no control over it, is there room for optimisation?

Should I partition the files into chunks, e.g. 1-1000, 1001-2000, 2001-3000? Would that improve performance, and how do you do that?
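For reference, explicit range partitioning along those lines can be sketched with `Partitioner.Create` (this is my own sketch against the `files` array from the code above; the chunk size of 1000 is arbitrary):

```csharp
// Parallel.ForEach already batches its source internally, so hand-rolled
// chunking is rarely needed, but Partitioner.Create lets you fix the
// chunk size explicitly: [0,1000), [1000,2000), [2000,3000), ...
var ranges = Partitioner.Create(0, files.Length, 1000);
Parallel.ForEach(ranges, range =>
{
    for (int i = range.Item1; i < range.Item2; i++)
        ProcessFile(files[i], i + 1); // ProcessFile as in the question
});
```

Each worker thread then grabs one whole range at a time instead of individual items, which reduces synchronization overhead when the per-item work is small.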

Many thanks for any replies or link/code snippet that can help me understand better how I can improve the above code.

user9969
  • I would suggest pipelining this process; see [this SO post](http://stackoverflow.com/a/9895150/485076) – sll Jan 30 '13 at 11:27
  • I also would consider not using threading when you have IO operations. Instead use the Async CTP and await, which frees you from unnecessary threads. Have a look at this great webcast http://channel9.msdn.com/Shows/AppFabric-tv/AppFabrictv-Threading-with-Jeff-Richter – Boas Enkler Feb 04 '13 at 12:33
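The pipeline suggested in the comments can be sketched with a `BlockingCollection`: parsing runs on several threads while a single consumer does the database writes, so the DB connection is never shared (`Parse`, `SaveToDatabase` and `ParsedRecord` are hypothetical placeholders, not names from the question):

```csharp
// Bounded queue so parsing cannot run arbitrarily far ahead of the DB writes.
var queue = new BlockingCollection<ParsedRecord>(boundedCapacity: 100);

var producer = Task.Factory.StartNew(() =>
{
    try
    {
        // CPU-bound stage: parse many files at a time.
        Parallel.ForEach(files, file => queue.Add(Parse(file)));
    }
    finally
    {
        queue.CompleteAdding(); // unblocks the consumer when parsing is done
    }
});

var consumer = Task.Factory.StartNew(() =>
{
    // IO-bound stage: a single writer drains the queue.
    foreach (var record in queue.GetConsumingEnumerable())
        SaveToDatabase(record);
});

Task.WaitAll(producer, consumer);
```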

2 Answers


The reason you are not receiving responses is that your code is so badly wrong. AsParallel() does nothing for GetFiles(). files.Count() actually iterates the enumerable, so not only do you read the files (or at least the directory) twice: doing Count() first and then iterating later could also produce inconsistent counts if the directory is modified in between. It does not look necessary to use Task.Factory.StartNew, since it is your only task (which spawns the parallel processing inside it). Finally, Parallel.ForEach wraps all OperationCanceledExceptions into a single AggregateException, and it only does so after all parallel threads have finished their work.
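To make the double-enumeration point concrete, here is one way to be sure the directory is only read once (a sketch of mine, not Andrei's code; `myDirectory` is from the question):

```csharp
// Materialize the listing once; after ToArray() nothing touches the disk again.
FileInfo[] files = myDirectory.EnumerateFiles("*.xml").ToArray();
int total = files.Length; // no second pass over the directory
```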

Andrei
  • @Andrej tanas Hi, thanks for your comment!! Very valuable. That is why I posted the question, to get feedback. Could you provide a code snippet of how you would refactor the code? I am a bit confused about some of your comments and how I would address the issues. For starters, I need the total count for reporting. As for the parallel code, how would you improve it? Thanks – user9969 Feb 01 '13 at 20:56
  • @Andrej also, what I find intriguing in your answer is this: you are saying that Count() iterates again. So how do I avoid iterating? Also you mention that GetFiles.AsParallel does nothing. Why? In my get files there is "directoryInfo.EnumerateFiles(pattern).ToArray();" – user9969 Feb 01 '13 at 21:13
  • See this: [link](http://stackoverflow.com/questions/168901/howto-count-the-items-from-a-ienumerablet-without-iterating) regarding the IEnumerable.Count() extension method. If you are using Directory.GetFiles(), just don't use the Count() method; use the Length property of the returned string array. – Andrei Feb 02 '13 at 22:34
  • A good explanation of how AsParallel() should be used can be found here: [link](http://stackoverflow.com/questions/3789998/parallel-foreach-vs-foreachienumerablet-asparallel) – Andrei Feb 02 '13 at 22:36
  • @Andrej thanks for the link! But it is not really something I did not know. Just to keep things in perspective: the iteration of the files when I get them, and the count after that, is a flash, negligible, and that is on 100,000 files. I have edited my code so that you can see it in full. I use EnumerateFiles, not GetFiles, and I was using Count() rather than Length; I can change to Length. I have reflected on the MS code: if there is a count it will return it, otherwise it iterates. – user9969 Feb 03 '13 at 08:05
  • @Andrej now, regarding the main parallel function, I am sorry but I don't understand what's wrong. Can you provide a code snippet, rather than a link, of how you would do it? The operation is simple: get some files and modify them! Task.Factory.StartNew is needed to start on a new thread, otherwise the UI WILL FREEZE. I need code snippet improvements, not a link with no substance. Thanks – user9969 Feb 03 '13 at 08:10

I left the code as it is, as nobody provided me with a suitable answer.

user9969