
I've been investigating the TPL as a means of quickly generating a large volume of files. I have about 10 million rows in a database (events, which belong to patients) that I want to output, one event per text file, to the location d:\EVENTS\PATIENTID\EVENTID.txt.

I'm using two nested Parallel.ForEach loops: in the outer loop a list of patients is retrieved, and in the inner loop the events for each patient are retrieved and written to files.

This is the code I'm using; it's pretty rough at the moment, as I'm just trying to get things working.

DataSet1TableAdapters.GetPatientsTableAdapter ta = new DataSet1TableAdapters.GetPatientsTableAdapter();
List<DataSet1.GetPatientsRow> Pats = ta.GetData().ToList();

List<DataSet1.GetPatientEventsRow> events = null;

string patientDir = null;

System.IO.DirectoryInfo di = new DirectoryInfo(txtAllEventsPath.Text);
di.GetDirectories().AsParallel().ForAll((f) => f.Delete(true));

//get at the patients
Parallel.ForEach(Pats
        , new ParallelOptions() { MaxDegreeOfParallelism = 8 }
        , patient =>
{
    patientDir = "D:\\Events\\" + patient.patientID.ToString();

    //Output directory
    Directory.CreateDirectory(patientDir);
    events = new DataSet1TableAdapters.GetPatientEventsTableAdapter().GetData(patient.patientID).ToList();


    if (Directory.Exists(patientDir))
    {
        Parallel.ForEach(events.AsEnumerable()
            , new ParallelOptions() { MaxDegreeOfParallelism = 8 }
            , ev =>
            {
                List<DataSet1.GetAllEventRow> anEvent =
                    new DataSet1TableAdapters.GetAllEventTableAdapter().GetData(ev.EventID).ToList();
                File.WriteAllText(patientDir + "\\" + ev.EventID.ToString() + ".txt", ev.EventData);
            });
    }

});

The code works very quickly but produces an error after a few seconds (in which time about 6,000 files are created). The error is one of two types:

DirectoryNotFoundException: Could not find a part of the path 'D:\Events\PATIENTID\EVENTID.txt'.

Whenever this error is produced, the directory structure D:\Events\PATIENTID\ exists, as other files have been created within that directory. An if condition checks for the existence of D:\Events\PATIENTID\ before the second loop is entered.

The process cannot access the file 'D:\Events\PATIENTID\EVENTID.txt' because it is being used by another process.

When this error occurs, the indicated file sometimes exists and sometimes doesn't.

So, can anyone offer any advice as to why these errors are being produced? I don't understand either of them, and as far as I can see, it should just work (and indeed it does, for a short while).

supermeerkat
  • Parallelizing code can be useful if the code is fundamentally thread-safe and the bottleneck is CPU usage. Neither is true in your case. This code is I/O-bound by the disk drive, and you only have one. So you are not speeding up the code at all; you only pay the cost of risking threading race bugs. You are in fact slowing the code down, as disk drives don't like to be jerked around. How this can get seriously out of hand is pretty visible in [this Q+A](https://stackoverflow.com/questions/25907829/why-is-parallel-foreach-much-faster-then-asparallel-forall-even-though-msdn). – Hans Passant Apr 08 '18 at 13:01
  • Thanks for the comment! – supermeerkat Apr 08 '18 at 13:28

1 Answer


From MSDN:

Use the Parallel Loop pattern when you need to perform the same independent operation for each element of a collection or for a fixed number of iterations. The steps of a loop are independent if they don't write to memory locations or files that are read by other steps.

Parallel.For can speed up the processing of your rows through multithreading, but it comes with a caveat: if it is not used correctly, it ends in unexpected program behaviour, like the one you are seeing above.
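A minimal illustration of what "independent" means here (not the question's code): each iteration writes only its own array slot, so no step reads or writes a location touched by another step, and the loop is safe to parallelize.

```csharp
using System;
using System.Threading.Tasks;

class IndependentSteps
{
    static void Main()
    {
        // Each iteration touches only squares[i]; no two iterations share
        // a memory location, so the steps are independent.
        int[] squares = new int[10];
        Parallel.For(0, 10, i => squares[i] = i * i);

        Console.WriteLine(string.Join(",", squares)); // 0,1,4,9,16,25,36,49,64,81
    }
}
```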

The reason for the following error:

DirectoryNotFoundException: Could not find a part of the path 'D:\Events\PATIENTID\EVENTID.txt'.

can be that one thread goes to write while the directory is not there yet, and meanwhile another thread creates it. When doing parallelism there can be race conditions, because we are multi-threading, and if we don't use proper mechanisms like locks or monitors we end up with these kinds of issues.
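One concrete source of such a race in the question's code is that `patientDir` and `events` are declared once, outside the `Parallel.ForEach`, so every iteration overwrites the same two variables while other iterations are still using them. Below is a self-contained sketch of the outer loop with those variables made local to each iteration; the `PatientRow`/`EventRow` records and `GetEvents` are hypothetical stand-ins for the question's typed rows and table adapters.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class LocalsPerIteration
{
    // Hypothetical stand-ins for the question's DataSet rows and adapters.
    record PatientRow(int PatientId);
    record EventRow(int EventId, string EventData);

    static List<EventRow> GetEvents(int patientId) =>
        Enumerable.Range(1, 3).Select(i => new EventRow(patientId * 100 + i, "data")).ToList();

    static void Main()
    {
        var pats = Enumerable.Range(1, 4).Select(i => new PatientRow(i)).ToList();
        string root = Path.Combine(Path.GetTempPath(), "EventsDemo");

        Parallel.ForEach(pats, new ParallelOptions { MaxDegreeOfParallelism = 8 }, patient =>
        {
            // Declared inside the lambda: each iteration gets its own copies,
            // so no other thread can overwrite them between CreateDirectory
            // and WriteAllText.
            string patientDir = Path.Combine(root, patient.PatientId.ToString());
            Directory.CreateDirectory(patientDir); // no-op if it already exists
            var events = GetEvents(patient.PatientId);

            foreach (var ev in events)
                File.WriteAllText(Path.Combine(patientDir, ev.EventId + ".txt"), ev.EventData);
        });

        // 4 patients x 3 events each
        Console.WriteLine(Directory.GetFiles(root, "*.txt", SearchOption.AllDirectories).Length);
    }
}
```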

As you are writing files, multiple threads trying to write to the same file end up with the second error you are getting, i.e.:

The process cannot access the file 'D:\Events\PATIENTID\EVENTID.txt' because it is being used by another process.

as one thread is already writing to the file, so at that moment any other thread fails to open the same file for writing.
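A small deterministic sketch of that failure mode (not the question's code): one writer holds the file open with no sharing allowed, and a second `File.WriteAllText` on the same path then throws the same kind of `IOException`.

```csharp
using System;
using System.IO;

class FileInUseDemo
{
    static void Main()
    {
        string path = Path.Combine(Path.GetTempPath(), "event-demo.txt");

        // The first writer holds the file open and disallows sharing...
        using (new FileStream(path, FileMode.Create, FileAccess.Write, FileShare.None))
        {
            try
            {
                // ...so a second writer targeting the same path fails, just as
                // two parallel iterations writing the same EVENTID.txt would.
                File.WriteAllText(path, "second writer");
            }
            catch (IOException)
            {
                Console.WriteLine("IOException caught");
            }
        }
    }
}
```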

I would suggest using a normal loop instead of parallelism here.
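A sketch of that suggestion, again with hypothetical stand-ins for the question's table adapters: with a plain nested foreach only one iteration runs at a time, so no locking is needed and no two writers can contend for a directory or a file.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

class SequentialWriter
{
    // Hypothetical stand-ins for GetPatientsTableAdapter / GetPatientEventsTableAdapter.
    static List<int> GetPatientIds() => Enumerable.Range(1, 3).ToList();
    static List<(int EventId, string Data)> GetPatientEvents(int patientId) =>
        Enumerable.Range(1, 2).Select(i => (patientId * 10 + i, "data")).ToList();

    static void Main()
    {
        string root = Path.Combine(Path.GetTempPath(), "SequentialEventsDemo");

        // One iteration at a time: no shared-variable races, no contended files.
        foreach (int patientId in GetPatientIds())
        {
            string patientDir = Path.Combine(root, patientId.ToString());
            Directory.CreateDirectory(patientDir);

            foreach (var (eventId, data) in GetPatientEvents(patientId))
                File.WriteAllText(Path.Combine(patientDir, eventId + ".txt"), data);
        }

        // 3 patients x 2 events each
        Console.WriteLine(Directory.GetFiles(root, "*.txt", SearchOption.AllDirectories).Length);
    }
}
```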

Ehsan Sajjad
  • Thanks for taking the time to reply. So you're suggesting not using parallelism at all? My reason for using it in the first place was to speed up the document generation time. Might using a bunch of BackgroundWorkers, say eight at a time, be an alternative? – supermeerkat Apr 08 '18 at 13:26
  • Parallelism is helpful where we have totally independent tasks. In your case it looks like you write to the same file multiple times, so multiple threads would be writing to the same file; if so, then parallelism won't help here without locking. – Ehsan Sajjad Apr 08 '18 at 14:04
  • Ah! No I'm not writing to the same file multiple times - in the inner loop for the patient, I'm writing one file for each of the patient's events. – supermeerkat Apr 08 '18 at 15:48