
There's a productivity tool being used in our department. It extracts data from multiple Excel files, performs some data transformations, and exports the corresponding output as text files.

I managed to get a copy of the source code and investigated how the text files are generated. I found that the developer created multiple BackgroundWorkers, one for each report to be generated. It looks like this:

// Inside each worker's DoWork handler, the actual method that
// generates the corresponding text file is called.
bgWorkerGenerateTextReport_1.RunWorkerAsync();
bgWorkerGenerateTextReport_2.RunWorkerAsync();
bgWorkerGenerateTextReport_3.RunWorkerAsync();
bgWorkerGenerateTextReport_4.RunWorkerAsync();
bgWorkerGenerateTextReport_5.RunWorkerAsync();
bgWorkerGenerateTextReport_6.RunWorkerAsync();
// more bgWorkers follow...

When each BackgroundWorker completes, it makes the corresponding LinkLabel visible; the LinkLabel points to the location of the generated text file so the user can click it.
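
The completion handler looks roughly like this (the handler and control names here are approximate, not copied verbatim from the source):

// RunWorkerCompleted is raised on the UI thread, so it is safe to touch controls here.
private void bgWorkerGenerateTextReport_1_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
{
    if (e.Error == null)
        linkLabelTextReport_1.Visible = true; // lets the user open the generated file
}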

Some of the generated text files are very large (some contain almost a million rows and around 200 columns). Since I have access to the source code, I want to improve the tool.

First, I want to know what would be a better way to generate the text reports in parallel than declaring multiple BackgroundWorkers. I know the original approach works, but I'm wondering what a more elegant and proper approach would look like.

I tried directly calling the methods that generate the different reports, but the UI became unresponsive while processing the files.
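
A simplified version of what I tried (the generator method names are placeholders):

private void btnGenerateReports_Click(object sender, EventArgs e)
{
    // These calls run synchronously on the UI thread, so the form
    // stops repainting and responding until every one of them returns.
    GenerateTextReport1();
    GenerateTextReport2();
    GenerateTextReport3();
    // ...
}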

yonan2236
  • Maybe using threads? Instead of workers, threads could be more efficient. Still parallel tho. https://learn.microsoft.com/en-us/dotnet/api/system.threading.thread?view=net-5.0 – Sebastian Ciocarlan Aug 12 '21 at 15:30
  • `async/await`, `Task.Run`, and `IProgress/Progress` are the newer ways of doing background work in a UI application, but if you're still at the "things freeze if I call these methods directly" point, you might want to first properly understand what's there and how it works (a sketch of the `Task.Run`/`async/await` approach follows these comments) – canton7 Aug 12 '21 at 15:31
  • In general, BackgroundWorker is obsolete and should be replaced with async/await (see https://stackoverflow.com/questions/12414601/async-await-vs-backgroundworker). Perhaps you can use Parallel.For or other TPL constructs, BUT do read through https://learn.microsoft.com/en-us/dotnet/standard/parallel-programming/potential-pitfalls-in-data-and-task-parallelism: especially for work that is heavily IO-constrained, parallelism doesn't automatically mean faster – auburg Aug 12 '21 at 15:38
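
A minimal sketch of the `Task.Run` + `async/await` approach the comments suggest, with the completion work resuming on the UI thread afterwards; every method and control name below is hypothetical (requires `using System.Linq;` and `using System.Threading.Tasks;`):

private async void btnGenerateReports_Click(object sender, EventArgs e)
{
    // Pair each (placeholder) generator method with its LinkLabel.
    var jobs = new (Action Generate, LinkLabel Link)[]
    {
        (GenerateTextReport1, linkLabelTextReport_1),
        (GenerateTextReport2, linkLabelTextReport_2),
        (GenerateTextReport3, linkLabelTextReport_3),
        // ...
    };

    var tasks = jobs.Select(async job =>
    {
        await Task.Run(job.Generate); // heavy work runs on the thread pool
        job.Link.Visible = true;      // the await resumes on the UI thread
    });

    await Task.WhenAll(tasks);        // the UI stays responsive throughout
}

Because `await` in a UI event handler captures the UI `SynchronizationContext`, the `LinkLabel` can be made visible after `Task.Run` completes without calling `Invoke`.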

1 Answer


Before attempting any optimization, I would benchmark the current IO and CPU utilization throughout the entire runtime of this operation. If those are close to saturation the whole time, no other tuning is likely to have a significant impact.

To make this process run as fast as possible (if that's the goal), you want to optimize the use of each resource involved. The downside is that anything else running on the same hardware may experience significant delays.

When doing this type of processing, I tend to use a Producer/Consumer pattern.

You might investigate having a multi-threaded producer that reads the files and feeds the data to a multi-threaded consumer that performs the transformation, which in turn feeds a multi-threaded consumer that writes the results.

Read Data -> Transform Data -> Write Data

The number of threads in each layer should be tuned based on performance measurements. This allows you to tune your data transformation pipeline to make optimal use of available IO and CPU resources.

Channels are often (but not always) the best choice in .NET to create this type of pipeline.
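As a rough illustration, here is what such a pipeline could look like with `System.Threading.Channels`; the row type (plain strings), the transform, the channel capacities, and the worker count are all placeholders to be tuned by measurement:

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Threading.Channels;
using System.Threading.Tasks;

static class ReportPipeline
{
    public static async Task RunAsync(IEnumerable<string> inputFiles, string outputFile)
    {
        // Bounded channels apply back-pressure: a fast stage blocks
        // instead of buffering unbounded amounts of data in memory.
        var rawRows = Channel.CreateBounded<string>(10_000);
        var outRows = Channel.CreateBounded<string>(10_000);

        // Stage 1: read (IO-bound). One producer here, but this could
        // just as well be several tasks, one per input file.
        var readStage = Task.Run(async () =>
        {
            foreach (var file in inputFiles)
                foreach (var line in File.ReadLines(file))
                    await rawRows.Writer.WriteAsync(line);
            rawRows.Writer.Complete();
        });

        // Stage 2: transform (CPU-bound). The worker count is a tuning knob.
        var transformStage = Enumerable.Range(0, Environment.ProcessorCount)
            .Select(_ => Task.Run(async () =>
            {
                await foreach (var line in rawRows.Reader.ReadAllAsync())
                    await outRows.Writer.WriteAsync(line.ToUpperInvariant()); // placeholder transform
            }))
            .ToArray();

        // Stage 3: write (IO-bound). A single writer avoids contention
        // on the output stream.
        var writeStage = Task.Run(async () =>
        {
            using var writer = new StreamWriter(outputFile);
            await foreach (var line in outRows.Reader.ReadAllAsync())
                await writer.WriteLineAsync(line);
        });

        await readStage;
        await Task.WhenAll(transformStage);
        outRows.Writer.Complete(); // all transformers finished; let the writer drain and exit
        await writeStage;
    }
}

The bounded capacities provide the back-pressure between stages, and `Environment.ProcessorCount` transformers is just a starting point for the performance tuning described above.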

Eric J.
  • thanks for the answer, though probably advanced for me, but I'll try to understand and check – yonan2236 Aug 12 '21 at 15:44
  • Channels are surprisingly straightforward to use. Part of the intent in creating them is to put nice abstractions around multi-threading to make it both safer and easier for non-experts. Certainly something worth spending a little time learning. – Eric J. Aug 12 '21 at 15:46
  • I'm not sure I'd reach for pipelines at all here: that feels like a hammer in search of a nail. TPL Dataflow maybe, but only if the complexity of the task justified it. This feels like a much simpler problem – canton7 Aug 12 '21 at 16:06
  • @canton7: Depends to what degree you want to optimize the use of resources. I've implemented pipelines like this before when processing very large files and it makes a significant performance difference in such cases. But yes, there's a lot of added complexity. Be sure the gain is worth the effort before going down that road. – Eric J. Aug 12 '21 at 20:41