I am quite new to concurrency (and C#, actually). I have a bunch of CSV files in two separate directories that need to be read, and I want to do some processing on each file after reading it. The processing of each file is independent of all the other read and process operations. After all the processing is done, I want to update the UI. The UI also needs to stay responsive in the meantime, because I need to display a progress bar. Currently I have something like this:
    private string _directoryA;
    private string _directoryB;

    // The user clicks the button
    private void ButtonPressed()
    {
        Task.Run(() => DoJob());
    }

    private void DoJob()
    {
        var tasks = new List<Task>();
        var watch = Stopwatch.StartNew();
        // One task per directory, each followed by a log message
        tasks.Add(Task.Run(() => DoJobForDirectory(_directoryA)).ContinueWith(t => Console.WriteLine("First Half")));
        tasks.Add(Task.Run(() => DoJobForDirectory(_directoryB)).ContinueWith(t => Console.WriteLine("Second Half")));
        Task.WaitAll(tasks.ToArray());
        watch.Stop();
        Console.WriteLine($"Time Taken : {watch.ElapsedMilliseconds} ms.");
        UpdateUI();
    }

    private void DoJobForDirectory(string directory)
    {
        var files = Directory.EnumerateFiles(directory, "*.csv");
        var tasks = new List<Task>();
        foreach (var file in files)
        {
            // Update the progress bar in the UI when a file has finished processing
            tasks.Add(Task.Run(() => DoJobForFile(file)).ContinueWith(t => UpdateCounter++));
        }
        Task.WaitAll(tasks.ToArray());
    }

    private void DoJobForFile(string filePath)
    {
        ReadCSV();
        ProcessData();
        ...
    }
I feel like I am missing something here. From my reading, this operation should be I/O bound, as the processing afterwards is pretty lightweight (some for loops and assignments). So I should really be using just async/await rather than Task.Run()...? However, I couldn't think of a better way to do this. ReadCSV() comes from a library that does not have an async version. Using Parallel.ForEach does not improve the performance either. Is there a better way to do this (one that is efficient on resources and also achieves better performance)?
Also, when I tried running on only one directory, the elapsed time was nearly half of the time required for both directories. Since the operations are all independent, I want to run them all in parallel, so processing both directories should take roughly the same (or only slightly more) time as processing a single directory, not twice as long. It seems that no matter how many Task.Run() calls I make, only a limited number of threads run at the same time (some bottleneck). I tried changing all the Task.Run() calls to new Thread(), and observed that many more threads were active at the same time, but in the end this resulted in worse performance. Why is that?
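For reference, here is a minimal sketch of how the thread limit I suspect could be inspected. It assumes Task.Run() is scheduling onto the default .NET thread pool; the class name PoolInfo and the method Describe are made up purely for illustration and are not part of my actual code.

```csharp
using System;
using System.Threading;

// Hypothetical helper (illustration only): reports the thread pool limits
// that apply to work queued through Task.Run() on the default scheduler.
public static class PoolInfo
{
    public static string Describe()
    {
        // Minimum threads the pool creates on demand before it starts throttling
        ThreadPool.GetMinThreads(out int minWorker, out int minIo);
        // Upper bound on the threads the pool may ever create
        ThreadPool.GetMaxThreads(out int maxWorker, out int maxIo);

        return $"Logical processors: {Environment.ProcessorCount}\n" +
               $"Min worker/IO threads: {minWorker}/{minIo}\n" +
               $"Max worker/IO threads: {maxWorker}/{maxIo}";
    }

    public static void Main()
    {
        Console.WriteLine(PoolInfo.Describe());
    }
}
```

If the minimum worker count turns out to be close to the number of logical processors, that might be the limit I am observing with Task.Run(), but I am not sure whether that is the actual bottleneck or whether the disk is.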