
I have a large data set of 51 classes (51 folders/directories); each class has 10 different instances (10 subdirectories per directory), and each instance has 600 views (600 files of 10 MB each per subdirectory).

I am using a jagged array of tasks to read those files in parallel, i.e.:

Task[][] threads = new Task[51][];   // each threads[i] is later allocated as new Task[10]

More on how this is used can be found at Jagged array of tasks - Concurrency Issues.

Is there a better approach than this one? It invites unforeseen bugs.

Edit: posting the code from the referenced link in case it gets deleted.

Task[][] threads = new Task[InstancesDir.Length][];
for (int i = 0; i < InstancesDir.Length; i++)
{
    threads[i] = new Task[InstancesDir[i].Length];
}

for (int i = 0; i < FilesDir.Length; i++)
{
    for (int j = 0; j < FilesDir[i].Length; j++)
    {
        // Copy the loop counters into locals: the lambda captures variables,
        // not values, so capturing i and j directly lets the started tasks
        // race against the advancing counters (especially j).
        int ci = i, cj = j;
        threads[i][j] = Task.Run(() =>
        {
            Calculate(ci, cj, InstancesDir, FilesDir, PointSum);
        });
    }

    Task.WaitAll(threads[i]);
}
  • That's > 3TB of data... reading all of this at once into an array is a pretty optimistic approach :-D – Rob Feb 04 '15 at 09:59
  • Disk I/O might become a bottleneck, and memory might be an issue. You're probably better off reading only the chunks you need, when you need to handle them. – Allan S. Hansen Feb 04 '15 at 09:59
  • Don't use multiple threads at all yet. Parallel processing only makes sense if the processing itself takes long enough to be worth optimizing. But I am sure most of your time will be spent reading data (in fact, processing may take less than 1% of the time; do you really care to optimize that?). Do it in one thread (just one non-UI thread); then you have no remaining problems of how to store and manage threads, etc. – Sinatr Feb 04 '15 at 10:41
  • @Robert I am reading just a stream, and most of the files have the useful information in the first 1 KB, so I dispose of the stream once the required tags are processed. – Muhammad Umar Farooq Feb 07 '15 at 07:38
  • Perhaps Reactive Extensions (Rx) is the way to go? –  Feb 07 '15 at 07:50

1 Answer


Frankly, it's not clear at all how you arrived at this design. Looking at the referenced post (you really should include all relevant details here...what happens if that other post gets renamed, or deleted?), it looks like you only ever wait on ten tasks at a time. So why bother storing all 510?

More to the point, your disk is only so fast. Assuming you are I/O bound (i.e. the calculations you do on the data are not extremely expensive), I would expect processing at most two or three files concurrently to be helpful (issuing concurrent I/O operations can help the disk I/O layer schedule the operations on the hardware more efficiently).
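To illustrate, here is a minimal sketch of bounded concurrency using Parallel.ForEach with MaxDegreeOfParallelism. The DataSetRoot path and the ProcessFile helper are hypothetical stand-ins for your directory layout and your Calculate method:

using System.IO;
using System.Threading.Tasks;

class BoundedReader
{
    static void ProcessFile(string path)
    {
        // Hypothetical per-file work standing in for Calculate. Since the
        // useful tags live in the first 1 KB, read only that much and stop.
        using (var stream = File.OpenRead(path))
        {
            var buffer = new byte[1024];
            int read = stream.Read(buffer, 0, buffer.Length);
            // ... parse the required tags from buffer[0..read) ...
        }
    }

    static void Main()
    {
        // Enumerate the 51 x 10 x 600 files lazily instead of materializing
        // a 510-element jagged array of tasks.
        var files = Directory.EnumerateFiles("DataSetRoot", "*",
                                             SearchOption.AllDirectories);

        // Cap concurrency at two or three readers so the disk can schedule
        // the I/O efficiently instead of thrashing between many reads.
        var options = new ParallelOptions { MaxDegreeOfParallelism = 3 };
        Parallel.ForEach(files, options, ProcessFile);
    }
}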

Even if your computations are so expensive that the bottleneck is CPU, it won't help to have more concurrent operations than you have CPU cores.
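In code, that cap is typically expressed with Environment.ProcessorCount; a one-line fragment, reusing the hypothetical options object from the sketch above:

// For CPU-bound work, don't run more concurrent operations than you have cores.
var options = new ParallelOptions { MaxDegreeOfParallelism = Environment.ProcessorCount };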

Absent the useful details that would explain precisely what you're doing here, I'd say that the best thing would be to forget about processing the files concurrently. Do them sequentially and skip all the multithreading bugs.
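For reference, the sequential version is just nested loops over the directory tree, with no task array, no captured loop counters, and nothing to synchronize (DataSetRoot and ProcessFile are the same hypothetical names used in the sketch above):

using System.IO;

class SequentialReader
{
    static void Main()
    {
        foreach (string classDir in Directory.EnumerateDirectories("DataSetRoot"))   // 51 classes
            foreach (string instanceDir in Directory.EnumerateDirectories(classDir)) // 10 instances
                foreach (string viewFile in Directory.EnumerateFiles(instanceDir))   // 600 views
                    ProcessFile(viewFile);   // same hypothetical per-file work as above
    }

    static void ProcessFile(string path) { /* read the first 1 KB, parse tags */ }
}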

If you know something about the processing that leads you to be sure that some concurrency is important, then you need to be more specific about that in your question. But even there, you should limit your concurrency; going beyond the degree of concurrency that is helpful can actually be harmful, as more and more threads wind up contending for the same bottleneck, causing costly overhead like thread context switching and I/O bus congestion.
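If you do end up with task-based code, one common way to bound the concurrency is a SemaphoreSlim gate. This is a sketch under stated assumptions, not your exact pipeline; RunThrottledAsync, maxConcurrency, and ProcessFile are hypothetical names:

using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

class ThrottledReader
{
    static async Task RunThrottledAsync(IEnumerable<string> files, int maxConcurrency)
    {
        // At most maxConcurrency tasks hold a slot at any moment; the rest
        // wait cheaply on the gate instead of contending for the disk.
        using (var gate = new SemaphoreSlim(maxConcurrency))
        {
            var tasks = new List<Task>();
            foreach (string file in files)
            {
                await gate.WaitAsync();
                // Unlike the question's for-loop indices, a foreach variable
                // is fresh per iteration (C# 5+), so capturing it is safe.
                tasks.Add(Task.Run(() =>
                {
                    try { ProcessFile(file); }
                    finally { gate.Release(); }
                }));
            }
            await Task.WhenAll(tasks);
        }
    }

    static void ProcessFile(string path) { /* per-file work as in the earlier sketch */ }
}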

Peter Duniho