3

I'm writing a program which needs to recursively search through a folder structure, and would like to do so in parallel with several threads.

I've written the rather trivial synchronous method already - adding the root directory to the queue initially, then dequeuing a directory, queuing its subdirectories, etc., until the queue is empty. I'll use a ConcurrentQueue<T> for my queue, but have already realized that my loops will stop prematurely. The first thread will dequeue the root directory, and immediately every other thread could see that the queue is empty and exit, leaving the first thread as the only one running. I would like each thread to loop until the queue is empty, then wait until another thread queues some more directories, and keep going. I need some sort of checkpoint in my loop so that none of the threads will exit until every thread has reached the end of the loop, but I'm not sure of the best way to do this without deadlocking when there really are no more directories to process.
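
Roughly, the synchronous version looks like this (the Regex test stands in for whatever name/property checks I end up doing):

```csharp
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

static class SyncSearch
{
    // Breadth-first traversal with an explicit queue; file names are tested
    // against a Regex, standing in for my real name/property checks.
    public static List<string> Find(string rootDirectory, Regex pattern)
    {
        var results = new List<string>();
        var queue = new Queue<string>();
        queue.Enqueue(rootDirectory);

        while (queue.Count > 0)
        {
            string dir = queue.Dequeue();

            foreach (string sub in Directory.EnumerateDirectories(dir))
                queue.Enqueue(sub);

            foreach (string file in Directory.EnumerateFiles(dir))
                if (pattern.IsMatch(Path.GetFileName(file)))
                    results.Add(file);
        }

        return results;
    }
}
```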

dlras2
  • 8,416
  • 7
  • 51
  • 90
  • Can you elaborate on what you are searching for in the folder structure? Just a file name, or something inside a file? Or even something else? I think this is vital to suggesting an appropriate algorithm. – Valentin Kuzub May 08 '11 at 06:19
  • Are you doing this for reasons of performance? I looked at something similar for a process to read thousands of files and found that the controlling performance factor was Disk IO and processing in parallel threads did not noticeably affect performance. – ScruffyDuck May 08 '11 at 06:28
  • I'm matching either the name of the directory or the files inside of it against a regular expression, then returning the results - I'm not matching anything inside of files. I might expand it to check file properties, but that's it. – dlras2 May 08 '11 at 06:28
  • @ScruffyDuck - Yes, I'm doing it for performance reasons, but as I said, I'm not actually opening the file, just testing its file names and properties. Should this make the parallelization more effective? – dlras2 May 08 '11 at 06:29
  • 1
    I would have thought that if you are doing minimal processing of each file, using multiple threads would be less rather than more effective at improving performance. It seems to me that the more processing you need to do for each file, the smaller the impact of system IO and the more useful multiple threads become. – ScruffyDuck May 08 '11 at 08:07
  • Make sure you know about `Directory.EnumerateFiles()`. And if you need parallelism, TPL and BlockingCollection. – H H May 08 '11 at 10:23

3 Answers

5

Use the Task Parallel Library.

Create a Task to process the first folder. Within it, create a Task to process each subfolder (recursively) and a Task for each relevant file, then wait on all the tasks for this folder.

The TPL runtime will make use of the thread pool, avoiding the creation of new threads, which is an expensive operation for small pieces of work.

Note:

  • If the work per file is trivial, do it inline rather than creating another task (IO performance will be the limiting factor).
  • This approach will generally work best if blocking operations are avoided, but if IO performance is the limit then this might not matter anyway—start simple and measure.
  • Before .NET 4 much of this could be done with the thread pool, but you'd need to use events to wait for work items to complete, and that waiting would tie up thread pool threads.1

1 As I understand it, when you wait on tasks using a TPL method, the TPL will reuse that thread for other tasks until the wait is fulfilled.
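
A minimal sketch of that shape (the names, the Regex used as the per-file test, and the ConcurrentBag for collecting results are illustrative, not prescriptive):

```csharp
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using System.Threading.Tasks;

static class ParallelSearch
{
    public static string[] Find(string root, Regex pattern)
    {
        var results = new ConcurrentBag<string>();
        Task.Factory.StartNew(() => ProcessFolder(root, pattern, results)).Wait();
        return results.ToArray();
    }

    // One task per folder: test this folder's files inline, spawn a child
    // task per subfolder, then wait on those children before returning.
    static void ProcessFolder(string dir, Regex pattern, ConcurrentBag<string> results)
    {
        foreach (string file in Directory.EnumerateFiles(dir))
            if (pattern.IsMatch(Path.GetFileName(file)))
                results.Add(file);

        Task[] children = Directory.EnumerateDirectories(dir)
            .Select(sub => Task.Factory.StartNew(() => ProcessFolder(sub, pattern, results)))
            .ToArray();

        Task.WaitAll(children);
    }
}
```

Each call to ProcessFolder waits only on its direct children, giving a tree of waits where each node waits on the subtree beneath it.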

Richard
  • 106,783
  • 21
  • 203
  • 265
  • If I understand this correctly, I'm forgoing an explicit queue altogether and invoking actions, rather than queuing subdirectories? – dlras2 May 08 '11 at 06:52
  • @Daniel: Yes. Or rather replacing an explicit queue with an implicit one (the queue of work items in the thread pool). – Richard May 08 '11 at 07:12
  • The problem I ran into with this method is that I am not sure how to wait on all tasks. `WaitAll` needs a specific list of Tasks to wait on, which I won't have since each task creates more tasks. – dlras2 May 08 '11 at 18:19
  • @Daniel: each task waits on the tasks it creates directly, some of those tasks will be themselves waiting on tasks -- think of a tree where each node waits on its direct children recursively. – Richard May 08 '11 at 20:40
2

If you want to stick to the concept of an explicit queue, have a look at the BlockingCollection class. Its GetConsumingEnumerable() method returns an IEnumerable which blocks when the collection has run out of items and continues as soon as new items are available. This means that whenever the collection is empty the thread is blocked, which prevents it from stopping prematurely.

However, this is basically most useful for producer-consumer scenarios; I am not sure whether your problem falls into this category.
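
A rough sketch of how it could be wired up; the worker count, the onFile callback, and the interlocked pending counter used to decide when to call CompleteAdding() are illustrative additions rather than anything BlockingCollection provides for you:

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

static class BlockingQueueSearch
{
    public static void Find(string root, Action<string> onFile, int workerCount)
    {
        var directories = new BlockingCollection<string>();
        int pending = 1;                  // directories added but not yet fully processed
        directories.Add(root);

        var workers = new Task[workerCount];
        for (int i = 0; i < workerCount; i++)
        {
            workers[i] = Task.Factory.StartNew(() =>
            {
                // Blocks while the collection is empty; ends after CompleteAdding.
                foreach (string dir in directories.GetConsumingEnumerable())
                {
                    foreach (string sub in Directory.EnumerateDirectories(dir))
                    {
                        Interlocked.Increment(ref pending);
                        directories.Add(sub);
                    }

                    foreach (string file in Directory.EnumerateFiles(dir))
                        onFile(file);     // onFile must be thread-safe

                    // The thread that finishes the last outstanding directory unblocks everyone.
                    if (Interlocked.Decrement(ref pending) == 0)
                        directories.CompleteAdding();
                }
            });
        }

        Task.WaitAll(workers);
    }
}
```

The counter is what lets the workers, which act as both producers and consumers here, collectively decide that no further directories can appear.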

Theo Lenndorff
  • 4,556
  • 3
  • 28
  • 43
  • This is a good find, but I'm having a little trouble adapting it to my many-producer-many-consumer scenario, as no producer can definitively say it's finished as long as other producers are producing. – dlras2 May 08 '11 at 18:42
1

It would seem that in this case your best bet is to start with one thread, then whenever you load sub-directories, hand them off to threads from the thread pool. Let those threads return to the pool when they are done and draw new ones every time you go one step deeper into the directory tree. This way there is no deadlock and your system uses threads as it needs them. You could even decide how many threads to use based upon how many folders are found.

Edit: Changed the above to make it clearer that you don't want to explicitly create new threads; instead you want to take advantage of the thread pool, which adds and removes threads as needed without the overhead.
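
A rough sketch of that idea against the raw thread pool; the CountdownEvent used to know when every queued directory has been handled and the onFile callback are illustrative additions:

```csharp
using System;
using System.IO;
using System.Threading;

static class ThreadPoolSearch
{
    public static void Find(string root, Action<string> onFile)
    {
        // One count per queued directory; Wait() returns once every directory has signalled.
        using (var outstanding = new CountdownEvent(1))
        {
            WaitCallback processDir = null;
            processDir = state =>
            {
                string dir = (string)state;

                foreach (string sub in Directory.EnumerateDirectories(dir))
                {
                    outstanding.AddCount();
                    ThreadPool.QueueUserWorkItem(processDir, sub);
                }

                foreach (string file in Directory.EnumerateFiles(dir))
                    onFile(file);   // onFile must be thread-safe

                outstanding.Signal();
            };

            ThreadPool.QueueUserWorkItem(processDir, root);
            outstanding.Wait();
        }
    }
}
```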

IAmTimCorey
  • 16,412
  • 5
  • 39
  • 75
  • 1
    Explicitly creating threads is almost always the wrong answer: either thread pool or Tasks will work better. – Richard May 08 '11 at 06:34
  • I thought about this, but would the overhead of creating all the threads outweigh the benefits? Matching a file or directory isn't hard work, there just might be a lot of them to get through. – dlras2 May 08 '11 at 06:35
  • @Richard - Yes, I would use a thread pool and pull from it when needed. I was thinking that when I posted this but I guess I wasn't clear. No, you wouldn't want to create new threads from scratch but instead you would want to pull them out of the thread pool and let them go back in when they are done. – IAmTimCorey May 08 '11 at 06:39