6

I'd like to traverse a directory on my hard drive and search through all the files for a specific search string. This sounds like the perfect candidate for something that could (or should) be done in parallel since the IO is rather slow.

Traditionally, I would write a recursive function that finds and processes all files in the current directory and then recurses into each of its subdirectories. I'm wondering how I can modify this to be more parallel. At first I simply changed:

foreach (string directory in directories) { ... }

to

Parallel.ForEach(directories, (directory) => { ... });

but I feel that this might create too many tasks and tie itself in knots, especially when trying to dispatch back onto a UI thread. I also feel that the number of tasks is unpredictable and that this might not be an efficient way to parallelize (is that a word?) this task.

Has anyone successfully done something like this before? What advice do you have in doing so?

rein
  • +1 I'm glad Jon dispelled the theory, because I too was thinking this would be a good candidate. It seems Microsoft can't even do it correctly either: http://msdn.microsoft.com/en-us/library/ff477033.aspx -> read the comment at the bottom :) – Jeremy Thompson May 31 '12 at 04:12

1 Answer

15

No, this doesn't sound like a good candidate for parallelism, precisely because the IO is slow. You're going to be disk-bound. Assuming you've only got one disk, you don't really want to be making it seek to multiple different places at the same time.

It's a bit like trying to attach several hoses to the same tap in order to get water out faster - or trying to run 16 CPU-bound threads on a single core :)

Jon Skeet
  • That makes sense. Would it then be more beneficial to have one thread do all the disk-IO and multiple others parsing the files? – rein Nov 10 '10 at 23:05
  • @rein: If parsing the data takes any significant amount of time, then it might make sense to do that separately from synchronous reading, yes. However, if the IO is the most significant bottleneck, it might not actually gain you much, but it would make the code significantly more complex. You might look at having one thread doing synchronous IO and handing data to another thread to do all the parsing. Worth experimenting. – Jon Skeet Nov 10 '10 at 23:10
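
The arrangement suggested in the last comment, one thread doing synchronous IO and handing data to another thread for parsing, could be sketched roughly as follows. This is only an illustration under stated assumptions, not code from the thread: the class name FileSearch, the method Search, and the queue capacity of 16 are all hypothetical, "parsing" is reduced to a plain substring check, and each file is read whole with File.ReadAllText, which would need rethinking for very large files. It assumes .NET 4 or later for BlockingCollection and Directory.EnumerateFiles.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

static class FileSearch
{
    // Hypothetical helper: returns the paths of files under root whose text contains searchText.
    public static List<string> Search(string root, string searchText)
    {
        var matches = new List<string>();

        // Bounded (capacity is an arbitrary choice) so the reader cannot run far
        // ahead of the parser and fill memory with unread file contents.
        using (var queue = new BlockingCollection<KeyValuePair<string, string>>(boundedCapacity: 16))
        {
            // Producer: the only thread that touches the disk, reading one file at a time.
            // Note: an inaccessible directory or file throws and ends the whole enumeration.
            Task reader = Task.Factory.StartNew(() =>
            {
                try
                {
                    foreach (string path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
                    {
                        queue.Add(new KeyValuePair<string, string>(path, File.ReadAllText(path)));
                    }
                }
                finally
                {
                    // Always signal completion so the consumer loop below terminates,
                    // even if reading failed part-way through.
                    queue.CompleteAdding();
                }
            }, TaskCreationOptions.LongRunning);

            // Consumer: scans the in-memory contents on the calling thread.
            foreach (KeyValuePair<string, string> item in queue.GetConsumingEnumerable())
            {
                if (item.Value.Contains(searchText))
                {
                    matches.Add(item.Key);
                }
            }

            // Surfaces any reader failure as an AggregateException.
            reader.Wait();
        }

        return matches;
    }
}
```

Because Search blocks until the whole tree has been read, it would need to be called from a background thread rather than the UI thread. As the answer points out, if the disk is the bottleneck this may gain very little over a plain sequential loop, so it is worth timing both before keeping the extra complexity.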