4

I am trying to loop through all files and folders and perform an action on every file that has a certain extension. This method works fine, but I would like to make it multithreaded, because when it is run over tens of thousands of files it is really slow, and I would imagine using multithreading would speed things up. I am just unsure about how to use threading in this case.

doStuff reads properties (date modified, etc.) from the files and inserts them into a SQLite database. I am starting a transaction before the scan method is called, so that part is optimized as much as it can be.
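
Roughly, the setup around the call looks like this (a simplified sketch assuming the System.Data.SQLite provider; the connection string and how doStuff gets hold of the connection are illustrative, not the exact code):

    // Simplified sketch: one transaction wraps the whole scan so the SQLite
    // inserts done by doStuff are batched instead of committed one at a time.
    using (SQLiteConnection conn = new SQLiteConnection("Data Source=scan.db"))
    {
        conn.Open();
        using (SQLiteTransaction tx = conn.BeginTransaction())
        {
            scan(rootDirectory);   // doStuff uses the same open connection
            tx.Commit();
        }
    }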

Answers that provide the theory on how to do it are just as good as full working code answers.

    private static string[] validTypes = { ".x", ".y", ".z", ".etc" };

    public static void scan(string rootDirectory)
    {
        try
        {
            // Recurse into subdirectories, skipping the recycle bin.
            foreach (string dir in Directory.GetDirectories(rootDirectory))
            {
                if (dir.ToLower().IndexOf("$recycle.bin") == -1)
                    scan(dir);
            }

            // Process only files whose extension is in validTypes.
            foreach (string file in Directory.GetFiles(rootDirectory))
            {
                if (!((IList<string>)validTypes).Contains(Path.GetExtension(file)))
                {
                    continue;
                }

                doStuff(file);
            }
        }
        catch (Exception)
        {
        }
    }
Alec Gorge
  • 17,110
  • 10
  • 59
  • 71
  • This is off-topic, but you should never catch all exceptions. – Dirk Groeneveld Jul 20 '10 at 19:23
  • yes, I agree. I removed that part (because this is a winforms application) for simplicity – Alec Gorge Jul 20 '10 at 19:25
  • 3
    Why would you imagine that multithreading would speed anything up? **Threads do not magically make your disk run faster.** Threads can make your disk run *slower* because the disk controller now has more things to do at the same time. Can you explain why you think that a multithreaded solution would be faster? – Eric Lippert Jul 20 '10 at 19:30
  • 1
    @Eric: In practice, threading can speed up I/O. One reason is that the thread isn't doing I/O 100% of the time, so additional threads can fill the gap. Another is that I/O latency can lead to underutilization of the total bandwidth, whereas overlapping requests can fill the pipe fully. This is the theory; the practice is that it benchmarks much faster. – Steven Sudit Jul 20 '10 at 19:33
  • That makes sense, because when I am inserting into the sqlite database, I could start on the next file. – Alec Gorge Jul 20 '10 at 19:37
  • @Ramblingwood: In the end, don't take my word on it, or even Eric's or Dan's. Test it. Benchmarks don't lie. – Steven Sudit Jul 20 '10 at 19:42
  • Wow. Did you really just correct Eric Lippert on a C# question? – Matt Jordan Jul 20 '10 at 20:18
  • @Eric: He might have a fast SSD (with practically zero random access latency) and CPU-bound node processing at the same time, which would theoretically allow for some speedup using parallel code. – Alan Jul 20 '10 at 20:21
  • @ramblingwood: Use EnumerateFiles and EnumerateDirectories instead of GetFiles and GetDirectories if you are using .NET 4.0. They directly return enumerables, so your code runs faster for large directories. See http://msdn.microsoft.com/en-us/library/dd997370.aspx – apoorv020 Jul 20 '10 at 20:26
  • @Matt: While I certainly respect Eric, particularly within his field, this is not a C# question, and his authority does not exceed that of my stopwatch. – Steven Sudit Jul 20 '10 at 21:35
  • Thanks for the comments. I am using .NET 4.0 and I am using an SSD. You can see what my final solution was in my own answer. – Alec Gorge Jul 20 '10 at 21:46
  • I tried this before. As Eric said, if you're working on ONE disk, performance will likely DECREASE with multiple threads. – Mau Jul 21 '10 at 13:10
  • Benchmarking beats theorizing. – Steven Sudit Jul 21 '10 at 13:52
  • @Mau Yes, if the bottleneck is the hard drive. Mine was the database, and by getting in as many hard drive reads as I could, the time dropped massively. – Alec Gorge Jul 22 '10 at 16:46

5 Answers

5

Assuming that doStuff is thread-safe, and that you don't need to wait for the entire scan to finish, you can call both doStuff and scan on the ThreadPool, like this:

string path = file;
ThreadPool.QueueUserWorkItem(delegate { doStuff(path); });

You need to make a separate local variable because the anonymous method would otherwise capture the file variable itself and would see changes to it throughout the loop. (In other words, if the ThreadPool executed the task only after the loop had moved on to the next file, it would process the wrong file.)
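
For illustration, here is the same loop with and without the copy (a sketch; the commented-out line shows the capture bug just described):

    foreach (string file in Directory.GetFiles(rootDirectory))
    {
        // Buggy: the delegate captures the loop variable itself, so by the time a
        // pool thread runs it, "file" may already refer to a later file.
        // ThreadPool.QueueUserWorkItem(delegate { doStuff(file); });

        // Correct: copy into a local so each queued work item keeps its own value.
        string path = file;
        ThreadPool.QueueUserWorkItem(delegate { doStuff(path); });
    }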

However, reading your comment, the main issue here is disk IO, so I suspect that multithreading will not help much.

Note that Directory.GetFiles will perform slowly for directories with large numbers of files, since it needs to allocate an array to hold all of the filenames.
If you're using .NET 4.0, you can make it faster by calling the EnumerateFiles method instead, which uses an iterator to return an IEnumerable<string> that enumerates the directory as you run your loop.
You can also avoid the recursive scan calls with either method by passing the SearchOption parameter, like this:

foreach (string file in Directory.EnumerateFiles(rootDirectory, "*", SearchOption.AllDirectories))

This will recursively scan all subdirectories, so you'll only need a single foreach loop.
Note that this will exacerbate the performance issue with GetFiles (every file name in the tree ends up in a single array), so you may want to avoid it pre-.NET 4.0.

SLaks
  • 868,454
  • 176
  • 1,908
  • 1,964
  • This is the right way to go, but you might also want multiple threads doing the searching. – Steven Sudit Jul 20 '10 at 19:24
  • so even for just reading files, threading won't help? I think it is worth a shot trying this though. – Alec Gorge Jul 20 '10 at 19:27
  • @SLaks: Sorry, the way I understood it, you're suggesting that a single thread does the searching, queuing up `doStuff` for each file it finds. If so, I would suggest having multiple search threads. The idea would be to do your own recursion. As for disk I/O and multithreading, see my response to Eric. – Steven Sudit Jul 20 '10 at 19:35
  • Hmm, the directories shouldn't have more than 20 or 30 files in them, so that shouldn't be a problem. That last foreach you posted seems pretty interesting, and combined with Dan's comment below I guess I will have to drop multithreading altogether. – Alec Gorge Jul 20 '10 at 19:36
  • As I said in my first sentence, `you can call both doStuff and scan on the ThreadPool` – SLaks Jul 20 '10 at 19:37
  • I think this will be the best way to improve performance, even if it is just a little bit. Thanks also to Steven, who made that very useful comment in response to Eric above. – Alec Gorge Jul 20 '10 at 19:39
  • @SLaks: Indeed you did. I was thrown off by your last sentence, though, where you recommended `SearchOption.AllDirectories`, which would remove the ability to parallelize the scan. – Steven Sudit Jul 20 '10 at 19:44
2

Using multithreading on IO operations is generally a bad call*. You may have multiple CPUs or a CPU with multiple cores; but generally, your hard disk cannot read or write to multiple files at the same time. This sort of thing typically needs to be serialized.

That said, it's a good practice to perform this kind of work on a thread that's separate from your UI thread. That way the UI remains responsive while your app is doing its heavy lifting.

*I am assuming your scan and doStuff methods are actually reading and/or writing data on the hard disk. If this is not the case, parallelizing this code might make sense after all.
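
For example, in a WinForms app (the question's comments mention this is one), the scan can be kicked off from a button handler like this; a rough sketch, where the button and label names are made up:

    private void scanButton_Click(object sender, EventArgs e)
    {
        // Run the existing scan off the UI thread so the form stays responsive.
        ThreadPool.QueueUserWorkItem(delegate
        {
            scan(@"C:\");   // or whatever root the user picked

            // Marshal back to the UI thread before touching any controls.
            this.Invoke((MethodInvoker)delegate { statusLabel.Text = "Scan complete"; });
        });
    }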

Dan Tao
  • 125,917
  • 54
  • 300
  • 447
  • @Ramblingwood: Are you reading the *contents* of files, or just looking at the properties of paths and/or `DirectoryInfo`/`FileInfo` objects? Reading from the hard disk is also not possible to do multithreaded on most systems. – Dan Tao Jul 20 '10 at 19:28
  • I see. I was hoping to expand in the future to actually reading from the hard disk so I guess I shouldn't multithread this. – Alec Gorge Jul 20 '10 at 19:35
1

What exactly do doStuff and scan do? Unless they're very CPU intensive, I would have thought that the disk access would be the bottleneck and that, if anything, making it multithreaded might be slower.

Hans Olsson
  • 54,199
  • 15
  • 94
  • 116
  • `doStuff` reads properties (date modified, etc.) from the files and inserts them into a sqlite database. I am starting a transaction before the scan method is called so that is optimized as much as it can be. – Alec Gorge Jul 20 '10 at 19:23
  • @Ramblingwood: You could try out having just 2 threads to begin with then, one that reads all the file details into memory and the other that uses that info to write to the DB. Then you could measure how much time each spends processing and make sure you optimize the right thing. – Hans Olsson Jul 20 '10 at 19:26
  • @Ramblingwood: Take a look at .NET 4.0's blocking queue (http://msdn.microsoft.com/en-us/library/dd267312(VS.100).aspx). The nice thing about a producer/consumer architecture is that you can easily change the number of threads on either end, for tuning. – Steven Sudit Jul 20 '10 at 19:53
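
A minimal sketch of the producer/consumer split described in the last two comments, using .NET 4.0's BlockingCollection<string> (the single-producer/single-consumer arrangement and all names are illustrative):

    // Requires System.Collections.Concurrent, System.Threading.Tasks, System.IO.
    BlockingCollection<string> files = new BlockingCollection<string>();

    // Producer: walk the tree and queue matching file paths.
    Task producer = Task.Factory.StartNew(delegate
    {
        foreach (string file in Directory.EnumerateFiles(rootDirectory, "*", SearchOption.AllDirectories))
        {
            if (((IList<string>)validTypes).Contains(Path.GetExtension(file)))
                files.Add(file);
        }
        files.CompleteAdding();   // tell the consumer no more paths are coming
    });

    // Consumer: drain the queue and do the database work.
    Task consumer = Task.Factory.StartNew(delegate
    {
        // GetConsumingEnumerable blocks while empty and ends after CompleteAdding.
        foreach (string path in files.GetConsumingEnumerable())
            doStuff(path);
    });

    Task.WaitAll(producer, consumer);
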
1

On a side note, there's no need to cast validTypes to IList<string>: in .NET 3.5+ you can call the Contains extension method (from System.Linq) directly on the array, since arrays implement IEnumerable<T>.

Secondly, validTypes might be better implemented as a HashSet<string>, giving you O(1) lookups instead of the O(n) scan that Contains does over an array. That said, this probably won't impact performance much in this case, because your application is IO bound, as pointed out by the other answers.
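
For example (a sketch; the case-insensitive comparer is an extra assumption so that ".TXT" matches ".txt"):

    // O(1) membership tests instead of a linear scan over the array.
    private static readonly HashSet<string> validTypes =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".x", ".y", ".z", ".etc" };

    // Inside the loop, no cast is needed:
    if (!validTypes.Contains(Path.GetExtension(file)))
        continue;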

Winston Smith
  • 21,585
  • 10
  • 60
  • 75
0

Thanks to everyone who responded. What I ended up going with was:

        foreach (string file in Directory.EnumerateFiles(rootDirectory, "*", SearchOption.AllDirectories))
        {
            if (!((IList<string>)validTypes).Contains(Path.GetExtension(file)))
            {
                continue;
            }
            string path = file;
            ThreadPool.QueueUserWorkItem(delegate { doStuff(path); });
        }

This ran in about 2 minutes compared to the multiple hours that it was taking before. I think most of the lag was in the database, not the file IO.

Thanks so much everyone!

Alec Gorge
  • 17,110
  • 10
  • 59
  • 71