
As we all know, Directory.GetFiles() isn't very fast. I was trying to find some files as fast as possible and came across some weird results.

I started by using Parallel.ForEach to iterate over all directories on my C:\ drive.

I implemented three methods and displayed the results. They differ only in how many nested Parallel.ForEach calls they use to descend into the directory tree.

Here's the result.

[screenshot: the three result message boxes]

I don't get why using a single Parallel.ForEach is faster than using two... And what I don't get at all is why the methods differ in the number of files they find?!

Long story short, here's my code:

Caller:

private void Start(object sender, EventArgs e)
{
    new Thread(() =>
    {
        Stopwatch watch = new Stopwatch();
        watch.Start();
        List<FileInfo> files = HighPerformanceFileGettter.GetFilesInDirectory_DoubleParallel(@"C:\", "*.txt", SearchOption.AllDirectories);
        watch.Stop();
        MessageBox.Show($"Found [{files.Count}] files in [{watch.Elapsed.TotalMilliseconds}ms] => [{watch.Elapsed.TotalSeconds}]s", "Double Parallel");
    }).Start();

    new Thread(() =>
    {
        Stopwatch watch = new Stopwatch();
        watch.Start();
        List<FileInfo> files = HighPerformanceFileGettter.GetFilesInDirectory_SingleParallell(@"C:\", "*.txt", SearchOption.AllDirectories);
        watch.Stop();
        MessageBox.Show($"Found [{files.Count}] files in [{watch.Elapsed.TotalMilliseconds}ms] => [{watch.Elapsed.TotalSeconds}]s", "Single Parallel");
    }).Start();

    new Thread(() =>
    {
        Stopwatch watch = new Stopwatch();
        watch.Start();
        List<FileInfo> files = HighPerformanceFileGettter.GetFilesInDirectory_TripleParallel(@"C:\", "*.txt", SearchOption.AllDirectories);
        watch.Stop();
        MessageBox.Show($"Found [{files.Count}] files in [{watch.Elapsed.TotalMilliseconds}ms] => [{watch.Elapsed.TotalSeconds}]s", "Triple Parallel");
    }).Start();
}

Searching:

public static List<FileInfo> GetFilesInDirectory_TripleParallel(string rootDirectory, string pattern, System.IO.SearchOption option)
{
    List<FileInfo> resultFiles = new List<FileInfo>();

    //Search:
    DirectoryInfo root = new DirectoryInfo(rootDirectory);
    if (root.Exists)
    {
        //Performance:
        Parallel.ForEach(root.GetDirectories(), (dir) =>
        {
            try
            {
                Parallel.ForEach(dir.GetDirectories(), (dir_1) =>
                {
                    try
                    {
                        Parallel.ForEach(dir_1.GetDirectories(), (dir_2) =>
                        {
                            try
                            {
                                resultFiles.AddRange(dir_2.GetFiles(pattern, option));
                            }
                            catch (Exception) { }
                        });
                    }
                    catch (Exception) { }
                });
            }
            catch (Exception) { }
        });
        return resultFiles;
    }
    Debug.Fail($"Root [{root.FullName}] does not exist");
    return null;
}

NOTE: I only posted one of the three methods, but you should see what's different. It's only the number of Parallel.ForEach calls I used.

Does anyone have an idea which variant is best in terms of performance, and why the file counts differ?

Felix D.
  • If you can use the results one at a time as they are returned, then [Directory.EnumerateFiles](https://msdn.microsoft.com/en-us/library/system.io.directory.enumeratefiles%28v=vs.110%29.aspx) is likely to help. As for the differing file counts, have you checked for duplicates? – Andrew Morton Mar 04 '16 at 12:12
  • In terms of speed, amongst other things, disk I/O is a major limiting factor. You can have as many parallel processes as you like, but the disk can only seek so fast. – Equalsk Mar 04 '16 at 12:16
  • @AndrewMorton Even if there are duplicates, shouldn't all three functions find them? – Felix D. Mar 04 '16 at 12:32
  • `List` isn't thread-safe. That should be the reason for the different results. Try using a thread-safe collection (see the sketch after these comments). – Dovydas Šopa Mar 04 '16 at 12:48
  • Using threads to read from a disk drive is a very, very bad idea. You only have one drive, and it does *not* like to keep multiple threads happy. Disk seeks are by far the most expensive thing you can ever do with a drive. Especially dangerous because it does not repeat well; the second time you run the program you'll read from memory instead of the disk. A cautionary tale is [available here](http://stackoverflow.com/a/25950320/17034). – Hans Passant Mar 04 '16 at 13:19
  • Thanks a lot :D Learned something again today ^^ – Felix D. Mar 04 '16 at 13:24
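
To make the two comment suggestions concrete, here is a minimal sketch (not the OP's code; the class and method names are made up for illustration): it parallelizes only over the top-level directories, streams matches with Directory.EnumerateFiles, and collects them in a thread-safe ConcurrentBag instead of an unsynchronized List.

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;

    public static class FileSearchSketch
    {
        // Hypothetical helper combining the two comment suggestions.
        public static List<FileInfo> GetFilesSketch(string rootDirectory, string pattern)
        {
            // Thread-safe collection instead of List<T>.
            var results = new ConcurrentBag<FileInfo>();

            // Parallelize only over the top-level directories; within each one,
            // EnumerateFiles streams matches lazily instead of building one big array.
            Parallel.ForEach(Directory.EnumerateDirectories(rootDirectory), dir =>
            {
                try
                {
                    foreach (string path in Directory.EnumerateFiles(dir, pattern, SearchOption.AllDirectories))
                        results.Add(new FileInfo(path));
                }
                // Caveat: one inaccessible subfolder aborts enumeration of this branch.
                catch (UnauthorizedAccessException) { }
                catch (IOException) { }
            });

            // Files sitting directly in the root are not covered by the loop above.
            foreach (string path in Directory.EnumerateFiles(rootDirectory, pattern, SearchOption.TopDirectoryOnly))
                results.Add(new FileInfo(path));

            return results.ToList();
        }
    }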

1 Answer


The reason for the different file counts is SearchOption.AllDirectories. In your single parallel version, if you read C:\ you get all files in all subdirectories once. In your triple version you get:

  • all files in all subdirectories below `C:\`
  • plus all files in all subdirectories below `C:\dir1` (-> duplicates)
  • plus all files in all subdirectories below `C:\dir1\dir2` (-> new duplicates)

So all files in C:\ are added once, all files in C:\dir1 are added twice and all files in C:\dir1\dir2 are added three times.
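
As an illustration of how to avoid this, here is a sketch (the name GetFilesNoDuplicates is made up; this is not the original HighPerformanceFileGettter code): each level collects only its own files with SearchOption.TopDirectoryOnly, and only the innermost call recurses with SearchOption.AllDirectories, so every file is reached by exactly one path and counted once. A ConcurrentBag replaces the unsynchronized List.

    using System.Collections.Concurrent;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Threading.Tasks;

    public static class DuplicateFreeSearch
    {
        public static List<FileInfo> GetFilesNoDuplicates(string rootDirectory, string pattern)
        {
            var results = new ConcurrentBag<FileInfo>();
            DirectoryInfo root = new DirectoryInfo(rootDirectory);

            // Files directly in the root: TopDirectoryOnly, counted exactly once.
            foreach (FileInfo f in root.GetFiles(pattern, SearchOption.TopDirectoryOnly))
                results.Add(f);

            Parallel.ForEach(root.GetDirectories(), dir =>
            {
                try
                {
                    // Files directly in this first-level directory, counted once.
                    foreach (FileInfo f in dir.GetFiles(pattern, SearchOption.TopDirectoryOnly))
                        results.Add(f);

                    // Only the innermost call recurses, so no subtree is visited twice.
                    foreach (DirectoryInfo sub in dir.GetDirectories())
                    {
                        try
                        {
                            foreach (FileInfo f in sub.GetFiles(pattern, SearchOption.AllDirectories))
                                results.Add(f);
                        }
                        catch (UnauthorizedAccessException) { }
                    }
                }
                catch (UnauthorizedAccessException) { }
            });

            return results.ToList();
        }
    }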

For the measured times: Parallelization won't help if you're reading from a single hard drive. What causes latency here is the I/O system. Even a hundred threads would still have to wait for the hard drive. So I'd actually expect that parallelization would slow down the whole operation because of the overhead.

It's interesting that the triple version is the fastest although it reads the directories more often, but that may be caused by file system caching. So your test isn't really valid anyway if you run all three variants directly one after another in one process.
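
For a fairer timing, a single-threaded baseline sketch is shown below (SafeEnumerate is a hypothetical helper that walks the tree with an explicit stack, so one access-denied folder skips that branch instead of aborting the whole search, which SearchOption.AllDirectories would). Running each variant in its own process keeps the file-system cache of one run from flattering the next.

    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.IO;

    public static class Baseline
    {
        public static void Main()
        {
            var watch = Stopwatch.StartNew();
            int count = 0;
            foreach (string file in SafeEnumerate(@"C:\", "*.txt"))
                count++;
            watch.Stop();
            Console.WriteLine($"Found [{count}] files in [{watch.Elapsed.TotalSeconds}]s");
        }

        // Iterative walk: a directory we can't open is skipped, not fatal.
        static IEnumerable<string> SafeEnumerate(string root, string pattern)
        {
            var pending = new Stack<string>();
            pending.Push(root);
            while (pending.Count > 0)
            {
                string dir = pending.Pop();
                string[] files = new string[0];
                try
                {
                    files = Directory.GetFiles(dir, pattern);
                    foreach (string sub in Directory.GetDirectories(dir))
                        pending.Push(sub);
                }
                catch (UnauthorizedAccessException) { }
                catch (IOException) { }
                foreach (string file in files)
                    yield return file;
            }
        }
    }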

René Vogt