3

The program's overall goal is to determine the size of each main folder in a directory. It works well for small drives, but struggles on larger ones: one drive that I absolutely need took over 3 hours. This is a copy of the folder-sizing method I am using:

    public double getDirectorySize(string p)
    {
        // Get an array of all file names under the folder.
        string[] a = Directory.GetFiles(p, "*.*", SearchOption.AllDirectories);

        // Sum the bytes in a loop.
        double b = 0;
        foreach (string name in a)
        {
            if (name.Length < 250) // prevents PathTooLongException
            {
                // Use FileInfo to get the length of each file.
                FileInfo info = new FileInfo(name);
                b += info.Length;
            }
        }

        // Return the total size in bytes.
        return b;
    }

What I was thinking of is using parallel loops, in the form of parallel foreach loops. Each p represents the main folder's name. My idea was to somehow split path p into its subfolders and use parallel foreach loops to continue collecting file sizes; however, each folder has an unknown number of subdirectories, and that is where I am having trouble getting the folder size back. Thanks for the help in advance.

Update

I call this function from the foreach loop below:

    DirectoryInfo di = new DirectoryInfo(Browse_Folders_Text_Box.Text);
    FileInfo[] parsedfilename = di.GetFiles("*.*", System.IO.SearchOption.TopDirectoryOnly);
    parsedfoldername = System.IO.Directory.GetDirectories(Browse_Folders_Text_Box.Text, "*.*", System.IO.SearchOption.TopDirectoryOnly);
    //parsedfilename = System.IO.Directory.GetDirectories(textBox1.Text, "*.*", System.IO.SearchOption.AllDirectories);

    // Process the list of folders found in the directory.
    type_label.Text = "Folder Names \n";

    List<string> NameList = new List<string>();
    foreach (string transfer2 in parsedfoldername)
    {
        this.Cursor = Cursors.WaitCursor;

        // Take the name and last-write date of the final folder in the path.
        string dirName = new DirectoryInfo(transfer2).Name;
        string dirDate = new DirectoryInfo(transfer2).LastWriteTime.ToString();

        NameList.Add(dirName);
        //Form2 TextTable = new Form2(NameList.ToString());

        //Display_Rich_Text_Box.AppendText(dirName);
        //Display_Rich_Text_Box.AppendText("\n");
        Last_Date_Modified_Text_Box.AppendText(dirDate);
        Last_Date_Modified_Text_Box.AppendText("\n");

        try
        {
            double b = getDirectorySize(transfer2);
            MetricByte(b);
        }
        catch (Exception)
        {
            Size_Text_Box.AppendText("N/A \n");
        }
    }

    Display_Rich_Text_Box.Text = string.Join(Environment.NewLine, NameList);
    this.Cursor = Cursors.Default;

What I was thinking with parallel foreach loops was to take the next level of instance names (the subfolder names), which are all on the same level, and run getDirectorySize() on all of them at the same time, because I know there are at least 7 subfolders directly beneath the main folder.
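
Something like this rough sketch is what I have in mind (untested; it assumes the getDirectorySize method above and combines the per-subfolder results with Interlocked.Add):

    // Rough sketch of the idea: size each top-level subfolder on its
    // own thread and combine the results with Interlocked.Add.
    long total = 0;
    string[] subfolders = Directory.GetDirectories(p, "*", SearchOption.TopDirectoryOnly);
    Parallel.ForEach(subfolders, folder =>
    {
        Interlocked.Add(ref total, (long)getDirectorySize(folder));
    });
    // Note: files sitting directly in p still need to be added separately.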

Tasha
  • Have you thought of using a recursive search instead of getting all files on the entire drive all at once with `Directory.GetFiles`? I think it might be more efficient memory-wise than loading a huge result array containing possibly millions of entries. – rory.ap Jul 12 '16 at 17:55
  • Directory.GetFiles is getting the files for just p, which is a folder. I call this function after using a foreach statement to get the main folder names on the drive. I considered a recursive function, but typically they are void functions, and I still want the return of b (the actual total size of the main folder). Could you maybe explain more what you mean? – Tasha Jul 12 '16 at 18:04
  • Also note that `Parallel` only benefits the CPU-bound portions of your program. It's possible that there's some CPU activity that could be parallelized, but I'd suspect that the vast majority of your time is spent in I/O, which won't benefit from parallelization. – D Stanley Jul 12 '16 at 18:04
  • @DStanley -- and memory allocation / page swapping. – rory.ap Jul 12 '16 at 18:05
  • @Tasha -- you don't need to have `void` for recursion. Or you could, and just use an output parameter which you keep adding to. Up to you. – rory.ap Jul 12 '16 at 18:05
  • http://stackoverflow.com/questions/468119/whats-the-best-way-to-calculate-the-size-of-a-directory-in-net ? – Rubens Farias Jul 12 '16 at 18:06
  • @roryap Are you saying that the program spends a lot of time in memory/page swapping or that those benefit from parallelization? – D Stanley Jul 12 '16 at 18:09
  • @DStanley -- no, spends a lot of time which is not helped by parallelization. – rory.ap Jul 12 '16 at 18:09
  • @roryap A recursive program always updates itself, so unless you mean putting the data in some kind of list, wouldn't you lose it if you didn't output it to the screen? – Tasha Jul 12 '16 at 19:02
  • I'm talking about a recursive method that calls itself and either returns the result back up the call stack, summing it up along the way, or passes an output parameter as an argument each time and summing it up that way instead. – rory.ap Jul 12 '16 at 19:04
  • I get it. Thanks. That could be a possibility then – Tasha Jul 12 '16 at 19:05
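
A minimal sketch of the recursion rory.ap describes (hypothetical method name; it returns each subtree's size up the call stack instead of being void):

    // Sketch: recursive sizing that returns the total up the call
    // stack, summing files at this level plus all subdirectories.
    public long GetDirectorySizeRecursive(string path)
    {
        long size = 0;
        foreach (string file in Directory.GetFiles(path))
            size += new FileInfo(file).Length;
        foreach (string subDir in Directory.GetDirectories(path))
            size += GetDirectorySizeRecursive(subDir);
        return size;
    }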

4 Answers

2

Parallel access to the same physical drive will not speed up the work.

Your main problem is the GetFiles method. It walks through all the subfolders collecting every file name up front. Then you loop over the same files again.

Use the EnumerateFiles method instead.

Try this code. It will be much faster.

public long GetDirectorySize(string path)
{
    var dirInfo = new DirectoryInfo(path);
    long totalSize = 0;

    foreach (var fileInfo in dirInfo.EnumerateFiles("*.*", SearchOption.AllDirectories))
    {
        totalSize += fileInfo.Length;
    }
    return totalSize;
}

MSDN:

The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned; when you use GetFiles, you must wait for the whole array of names to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.
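
One caveat: with SearchOption.AllDirectories, the first subfolder you lack permission to read throws UnauthorizedAccessException and aborts the whole enumeration. A sketch (hypothetical method name) that enumerates one directory at a time, so unreadable folders are skipped rather than killing the scan:

    // Sketch: walk the tree with an explicit stack and enumerate one
    // directory at a time, skipping folders that cannot be read.
    public long GetDirectorySizeSkippingErrors(string path)
    {
        long totalSize = 0;
        var dirs = new Stack<string>();
        dirs.Push(path);

        while (dirs.Count > 0)
        {
            var current = new DirectoryInfo(dirs.Pop());
            try
            {
                foreach (FileInfo file in current.EnumerateFiles())
                    totalSize += file.Length;
                foreach (DirectoryInfo sub in current.EnumerateDirectories())
                    dirs.Push(sub.FullName);
            }
            catch (UnauthorizedAccessException) { } // no permission: skip
            catch (DirectoryNotFoundException) { }  // deleted mid-scan: skip
        }
        return totalSize;
    }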

Alexander Petrov
  • Why would enumerating the files be faster? Could you explain a little more? – Tasha Jul 12 '16 at 18:54
  • Well, it is a little faster; it took it down from 3 hours to about 2 hours. Is there anything else we could do to maybe shave off around 30 min? – Tasha Jul 13 '16 at 13:04
0

I've had to do something similar, though not for folder/file sizes.

I don't have the code handy, but I used the following as a starter. It executes in parallel if there are enough files in the directory.

From the source on MSDN:

The following example iterates the directories sequentially, but processes the files in parallel. This is probably the best approach when you have a large file-to-directory ratio. It is also possible to parallelize the directory iteration, and access each file sequentially. It is probably not efficient to parallelize both loops unless you are specifically targeting a machine with a large number of processors. However, as in all cases, you should test your application thoroughly to determine the best approach.

   static void Main()
   {            
      try 
      {
         TraverseTreeParallelForEach(@"C:\Program Files", (f) =>
         {
            // Exceptions are no-ops.
            try {
               // Do nothing with the data except read it.
               byte[] data = File.ReadAllBytes(f);
            }
            catch (FileNotFoundException) {}
            catch (IOException) {}
            catch (UnauthorizedAccessException) {}
            catch (SecurityException) {}
            // Display the filename.
            Console.WriteLine(f);
         });
      }
      catch (ArgumentException) {
         Console.WriteLine(@"The directory 'C:\Program Files' does not exist.");
      }   

      // Keep the console window open.
      Console.ReadKey();
   }

   public static void TraverseTreeParallelForEach(string root, Action<string> action)
   {
      //Count of files traversed and timer for diagnostic output
      int fileCount = 0;
      var sw = Stopwatch.StartNew();

      // Determine whether to parallelize file processing on each folder based on processor count.
      int procCount = System.Environment.ProcessorCount;

      // Data structure to hold names of subfolders to be examined for files.
      Stack<string> dirs = new Stack<string>();

      if (!Directory.Exists(root)) {
         throw new ArgumentException();
      }
      dirs.Push(root);

      while (dirs.Count > 0) {
         string currentDir = dirs.Pop();
         string[] subDirs = {};
         string[] files = {};

         try {
            subDirs = Directory.GetDirectories(currentDir);
         }
         // Thrown if we do not have discovery permission on the directory.
         catch (UnauthorizedAccessException e) {
            Console.WriteLine(e.Message);
            continue;
         }
         // Thrown if another process has deleted the directory after we retrieved its name.
         catch (DirectoryNotFoundException e) {
            Console.WriteLine(e.Message);
            continue;
         }

         try {
            files = Directory.GetFiles(currentDir);
         }
         catch (UnauthorizedAccessException e) {
            Console.WriteLine(e.Message);
            continue;
         }
         catch (DirectoryNotFoundException e) {
            Console.WriteLine(e.Message);
            continue;
         }
         catch (IOException e) {
            Console.WriteLine(e.Message);
            continue;
         }

         // Execute in parallel if there are enough files in the directory.
         // Otherwise, execute sequentially. Files are opened and processed
         // synchronously but this could be modified to perform async I/O.
         try {
            if (files.Length < procCount) {
               foreach (var file in files) {
                  action(file);
                  fileCount++;                            
               }
            }
            else {
               Parallel.ForEach(files, () => 0, (file, loopState, localCount) =>
                                            { action(file);
                                              return (int) ++localCount;
                                            },
                                (c) => {
                                          Interlocked.Add(ref fileCount, c);                          
                                });
            }
         }
         catch (AggregateException ae) {
            ae.Handle((ex) => {
                         if (ex is UnauthorizedAccessException) {
                            // Here we just output a message and go on.
                            Console.WriteLine(ex.Message);
                            return true;
                         }
                         // Handle other exceptions here if necessary...

                         return false;
            });
         }

         // Push the subdirectories onto the stack for traversal.
         // This could also be done before handling the files.
         foreach (string str in subDirs)
            dirs.Push(str);
      }

      // For diagnostic purposes.
      Console.WriteLine("Processed {0} files in {1} milleseconds", fileCount, sw.ElapsedMilliseconds);
   }
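
To adapt this sample to the question, the action can sum file lengths instead of reading every byte; a short sketch (the totalSize accumulator is my addition, not part of the MSDN sample):

    // Sketch: reuse the parallel traversal to total file sizes.
    long totalSize = 0;
    TraverseTreeParallelForEach(@"C:\Program Files", (f) =>
    {
        // FileInfo.Length is far cheaper than File.ReadAllBytes.
        Interlocked.Add(ref totalSize, new FileInfo(f).Length);
    });
    Console.WriteLine("Total: {0} bytes", totalSize);
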
William Xifaras
0

Unfortunately there's no hidden managed or Win32 API that would allow you to get the size of a folder on disk without recursing through it, otherwise Windows Explorer would definitely have taken advantage of it.

Here's a sample method that parallelizes the work, which you could compare against a standard non-parallel recursive function that achieves the same result:

private static long GetFolderSize(string sourceDir)
{
    long size = 0;
    string[] fileEntries = Directory.GetFiles(sourceDir);

    foreach (string fileName in fileEntries)
    {
        Interlocked.Add(ref size, (new FileInfo(fileName)).Length);
    }

    var subFolders = Directory.EnumerateDirectories(sourceDir);

    // Recurse into each subfolder on its own task, skipping reparse
    // points (junctions/symlinks) to avoid cycles and double counting.
    var tasks = subFolders.Select(folder => Task.Factory.StartNew(() =>
    {
        if ((File.GetAttributes(folder) & FileAttributes.ReparsePoint) != FileAttributes.ReparsePoint)
        {
            Interlocked.Add(ref size, GetFolderSize(folder));
        }
    }));

    Task.WaitAll(tasks.ToArray());

    return size;
}

This example will not consume lots of memory unless you have millions of files inside a single folder.

Darin Dimitrov
0

Using the Microsoft Scripting Runtime seems to be about 90% faster:

var fso = new Scripting.FileSystemObject();
double size = fso.GetFolder(path).Size;

Reference: What is the fastest way to calculate a Windows folders size?
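
If adding the COM reference is inconvenient, a late-bound sketch (my assumption of an equivalent; `dynamic` requires .NET 4+) creates the same object by its ProgID:

    // Sketch: create the FileSystemObject via late binding so no
    // interop assembly reference is needed.
    Type fsoType = Type.GetTypeFromProgID("Scripting.FileSystemObject");
    dynamic fso = Activator.CreateInstance(fsoType);
    double size = fso.GetFolder(path).Size;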

Slai
  • What's the library called? I can't seem to find it. – Tasha Jul 13 '16 at 18:43
  • It's in the COM tab. The Path is something like `C:\Windows\SysWOW64\scrrun.dll` – Slai Jul 13 '16 at 19:01
  • What makes it so much faster? – Tasha Jul 13 '16 at 19:14
  • I am guessing it makes far fewer system calls to get the information. It's more that using `FileInfo` is slower because it retrieves more information and might need a separate system call for each `FileInfo.Length`: http://referencesource.microsoft.com/#mscorlib/system/io/fileinfo.cs,0ab84ec3507f6ed4 – Slai Jul 13 '16 at 19:58
  • I am also sure that there are much faster ways to do it in parallel because, for example, TreeSize Free gets the size of all files and folders on my C drive in less than a minute, but I don't know how. – Slai Jul 13 '16 at 20:14