9

I've already built a recursive function to get the directory size of a folder path. It works, but with the growing number of directories I have to search through (and the number of files in each respective folder), it has become a very slow, inefficient method.

static string GetDirectorySize(string parentDir)
{
    long totalFileSize = 0;

    string[] dirFiles = Directory.GetFiles(parentDir, "*.*", 
                            System.IO.SearchOption.AllDirectories);

    foreach (string fileName in dirFiles)
    {
        // Use FileInfo to get length of each file.
        FileInfo info = new FileInfo(fileName);
        totalFileSize = totalFileSize + info.Length;
    }
    return String.Format(new FileSizeFormatProvider(), "{0:fs}", totalFileSize);
}

This searches all subdirectories of the argument path, so the dirFiles array gets quite large. Is there a better method to accomplish this? I've searched around but haven't found anything yet.

Another idea that crossed my mind was putting the results in a cache and, when the function is called again, trying to find the differences and only re-searching folders that have changed. Not sure if that's a good idea either...

nemesv
ikathegreat
  • This is a far more complicated question than you would imagine. I'd suggest calling into a Win32 API method for something like this. – asawyer Mar 22 '12 at 22:44
  • http://stackoverflow.com/q/128618/284240 – Tim Schmelter Mar 22 '12 at 22:45
  • Look through this parallel solution http://stackoverflow.com/questions/2979432/directory-file-size-calculation-how-to-make-it-faster – Sergey Berezovskiy Mar 22 '12 at 22:59
  • 2
    Array size is pretty irrelevant, the cost is 99.9% hitting the disk. You'll have to pay at least once, you can get incremental updates after that from FileSystemWatcher. – Hans Passant Mar 22 '12 at 23:04

5 Answers

26

You are first scanning the tree to get a list of all files, and then reopening every file to get its size. This amounts to scanning twice.

I suggest you use DirectoryInfo.GetFiles, which will hand you FileInfo objects directly. These objects are pre-filled with their length.

In .NET 4 you can also use the EnumerateFiles method, which will return a lazy IEnumerable<FileInfo>.
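
For illustration, a minimal sketch of the approach this answer describes (the method name and the LINQ Sum are my choices, not part of the answer; it assumes .NET 4 for EnumerateFiles):

    using System.IO;
    using System.Linq;

    static long GetDirectorySizeInBytes(string parentDir)
    {
        // EnumerateFiles hands back FileInfo objects whose Length was already
        // populated during the directory scan, so no file is reopened.
        return new DirectoryInfo(parentDir)
            .EnumerateFiles("*", SearchOption.AllDirectories)
            .Sum(fi => fi.Length);
    }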

fat
usr
  • They are not pre-filled, it is still a round trip to the disk. Necessarily so, you don't want stale data. And that's the reason EnumerateFiles got added in .NET 4. – Hans Passant Mar 22 '12 at 23:02
  • At least in .NET 4 they *are* pre-filled. It happens in FileInfoResultHandler.CreateObject calling FileInfo.InitializeFrom calling PopulateFrom(WIN32_FIND_DATA). Please revert your downvote, this answer is correct. – usr Mar 22 '12 at 23:07
  • 1
    It wasn't my vote. Leaving a comment *and* downvoting is not a healthy strategy :) – Hans Passant Mar 22 '12 at 23:14
  • @HansPassant "The value of the Length property is pre-cached if the current instance of the FileInfo object was returned from any of the following DirectoryInfo methods: ... EnumerateFiles" http://msdn.microsoft.com/en-us/library/system.io.fileinfo.length.aspx +1 usr – paparazzo Mar 22 '12 at 23:29
13

This is more cryptic but it took about 2 seconds for 10k executions.

    public static long GetDirectorySize(string parentDirectory)
    {
        return new DirectoryInfo(parentDirectory)
            .GetFiles("*.*", SearchOption.AllDirectories)
            .Sum(file => file.Length);
    }
MrFox
12

Try

        DirectoryInfo DirInfo = new DirectoryInfo(@"C:\DataLoad\");
        Stopwatch sw = new Stopwatch();
        try
        {
            sw.Start();
            Int64 ttl = 0;
            Int32 fileCount = 0;
            foreach (FileInfo fi in DirInfo.EnumerateFiles("*", SearchOption.AllDirectories))
            {
                ttl += fi.Length;
                fileCount++;
            }
            sw.Stop();
            Debug.WriteLine(sw.ElapsedMilliseconds.ToString() + " " + fileCount.ToString());
        }
        catch (Exception Ex)
        {
            Debug.WriteLine(Ex.ToString());
        }

This did 700,000 files in 70 seconds on a non-RAID P4 desktop, so roughly 10,000 files a second. A server-class machine should easily manage 100,000+ per second.

As usr (+1) said, EnumerateFiles returns FileInfo objects that are pre-filled with their length.

paparazzo
4

You may speed your function up a little by using EnumerateFiles() instead of GetFiles(). At least you won't load the full list into memory.

If that's not enough, you should make your function more complex using threads (one thread per directory is too many, but there is no general rule).
You could use a fixed number of threads that pick directories from a queue; each thread calculates the size of a directory and adds it to the total. Something like:

  • Get the list of all directories (not files).
  • Create N threads (one per core, for example).
  • Each thread picks a directory and calculates its size.
  • If there is no other directory in the queue, the thread ends.
  • If there is a directory in the queue, it calculates its size, and so on.
  • The function finishes when all threads terminate.

You can improve the algorithm a lot by spreading the search for directories across all the threads (for example, when a thread parses a directory it adds the folders it finds to the queue). It's up to you to make it more complicated if you see it's still too slow (this task has been used by Microsoft as an example for the new Task Parallel Library).
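
For illustration, a rough sketch of the queue-plus-worker-threads scheme described above (the class name, worker count, and use of Task.Factory.StartNew are my assumptions, not the answer's); it targets .NET 4:

    using System;
    using System.Collections.Concurrent;
    using System.IO;
    using System.Linq;
    using System.Threading;
    using System.Threading.Tasks;

    static class ParallelDirectorySize
    {
        public static long GetDirectorySize(string parentDir)
        {
            // Step 1: collect every directory (the root plus all subdirectories).
            var queue = new ConcurrentQueue<string>(
                Directory.EnumerateDirectories(parentDir, "*", SearchOption.AllDirectories)
                         .Concat(new[] { parentDir }));

            long total = 0;

            // Step 2: one worker per core drains the queue; each worker sums the
            // sizes of the files directly inside the directory it dequeued.
            var workers = Enumerable.Range(0, Environment.ProcessorCount)
                .Select(_ => Task.Factory.StartNew(() =>
                {
                    string dir;
                    while (queue.TryDequeue(out dir))
                    {
                        long size = new DirectoryInfo(dir)
                            .EnumerateFiles("*", SearchOption.TopDirectoryOnly)
                            .Sum(fi => fi.Length);
                        Interlocked.Add(ref total, size);
                    }
                }))
                .ToArray();

            // Step 3: the function finishes when all workers have terminated.
            Task.WaitAll(workers);
            return total;
        }
    }

Whether this beats a single-threaded EnumerateFiles pass depends heavily on the disk, as the comments below point out.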

Adriano Repetti
  • +1. Note that threading and IO-bound tasks produce strange performance results - you have to prototype and measure. – Alexei Levenkov Mar 22 '12 at 23:04
  • Absolutely yes! I think it's more tricky to choose the right number of threads than to write the code to do it. I guess it depends a lot on disk performance for random access. Whatever I do to calculate it I can't be as fast as Windows, I imagine there's some trick...somewhere... – Adriano Repetti Mar 22 '12 at 23:08
  • Since this is IO bound I am not so sure that additional threads are going to buy much if anything. – paparazzo Mar 23 '12 at 00:06
  • It speeds up at least 2x using a pool of 4 threads compared with the same solution (EnumerateFiles) without threads. This may vary a lot because of hardware. When Windows reads a block of data (the directory) it won't read just a few bytes but a whole block, and it'll keep it in cache. – Adriano Repetti Mar 23 '12 at 08:05
-1
    long length = Directory.GetFiles(@"MainFolderPath", "*", SearchOption.AllDirectories)
        .Sum(t => new FileInfo(t).Length);
teo van kot
Shai Segev