20

I have a base directory that contains several thousand folders. Inside these folders there can be between 1 and 20 subfolders, each containing between 1 and 10 files. I'd like to delete all files that are over 60 days old. I was using the code below to get the list of files that I would have to delete:

DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
FileInfo[] oldFiles = 
  dirInfo.GetFiles("*.*", SearchOption.AllDirectories)
    .Where(t=>t.CreationTime < DateTime.Now.AddDays(-60)).ToArray();

But I let this run for about 30 minutes and it still hasn't finished. I'm curious whether anyone can see any way I could improve the performance of the above line, or whether there is a different way I should be approaching this entirely for better performance. Suggestions?

x0n
Abe Miessler
  • You could consider using multiple threads to speed things up; but do make sure they're not looking into the same directories and stuff. – aevitas Jul 19 '13 at 21:49
  • @Steve, it's been running about 5 mins with `EnumerateFiles` and it's still going unfortunately. – Abe Miessler Jul 19 '13 at 22:00
  • Sorry, I'm confused. Could you explain what you mean? – Abe Miessler Jul 19 '13 at 22:04
  • @aevitas: It's highly unlikely that multiple threads is going to make an improvement. The limiting factor is the speed of the disk, which can only do one thing at a time. Your multiple threads would spend most of their time waiting. – Jim Mischel Jul 19 '13 at 22:09
  • Have a look at the memory consumption; it might be huge with so many files and folders. When you leave out the ToArray() and process the results right away in a loop, it doesn't have to build it all up in memory. – Richard Jul 19 '13 at 22:15
  • I have a c# program that does the exact same thing with about 12k subdirectories of financial data. After a long time experimenting with different techniques (including PLINQ and TPL) I discovered that raising a secondary command processor is (or seems to be) the fastest and most robust. – Gayot Fow Jul 19 '13 at 23:01
  • https://github.com/Wintellect/FastFileFinder (no clear license) – user423430 Jan 26 '18 at 16:02

7 Answers

28

This is (probably) as good as it's going to get:

DateTime sixtyLess = DateTime.Now.AddDays(-60);
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
FileInfo[] oldFiles = 
    dirInfo.EnumerateFiles("*.*", SearchOption.AllDirectories)
           .AsParallel()
           .Where(fi => fi.CreationTime < sixtyLess).ToArray();

Changes:

  • Hoisted the 60-days-ago DateTime into a constant, so it's computed once rather than per file, for less CPU load.
  • Used EnumerateFiles.
  • Made the query parallel.

It should run in less time (though I'm not sure how much less).
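
If the goal is to delete rather than to collect an array, here is a minimal sketch of the same idea (deleting while streaming, so nothing large is held in memory; `myBaseDirectory` as in the question):

DateTime sixtyLess = DateTime.Now.AddDays(-60);
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);

foreach (FileInfo fi in dirInfo.EnumerateFiles("*.*", SearchOption.AllDirectories)
                               .Where(f => f.CreationTime < sixtyLess))
{
    try
    {
        fi.Delete(); // delete as we enumerate; no ToArray() buffering
    }
    catch (IOException) { /* file locked or in use: skip or log */ }
    catch (UnauthorizedAccessException) { /* no permission: skip or log */ }
}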

Here is another solution, which might be faster or slower than the first depending on the data:

DateTime sixtyLess = DateTime.Now.AddDays(-60);
DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
FileInfo[] oldFiles = 
     dirInfo.EnumerateDirectories()
            .AsParallel()
            .SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories)
                                .Where(fi => fi.CreationTime < sixtyLess))
            .ToArray();

This moves the parallelism to the top-level folder enumeration; most of the changes from above apply here too.

It'sNotALie.
22

A possibly faster alternative is to use the WINAPI FindNextFile. There is an excellent Faster Directory Enumeration Tool for this, which can be used as follows:

HashSet<FileData> GetPast60(string dir)
{
    // Files last written before this cutoff are considered stale.
    DateTime cutoff = DateTime.Now.AddDays(-60);
    HashSet<FileData> oldFiles = new HashSet<FileData>();

    FileData[] files = FastDirectoryEnumerator.GetFiles(dir, "*.*", SearchOption.AllDirectories);
    for (int i = 0; i < files.Length; i++)
    {
        if (files[i].LastWriteTime < cutoff)
        {
            oldFiles.Add(files[i]);
        }
    }
    return oldFiles;
}
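
A usage sketch (assuming, as in the CodeProject sample, that FileData exposes the full path via a Path field):

HashSet<FileData> stale = GetPast60(@"C:\MyBaseDirectory");
foreach (FileData fd in stale)
{
    File.Delete(fd.Path); // Path assumed to be the full path on the FileData struct
}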

EDIT

So, based on the comments below, I decided to benchmark the solutions suggested here, as well as others I could think of. It was interesting to see that EnumerateFiles seemed to out-perform FindNextFile in C#, while EnumerateFiles with AsParallel was by far the fastest, followed, surprisingly, by the command prompt count. However, do note that AsParallel wasn't getting the complete file count, or was missing some files counted by the others, so you could say the command prompt method is the best.

Applicable Config:

  • Windows 7 Service Pack 1 x64
  • Intel(R) Core(TM) i5-3210M CPU @ 2.50GHz
  • RAM: 6GB
  • Platform Target: x64
  • No Optimization (NB: compiling with optimization produced drastically worse performance)
  • Allow UnSafe Code
  • Start Without Debugging

Below are three screenshots of the benchmark results (Run 1, Run 2, Run 3).

I have included my test code below:

static void Main(string[] args)
{
    Console.Title = "File Enumeration Performance Comparison";
    Stopwatch watch = new Stopwatch();
    watch.Start();

    var allfiles = GetPast60("C:\\Users\\UserName\\Documents");
    watch.Stop();
    Console.WriteLine("Total time to enumerate using WINAPI =" + watch.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles);

    Stopwatch watch1 = new Stopwatch();
    watch1.Start();

    var allfiles1 = GetPast60Enum("C:\\Users\\UserName\\Documents\\");
    watch1.Stop();
    Console.WriteLine("Total time to enumerate using EnumerateFiles =" + watch1.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles1);

    Stopwatch watch2 = new Stopwatch();
    watch2.Start();

    var allfiles2 = Get1("C:\\Users\\UserName\\Documents\\");
    watch2.Stop();
    Console.WriteLine("Total time to enumerate using Get1 =" + watch2.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles2);


    Stopwatch watch3 = new Stopwatch();
    watch3.Start();

    var allfiles3 = Get2("C:\\Users\\UserName\\Documents\\");
    watch3.Stop();
    Console.WriteLine("Total time to enumerate using Get2 =" + watch3.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles3);

    Stopwatch watch4 = new Stopwatch();
    watch4.Start();

    var allfiles4 = RunCommand(@"dir /a: /b /s C:\Users\UserName\Documents");
    watch4.Stop();
    Console.WriteLine("Total time to enumerate using Command Prompt =" + watch4.ElapsedMilliseconds + "ms.");
    Console.WriteLine("File Count: " + allfiles4);


    Console.WriteLine("Press Any Key to Continue...");
    Console.ReadLine();
}

private static int RunCommand(string command)
{
    var process = new Process()
    {
        StartInfo = new ProcessStartInfo("cmd")
        {
            UseShellExecute = false,
            RedirectStandardInput = true,
            RedirectStandardOutput = true,
            CreateNoWindow = true,
            Arguments = String.Format("/c \"{0}\"", command),
        }
    };
    int count = 0;
    // e.Data is null on the final callback when the stream closes; don't count it.
    process.OutputDataReceived += (s, e) => { if (e.Data != null) count++; };
    process.Start();
    process.BeginOutputReadLine();

    process.WaitForExit();
    return count;
}

static int GetPast60Enum(string dir)
{
    return new DirectoryInfo(dir).EnumerateFiles("*.*", SearchOption.AllDirectories).Count();
}

private static int Get2(string myBaseDirectory)
{
    DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
    return dirInfo.EnumerateFiles("*.*", SearchOption.AllDirectories)
               .AsParallel().Count();
}

private static int Get1(string myBaseDirectory)
{
    DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
    return dirInfo.EnumerateDirectories()
               .AsParallel()
               .SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories))
               .Count() + dirInfo.EnumerateFiles("*.*", SearchOption.TopDirectoryOnly).Count();
}


private static int GetPast60(string dir)
{
    return FastDirectoryEnumerator.GetFiles(dir, "*.*", SearchOption.AllDirectories).Length;
}

NB: The benchmark concentrated on raw file counts, not on filtering by modified date.

Chibueze Opata
6

I realize this is very late to the party, but if someone else is looking for this: you can speed things up by orders of magnitude by directly parsing the MFT or FAT of the file system. This requires admin privileges, as I think it will return all files regardless of security, but it can probably take your 30 minutes down to 30 seconds for the enumeration stage at least.

A library for NTFS is here: https://github.com/LordMike/NtfsLib. There is also https://discutils.codeplex.com/, which I haven't personally used.

I would only use these methods for the initial discovery of files over x days old, and then verify them individually before deleting. It might be overkill, but I'm cautious like that.
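
For instance, a hedged sketch of that verification step, where candidatePaths is a hypothetical list of full paths produced by the MFT/FAT scan:

DateTime cutoff = DateTime.Now.AddDays(-60);
foreach (string path in candidatePaths) // candidatePaths: hypothetical scan results
{
    // Re-check through the normal file system API before destroying anything.
    if (File.Exists(path) && File.GetCreationTime(path) < cutoff)
    {
        File.Delete(path);
    }
}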

Matt
3

The method Get1 in the answer above (by It'sNotALie. and Chibueze Opata) misses counting the files in the root directory, so it should read:

private static int Get1(string myBaseDirectory)
{
    DirectoryInfo dirInfo = new DirectoryInfo(myBaseDirectory);
    return dirInfo.EnumerateDirectories()
               .AsParallel()
               .SelectMany(di => di.EnumerateFiles("*.*", SearchOption.AllDirectories))
               .Count() + dirInfo.EnumerateFiles("*.*", SearchOption.TopDirectoryOnly).Count();
}
pisker
1

When using SearchOption.AllDirectories, EnumerateFiles took ages to return the first item. After reading several good answers here, I have for now ended up with the function below. By having it work on one directory at a time and calling itself recursively, it now returns the first item almost immediately. But I must admit that I'm not totally sure about the correct way to use .AsParallel(), so don't use this blindly.

Instead of working with arrays, I would strongly suggest working with enumeration instead. Some mention that the speed of the disk is the limiting factor and that threads won't help; in terms of total time that is very likely true as long as nothing is cached by the OS, but by using multiple threads you can get the cached data returned first, while otherwise it is possible that the cache would be pruned to make space for the new results.

Recursive calls might affect the stack, but most file systems have a limit on how many levels of nesting there can be, so this should not become a real issue.

private static IEnumerable<FileInfo> EnumerateFilesParallel(DirectoryInfo dir)
{
    return dir.EnumerateDirectories()
        .AsParallel()
        .SelectMany(EnumerateFilesParallel)
        .Concat(dir.EnumerateFiles("*", SearchOption.TopDirectoryOnly).AsParallel());
}
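
A usage sketch against the question's 60-day filter (myBaseDirectory as in the question); because the result is a lazy enumeration, matches stream out as they are found:

DateTime sixtyLess = DateTime.Now.AddDays(-60);
foreach (FileInfo fi in EnumerateFilesParallel(new DirectoryInfo(myBaseDirectory))
                            .Where(f => f.CreationTime < sixtyLess))
{
    Console.WriteLine(fi.FullName); // or fi.Delete() once you trust the results
}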
NiKiZe
0

You are using LINQ. It would be faster if you wrote your own method for searching directories recursively, tailored to your special case.

public static DateTime retval = DateTime.Now.AddDays(-60);
public static List<string> log = new List<string>();
public static List<System.IO.FileInfo> oldFiles = new List<System.IO.FileInfo>();

public static void WalkDirectoryTree(System.IO.DirectoryInfo root)
{
    System.IO.FileInfo[] files = null;
    System.IO.DirectoryInfo[] subDirs = null;

    // First, process all the files directly under this folder 
    try
    {
        files = root.GetFiles("*.*");
    }
    // This is thrown if even one of the files requires permissions greater 
    // than the application provides. 
    catch (UnauthorizedAccessException e)
    {
        // This code just writes out the message and continues to recurse. 
        // You may decide to do something different here. For example, you 
        // can try to elevate your privileges and access the file again.
        log.Add(e.Message);
    }
    catch (System.IO.DirectoryNotFoundException e)
    {
        Console.WriteLine(e.Message);
    }

    if (files != null)
    {
        foreach (System.IO.FileInfo fi in files)
        {
            if (fi.LastWriteTime < retval)
            {
                oldFiles.Add(fi); // was files[i], but i is not defined in a foreach
            }

            Console.WriteLine(fi.FullName);
        }

        // Now find all the subdirectories under this directory.
        subDirs = root.GetDirectories();

        foreach (System.IO.DirectoryInfo dirInfo in subDirs)
        {
            // Recursive call for each subdirectory.
            WalkDirectoryTree(dirInfo);
        }
    }            
}
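
Usage is a single call, with myBaseDirectory as in the question; oldFiles then holds the matches:

WalkDirectoryTree(new System.IO.DirectoryInfo(myBaseDirectory));
// oldFiles now contains every FileInfo older than 60 days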
Armen Aghajanyan
0

If you really want to improve performance, get your hands dirty and use NtQueryDirectoryFile, which is internal to Windows, with a large buffer size.

FindFirstFile is already slow, and while FindFirstFileEx is a bit better, the best performance will come from calling the native function directly.
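
For illustration, a minimal P/Invoke sketch of the FindFirstFileEx route with the FIND_FIRST_EX_LARGE_FETCH flag (available on Windows 7 and later), counting files over 60 days old in a single directory. The constants and signatures are the standard Win32 ones; treat it as a starting point, not a drop-in replacement:

using System;
using System.Runtime.InteropServices;

static class LargeFetchExample
{
    const int FindExInfoBasic = 1;           // skip cAlternateFileName lookups for speed
    const int FindExSearchNameMatch = 0;
    const int FIND_FIRST_EX_LARGE_FETCH = 2; // ask the kernel for a larger buffer per call
    const uint FILE_ATTRIBUTE_DIRECTORY = 0x10;
    static readonly IntPtr INVALID_HANDLE_VALUE = new IntPtr(-1);

    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
    struct WIN32_FIND_DATAW
    {
        public uint dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)] public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)] public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern IntPtr FindFirstFileExW(string lpFileName, int fInfoLevelId,
        out WIN32_FIND_DATAW lpFindFileData, int fSearchOp,
        IntPtr lpSearchFilter, int dwAdditionalFlags);

    [DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Unicode)]
    static extern bool FindNextFileW(IntPtr hFindFile, out WIN32_FIND_DATAW lpFindFileData);

    [DllImport("kernel32.dll")]
    static extern bool FindClose(IntPtr hFindFile);

    // Counts files (not subdirectories) older than 60 days in a single directory.
    public static int CountOldFiles(string dir)
    {
        DateTime cutoff = DateTime.Now.AddDays(-60);
        int count = 0;
        WIN32_FIND_DATAW fd;
        IntPtr h = FindFirstFileExW(dir + @"\*", FindExInfoBasic, out fd,
            FindExSearchNameMatch, IntPtr.Zero, FIND_FIRST_EX_LARGE_FETCH);
        if (h == INVALID_HANDLE_VALUE) return 0;
        try
        {
            do
            {
                // Skip directories (including "." and "..").
                if ((fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) != 0) continue;
                long ft = ((long)fd.ftCreationTime.dwHighDateTime << 32)
                          | (uint)fd.ftCreationTime.dwLowDateTime;
                if (DateTime.FromFileTime(ft) < cutoff) count++;
            } while (FindNextFileW(h, out fd));
        }
        finally { FindClose(h); }
        return count;
    }
}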

user541686