4

Here is the scenario:

I have a directory with 2+ million files. The code I have below writes out all the files in about 90 minutes. Does anybody have a way to speed it up or make this code more efficient? I'd also like to write out only the file names in the listing.

string lines = (listBox1.Items.ToString());
string sourcefolder1 = textBox1.Text;
string destinationfolder = @"C:\anfiles";
using (StreamWriter output = new StreamWriter(destinationfolder + "\\" + "MasterANN.txt"))
{
    string[] files = Directory.GetFiles(textBox1.Text, "*.txt");
    foreach (string file in files)
    {
        FileInfo file_info = new FileInfo(file);
        output.WriteLine(file_info.Name);
    }
}

The slowdown is that it writes out one line at a time.

It takes about 13-15 minutes to get all the files it needs to write out.

The following 75 minutes is creating the file.

dplante
  • Similar: http://stackoverflow.com/questions/929276/how-to-recursively-list-all-the-files-in-a-directory-in-c/929277#929277 –  Dec 21 '09 at 16:48
  • 5
    it's not related to your question but don't do this: destinationfolder + "\\" + "MasterANN.txt instead use Path.Combine(destinationFolder, "MasterANN.txt") – albertein Dec 21 '09 at 16:48
  • Is it any quicker if you do this from the command line using dir?, e.g., "dir /b *.txt > c:\anfiles\MasterANN.txt". If so, you could shell out to dir (using the Process class). – Polyfun Dec 21 '09 at 17:01
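Polyfun's suggestion above can be sketched roughly as follows. This is a hypothetical illustration, not tested against the asker's setup; the source and destination paths are assumed, and `dir /b` is run through `cmd.exe` so its output can be redirected into the listing file:

```csharp
// Sketch: shell out to "dir /b" and stream its output into the listing file.
// Paths below are assumptions for illustration.
using System.Diagnostics;
using System.IO;

class DirListing
{
    static void Main()
    {
        var psi = new ProcessStartInfo("cmd.exe", "/c dir /b \"C:\\source\\*.txt\"")
        {
            RedirectStandardOutput = true,
            UseShellExecute = false
        };
        using (Process p = Process.Start(psi))
        using (StreamWriter output = new StreamWriter(@"C:\anfiles\MasterANN.txt"))
        {
            string line;
            while ((line = p.StandardOutput.ReadLine()) != null)
            {
                output.WriteLine(line); // dir /b already emits bare file names
            }
            p.WaitForExit();
        }
    }
}
```

Note that `dir /b` prints names only, which also covers the asker's second requirement of listing just the file names.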

5 Answers

8

It could help if you don't create a FileInfo instance for every file; use Path.GetFileName instead:

string lines = (listBox1.Items.ToString());
string sourcefolder1 = textBox1.Text;
string destinationfolder = @"C:\anfiles";
using (StreamWriter output = new StreamWriter(Path.Combine(destinationfolder, "MasterANN.txt")))
{
    string[] files = Directory.GetFiles(textBox1.Text, "*.txt");
    foreach (string file in files)
    {
        output.WriteLine(Path.GetFileName(file));
    }
}
albertein
6

You're reading 2+ million file paths into memory. Depending on how much memory you have, you may well be swapping. Try breaking the job up into smaller chunks by filtering on the file name.
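One way to sketch this chunking idea, assuming hypothetical source and destination paths: query one leading character at a time, so no single Directory.GetFiles call has to return millions of paths at once.

```csharp
// Sketch of chunking by file-name prefix. The paths and the prefix alphabet
// are assumptions; extend the alphabet to match your actual file names.
using System.IO;

class ChunkedListing
{
    static void Main()
    {
        using (StreamWriter output = new StreamWriter(@"C:\anfiles\MasterANN.txt"))
        {
            foreach (char c in "abcdefghijklmnopqrstuvwxyz0123456789")
            {
                foreach (string file in Directory.GetFiles(@"C:\source", c + "*.txt"))
                {
                    output.WriteLine(Path.GetFileName(file));
                }
            }
        }
    }
}
```

Files whose names start with a character outside the chosen alphabet would be missed, so the prefix list has to cover the real naming scheme.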

Bill Barnes
5

The first thing I would need to know is: where is the slowdown? Is it taking 89 minutes for Directory.GetFiles() to execute, or is the delay spread out over the calls to FileInfo file_info = new FileInfo(file);?

If the delay is from the latter, you can probably speed things up by getting the file name from the path instead of creating a FileInfo instance to get the file name.

System.IO.Path.GetFileName(file);
Bennett Dill
  • it's okay: the FileInfo file_info = new FileInfo(file); output.WriteLine(file_info.Name); –  Dec 21 '09 at 16:51
3

From my experience, it's Directory.GetFiles that's slowing you down (aside from console output). To overcome this, P/Invoke into FindFirstFile/FindNextFile to avoid all the memory consumption and general lagginess.
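A minimal sketch of the P/Invoke approach described above, streaming names one at a time instead of materializing a two-million-element array. The source and destination paths are assumptions for illustration:

```csharp
// Sketch: enumerate files via the Win32 FindFirstFile/FindNextFile APIs,
// writing each name as it arrives. Paths are assumed for illustration.
using System;
using System.IO;
using System.Runtime.InteropServices;

class NativeListing
{
    [StructLayout(LayoutKind.Sequential, CharSet = CharSet.Auto)]
    struct WIN32_FIND_DATA
    {
        public uint dwFileAttributes;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
        public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
        public uint nFileSizeHigh;
        public uint nFileSizeLow;
        public uint dwReserved0;
        public uint dwReserved1;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)]
        public string cFileName;
        [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
        public string cAlternateFileName;
    }

    [DllImport("kernel32.dll", CharSet = CharSet.Auto)]
    static extern IntPtr FindFirstFile(string lpFileName, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto)]
    static extern bool FindNextFile(IntPtr hFindFile, out WIN32_FIND_DATA lpFindFileData);

    [DllImport("kernel32.dll")]
    static extern bool FindClose(IntPtr hFindFile);

    static void Main()
    {
        WIN32_FIND_DATA data;
        IntPtr handle = FindFirstFile(@"C:\source\*.txt", out data);
        if (handle == new IntPtr(-1)) return; // INVALID_HANDLE_VALUE

        using (StreamWriter output = new StreamWriter(@"C:\anfiles\MasterANN.txt"))
        {
            do
            {
                output.WriteLine(data.cFileName); // name only, no FileInfo needed
            }
            while (FindNextFile(handle, out data));
        }
        FindClose(handle);
    }
}
```

The find data already contains the bare file name, so this also avoids the per-file FileInfo allocation discussed in the other answers.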

Anton Gogolev
1

Directory.EnumerateFiles does not need to load all the file names into memory first. Check this out: C# directory.getfiles memory help

In your case, the code could be:

using (StreamWriter output = new StreamWriter(Path.Combine(destinationfolder, "MasterANN.txt")))
{
    foreach (var file in Directory.EnumerateFiles(sourcefolder1, "*.txt"))
    {
        output.WriteLine(Path.GetFileName(file));
    }
}

The documentation says:

The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of names before the whole collection is returned; when you use GetFiles, you must wait for the whole array of names to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.

So if you have sufficient memory, Directory.GetFiles is fine, but Directory.EnumerateFiles is much better when a folder contains millions of files.

Jon
  • Not only better but faster than Directory.GetFiles. Actually this is a well-known "trick", aka the best answer (considering you don't want all that kludge with P/Invoke and messing with third-party libraries). – Goujon Jun 15 '19 at 09:20