
I have a directory that contains nearly 14,000,000 audio samples in *.wav format.

All plain storage, no subdirectories.

I want to loop through the files, but when I use DirectoryInfo.GetFiles() on that folder the whole application freezes for minutes!

Can this be done another way? Perhaps read 1000, process them, then take the next 1000 and so on?

Artjom B.
eddyuk
  • `DirectoryInfo.GetFiles()` is also horrible if you are using a network SAN. It locks all files and blocks others from accessing recently created SAN files. We never did find a non-blocking resolution. – SliverNinja - MSFT Oct 27 '11 at 05:30
  • if you are in a real perf critical spot I would also consider: http://stackoverflow.com/questions/724148/is-there-a-faster-way-to-scan-through-a-directory-recursively-in-net/724184#724184 – Sam Saffron Oct 30 '11 at 07:21

6 Answers


Have you tried the EnumerateFiles method of the DirectoryInfo class?

As MSDN says:

The EnumerateFiles and GetFiles methods differ as follows: When you use EnumerateFiles, you can start enumerating the collection of FileInfo objects before the whole collection is returned; when you use GetFiles, you must wait for the whole array of FileInfo objects to be returned before you can access the array. Therefore, when you are working with many files and directories, EnumerateFiles can be more efficient.
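To get the question's "1000 at a time" behaviour on top of the lazy enumeration, you can buffer the stream into fixed-size batches yourself. A minimal sketch (the method names, `*.wav` filter, and batch size of 1000 are illustrative, not part of any answer here):

```csharp
using System;
using System.Collections.Generic;
using System.IO;

class BatchProcessor
{
    // Walk the directory lazily and hand files to ProcessBatch in chunks,
    // without ever materializing the full 14M-entry array in memory.
    static void ProcessInBatches(string path, int batchSize)
    {
        var batch = new List<string>(batchSize);
        foreach (var file in Directory.EnumerateFiles(path, "*.wav"))
        {
            batch.Add(file);
            if (batch.Count == batchSize)
            {
                ProcessBatch(batch);
                batch.Clear();
            }
        }
        if (batch.Count > 0)
            ProcessBatch(batch); // leftover files from the last partial batch
    }

    // Placeholder for the actual audio processing.
    static void ProcessBatch(List<string> files)
    {
        Console.WriteLine("Processing {0} files", files.Count);
    }
}
```

Because `EnumerateFiles` yields entries as the file system returns them, the first batch can start processing long before the directory listing is complete.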

Marc Gravell
Haris Hasan
  • My GetFiles method is only returning string, not FileInfo. – MrFox Sep 19 '15 at 16:00
  • @MrFox Given `string dir;`, `Directory.GetFiles` / `Directory.EnumerateFiles` return strings, while `new DirectoryInfo(dir).GetFiles` / `new DirectoryInfo(dir).EnumerateFiles` return FileInfo objects. – teamchong Feb 06 '17 at 02:18

In .NET 4.0, Directory.EnumerateFiles(...) is IEnumerable<string> (rather than the string[] of Directory.GetFiles(...)), so it can stream entries rather than buffer them all; i.e.

foreach(var file in Directory.EnumerateFiles(path)) {
    // ...
}
Marc Gravell

You are hitting a limitation of the Windows file system itself. When the number of files in a directory grows very large (and 14M is way beyond that threshold), accessing the directory becomes incredibly slow. It doesn't really matter whether you read one file at a time or 1000; it's the directory access itself that is slow.

One way to solve this is to create subdirectories and break your files apart into groups. If each directory holds 1000-5000 files (a guess - you can experiment to find the actual sweet spot), you should get decent performance opening/creating/deleting files.

This is why applications like Doxygen, which create a file for every class, follow this scheme and put everything into two levels of subdirectories with random names.
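A sketch of the git-style bucketing mentioned in the comments below, where a file's bucket is derived from the first hex byte of a hash of its name (class and method names are illustrative):

```csharp
using System;
using System.IO;
using System.Security.Cryptography;
using System.Text;

class Bucketizer
{
    // Pick a subdirectory name ("00".."ff", 256 buckets) from the SHA1
    // of the file name, so files spread evenly across subdirectories.
    static string BucketFor(string fileName)
    {
        using (var sha1 = SHA1.Create())
        {
            byte[] hash = sha1.ComputeHash(Encoding.UTF8.GetBytes(fileName));
            return hash[0].ToString("x2");
        }
    }

    // Move an existing file into its bucket under the given root.
    static void MoveIntoBucket(string root, string filePath)
    {
        string name = Path.GetFileName(filePath);
        string bucket = Path.Combine(root, BucketFor(name));
        Directory.CreateDirectory(bucket); // no-op if it already exists
        File.Move(filePath, Path.Combine(bucket, name));
    }
}
```

Note that with 14M files, 256 buckets still leaves ~55,000 files per directory; using two hash bytes (two directory levels, as the answer suggests) brings each directory down into the low hundreds.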

DXM
  • +1, exactly so. I would add that it's better to do a DB solution, or use a file system suitable for large number of files; such as ReiserFS. I'm not sure if a ReiserFS driver is available for Windows or not. – Gleno Oct 23 '11 at 08:49
  • Best example is git which puts the objects in folders whose name is the first two letters of the SHA1 hash. – manojlds Oct 23 '11 at 10:24
  • @DXM - can you provide some references about this limitation? I'd always thought NTFS had no problems dealing with large directories (http://technet.microsoft.com/en-us/library/cc781134(WS.10).aspx talks about 300k files in a folder), but explorer was the big slow down. – ligos Oct 25 '11 at 02:57
  • @ligos - nope, take it as-is. I work with digital surveillance video recording. We have a lot of customers with a lot of data (biggest I've worked with was 1.5EB). A while ago one customer noticed that disk performance numbers didn't add up (and they paid a lot $$ for h/w) and after opening support case with Microsoft and hardware vendor, MS rep told us that we need to limit number of files per directory (we used to do same thing, dump everything in one folder). – DXM Oct 25 '11 at 03:11
  • 1
    @DXM - and the number they recommended you limit to was...?? Less than 5k as you recommend in your post? – ligos Oct 25 '11 at 03:36

Use the Win32 API FindFirstFile/FindNextFile functions to enumerate entries one at a time without blocking the app.

You can also call Directory.GetFiles inside a System.Threading.Tasks.Task (TPL) to prevent your UI from freezing.
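A sketch of the TPL approach in a hypothetical WinForms form (the form and `fileListBox` control are assumptions for illustration): the blocking `GetFiles` call runs on a thread-pool thread, and a continuation pushes the result back onto the UI thread. This keeps the UI responsive but does not make the listing itself any faster.

```csharp
using System.IO;
using System.Threading.Tasks;
using System.Windows.Forms;

public partial class SampleBrowser : Form
{
    private void LoadFiles(string path)
    {
        // Capture the UI thread's scheduler so the continuation can
        // safely touch controls (.NET 4.0 TPL; Task.Run needs 4.5).
        var uiScheduler = TaskScheduler.FromCurrentSynchronizationContext();

        Task.Factory.StartNew(() => Directory.GetFiles(path))
            .ContinueWith(t => fileListBox.DataSource = t.Result, uiScheduler);
    }
}
```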

Muhammad Hasan Khan

Enjoy.

    public List<string> LoadPathToAllFiles(string pathToFolder, int numberOfFilesToReturn)
    {
        var dirInfo = new DirectoryInfo(pathToFolder);
        return dirInfo.EnumerateFiles()
                      .Take(numberOfFilesToReturn)
                      .Select(f => f.FullName)
                      .ToList();
    }
Jaryn

I hit this issue of accessing large numbers of files in a single directory all the time. Sub-directories are a good option, but soon even they don't offer much help. What I now do is create an index file - a text file with the names of all the files in the directory (provided I am the one creating files in that directory). I then read the index file and open the actual files from the directory for processing.
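A minimal sketch of this index-file idea, assuming your application is the one writing the data files (class, method, and `index.txt` names are illustrative):

```csharp
using System.Collections.Generic;
using System.IO;

class IndexedDirectory
{
    const string IndexName = "index.txt";

    // Write the data file, then append its name to the sidecar index,
    // so readers never have to enumerate the huge directory.
    public static void CreateDataFile(string dir, string fileName, byte[] data)
    {
        File.WriteAllBytes(Path.Combine(dir, fileName), data);
        File.AppendAllText(Path.Combine(dir, IndexName), fileName + "\n");
    }

    // Stream file names from the index instead of listing the directory.
    public static IEnumerable<string> ListDataFiles(string dir)
    {
        return File.ReadLines(Path.Combine(dir, IndexName));
    }
}
```

The trade-off is that the index can drift out of sync if files are added or deleted outside this code path, so it works best when one writer owns the directory.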

Faizul Hussain