
I am writing a program that should process many small files, say thousands or even millions. I've been testing that part on 500k files, and the first step was just to iterate over a directory tree which has around 45k directories in it (including subdirs of subdirs, etc.) and 500k small files. The traversal of all directories and files, including getting file sizes and calculating the total size, takes about 6 seconds. Now, if I try to open each file while traversing and close it immediately, it looks like it never stops. In fact, it takes way too long (hours...). Since I do this on Windows, I tried opening the files with CreateFileW, _wfopen and _wopen. I didn't read or write anything to the files, although in the final implementation I'll only need to read. However, I didn't see a noticeable improvement in any of the attempts.

I wonder if there's a more efficient way to open the files with any of the available functions, whether C, C++ or the Windows API, or whether the only more efficient way is to read the MFT and read blocks of the disk directly, which I am trying to avoid.
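
For reference, here is roughly the shape of the loop in question (a simplified sketch rather than my exact code; the flags and the missing error handling are just for illustration):

```cpp
// Recursively walk a directory tree, open each file read-only and close it immediately.
// Simplified sketch: error handling and the size/total-size bookkeeping are omitted.
#include <windows.h>
#include <string>

void Walk(const std::wstring& dir)
{
    WIN32_FIND_DATAW fd;
    HANDLE hFind = FindFirstFileW((dir + L"\\*").c_str(), &fd);
    if (hFind == INVALID_HANDLE_VALUE) return;

    do {
        std::wstring name = fd.cFileName;
        if (name == L"." || name == L"..") continue;

        std::wstring path = dir + L"\\" + name;
        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
            Walk(path);                          // recurse into subdirectory
        } else {
            // This open/close is the part that is slow for me.
            HANDLE h = CreateFileW(path.c_str(), GENERIC_READ,
                                   FILE_SHARE_READ, NULL, OPEN_EXISTING,
                                   FILE_ATTRIBUTE_NORMAL, NULL);
            if (h != INVALID_HANDLE_VALUE)
                CloseHandle(h);
        }
    } while (FindNextFileW(hFind, &fd));

    FindClose(hFind);
}
```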

Update: The application that I am working on does backup snapshots with versioning, so it also has incremental backups. The test with 500k files is done on a huge source code repository in order to do versioning, something like an SCM. So the files are not all in one directory; there are around 45k directories as well (mentioned above).

So the proposed solution to zip the files doesn't help, because when the backup is done, that's when all the files are accessed. Hence, I'd see no benefit from that, and it would even incur some performance cost.

Amy
  • Does this question and answer help things? [how to make createfile as fast as possible](http://stackoverflow.com/questions/7430959) – Anya Shenanigans Jan 08 '15 at 16:36
  • I am doing this on an SSD. The issue is with opening/closing the files. – Amy Jan 08 '15 at 16:55
  • Show your code. It's entirely possible your code is in an infinite loop, calling an API wrong, or perhaps performing adequately. But without your code, every suggestion will just be a conjecture or hypothesis. Also, 500,000 files is A LOT of files and I would expect that to be a very time-consuming operation. **What are you really trying to do**? – selbie Jan 08 '15 at 17:01
  • The code is fine. It doesn't recurse infinitely, and it finishes (although after a very long time). It's using FindFirstFile/FindNextFile to traverse the files/directories. I was just doing a benchmark and it turns out that each file open/close takes about 5 ms. That's what I am trying to improve... – Amy Jan 08 '15 at 17:09
  • By the way, this is for a backup application and this is one use case. I cannot disclose further details regarding the application, but this chunk of code is fairly trivial, as I described it earlier. – Amy Jan 08 '15 at 17:22
  • As I recall, NTFS doesn't like single directories to contain huge numbers of files. The index gets large and fragmented. Consider distributing the files among a subtree. http://stackoverflow.com/questions/197162/ntfs-performance-and-large-volumes-of-files-and-directories – Adrian McCarthy Jan 08 '15 at 18:40
  • This question also has useful information: http://stackoverflow.com/questions/115882/how-do-you-deal-with-lots-of-small-files – Adrian McCarthy Jan 08 '15 at 18:44
  • Specific advice from Microsoft on enumerating a large number of small files in a directory: http://support.microsoft.com/kb/2539403 – Adrian McCarthy Jan 08 '15 at 18:47
  • @AdrianMcCarthy: NTFS doesn't care how many files are in one directory. The directory is a [B-Tree](https://en.wikipedia.org/wiki/B-tree) which is always efficiently accessed. It is more likely `Findfirst`/`Findnext` is not traversing it efficiently. Or the NTFS driver is trying to buffer too much. KB2539403 is about an application storing all the file enumeration information, not about the directory size. – wallyk Jan 08 '15 at 19:40
  • @wallyk: KB2539403 says "When individual folders contain large numbers of files (more than 50,000 files), performance issues may occur while enumerating the list of files. ... When an application enumerates the directory contents of a large folder, NTFS and cache manager are tasked with reading and processing large quantities of metadata to perform the enumeration." Yes, it's absolutely about single folders with large numbers of files. – Adrian McCarthy Jan 08 '15 at 20:51

5 Answers


What you are trying to do is intrinsically difficult for any operating system to do efficiently. Walking 45,000 subdirectories requires a lot of disk access no matter how it is sliced.

Any file over about 1,000 bytes is "big" as far as NTFS is concerned. If there were a way to make most data files less than about 900 bytes, you could realize a major efficiency by having the file data stored inside the MFT. Then it would be no more expensive to obtain the data than it is to obtain the file's timestamps or size.

I doubt there is any way to optimize the program's parameters, process options, or even the operating system's tuning parameters to make the application work well. You are faced with multi-hour operation unless you can rearchitect it in a radically different way.

One strategy would be to distribute the files across multiple computers—probably thousands of them—and have a sub-application on each process the local files, feeding whatever results to a master application.

Another strategy would be to re-architect all the files into a few larger files, like big .zip files as suggested by @felicepollano, effectively virtualizing your set of files. Random access within a 4000 GB file is an inherently far more efficient use of resources than accessing 4 million 1 MB files. Moving all the data into a suitable database manager (MySQL, SQL Server, etc.) would also accomplish this and perhaps provide other benefits, like easy searches and an easy archival strategy.
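
For illustration only, here is a rough sketch of the "pack everything into one store" idea using SQLite's C API (SQLite, the `files(path, data)` table and the helper name are assumptions for the example, not something prescribed above):

```cpp
// Sketch: store each small file as one row (path + blob) in a single SQLite
// database file, so only one OS-level file ever has to be opened.
#include <sqlite3.h>
#include <string>
#include <vector>

bool StoreFile(sqlite3* db, const std::string& path,
               const std::vector<char>& contents)
{
    sqlite3_stmt* stmt = nullptr;
    // Hypothetical schema, created elsewhere with:
    //   CREATE TABLE files(path TEXT PRIMARY KEY, data BLOB);
    if (sqlite3_prepare_v2(db,
            "INSERT OR REPLACE INTO files(path, data) VALUES(?1, ?2)",
            -1, &stmt, nullptr) != SQLITE_OK)
        return false;

    sqlite3_bind_text(stmt, 1, path.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_bind_blob(stmt, 2, contents.data(),
                      static_cast<int>(contents.size()), SQLITE_TRANSIENT);

    bool ok = (sqlite3_step(stmt) == SQLITE_DONE);
    sqlite3_finalize(stmt);
    return ok;
}
```

Wrapping many such inserts in a single transaction matters a lot for throughput with SQLite.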

wallyk
  • Hey wallyk, could you take a look at [my question](https://stackoverflow.com/q/76185334/3995261)? It seems to be related, but is also presumably related to some sort of buffering + caching in Windows for NTFS – YakovL May 06 '23 at 13:32

An overhead of 5 to 20ms per file isn't abnormal for an NTFS volume with that number of files. (On a conventional spindled drive, you can't expect much better than that anyway, because it's on the same order as the head seek times. From this point on, I'll assume we're dealing with enterprise-class hardware, SSD and/or RAID.)

In my experience, you can significantly increase throughput by parallelizing the requests, i.e., using multiple threads and/or processes. Most of the overhead appears to be per-thread; the system can open ten files at once nearly as quickly as it can open a single file by itself. I'm not sure why this is. You might need to experiment to find the optimum level of parallelization.
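
As a rough illustration of the parallelization idea (thread count, flags and helper names here are arbitrary choices, not taken from the answer):

```cpp
// Sketch: split a pre-built list of paths across N worker threads and let each
// thread open/close its share of the files independently.
#include <windows.h>
#include <algorithm>
#include <functional>
#include <string>
#include <thread>
#include <vector>

void OpenRange(const std::vector<std::wstring>& paths, size_t begin, size_t end)
{
    for (size_t i = begin; i < end; ++i) {
        HANDLE h = CreateFileW(paths[i].c_str(), GENERIC_READ, FILE_SHARE_READ,
                               NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
        if (h != INVALID_HANDLE_VALUE)
            CloseHandle(h);          // real code would read the file here
    }
}

void OpenAllInParallel(const std::vector<std::wstring>& paths, unsigned threads = 8)
{
    std::vector<std::thread> pool;
    size_t chunk = (paths.size() + threads - 1) / threads;
    for (unsigned t = 0; t < threads; ++t) {
        size_t begin = t * chunk;
        size_t end = std::min(paths.size(), begin + chunk);
        if (begin >= end) break;
        pool.emplace_back(OpenRange, std::cref(paths), begin, end);
    }
    for (auto& th : pool) th.join();
}
```

Vary the thread count and measure; the sweet spot depends on the hardware and the filter drivers installed.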

The system administrator can also significantly improve performance by copying the contents to a new volume, preferably in approximately the same order that they will be accessed. I had to do this recently, and it reduced backup time (for a volume with about 14 million files) from 85 hours to 18 hours.

You might also try OpenFileById() which may perform better for files in large directories, since it bypasses the need to enumerate the directory tree. However, I've never tried it myself, and it might not have much impact since the directory is likely to be cached anyway if you've just enumerated it.
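
For reference, a minimal sketch of how OpenFileById() is typically called; the 64-bit file ID is assumed to come from an earlier GetFileInformationByHandle() call, and whether this actually helps here is untested, as noted above:

```cpp
// Sketch: reopen a file by its NTFS file ID instead of by path.
// hAnyFileOnVolume is any open handle on the same volume (used as a hint).
#include <windows.h>

HANDLE ReopenById(HANDLE hAnyFileOnVolume, const BY_HANDLE_FILE_INFORMATION& info)
{
    FILE_ID_DESCRIPTOR fid = {};
    fid.dwSize = sizeof(fid);
    fid.Type = FileIdType;
    fid.FileId.LowPart  = info.nFileIndexLow;    // 64-bit file ID from
    fid.FileId.HighPart = info.nFileIndexHigh;   // GetFileInformationByHandle()

    return OpenFileById(hAnyFileOnVolume, &fid, GENERIC_READ,
                        FILE_SHARE_READ, NULL, 0);
}
```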

You can also enumerate the files on the disk more quickly by reading them from the MFT, although it sounds as if that isn't a bottleneck for you at the moment.
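
If you do go that route, the usual mechanism is the MFT/USN enumeration ioctl; a compressed sketch (requires administrator rights and an NTFS volume; structure names as in current SDK headers, where older SDKs spell MFT_ENUM_DATA_V0 as MFT_ENUM_DATA):

```cpp
// Sketch: enumerate every file record on an NTFS volume via FSCTL_ENUM_USN_DATA.
#include <windows.h>
#include <winioctl.h>
#include <cstring>

void EnumerateMft()
{
    // Opening the volume itself needs administrator privileges.
    HANDLE hVol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                              FILE_SHARE_READ | FILE_SHARE_WRITE,
                              NULL, OPEN_EXISTING, 0, NULL);
    if (hVol == INVALID_HANDLE_VALUE) return;

    MFT_ENUM_DATA_V0 med = {};
    med.StartFileReferenceNumber = 0;
    med.LowUsn = 0;
    med.HighUsn = MAXLONGLONG;

    alignas(8) BYTE buf[64 * 1024];
    DWORD bytes = 0;
    while (DeviceIoControl(hVol, FSCTL_ENUM_USN_DATA, &med, sizeof(med),
                           buf, sizeof(buf), &bytes, NULL)) {
        // The first 8 bytes of the output are the reference number to start
        // the next call from; USN_RECORDs follow.
        DWORD offset = sizeof(USN);
        while (offset < bytes) {
            USN_RECORD* rec = (USN_RECORD*)(buf + offset);
            // rec->FileReferenceNumber and rec->FileName (not null-terminated,
            // rec->FileNameLength bytes long) describe one file or directory.
            offset += rec->RecordLength;
        }
        DWORDLONG nextFrn;
        std::memcpy(&nextFrn, buf, sizeof(nextFrn));
        med.StartFileReferenceNumber = nextFrn;
    }
    CloseHandle(hVol);
}
```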

Harry Johnston

There is a hack you can try: zip these files with a low compression ratio and then use a zip library to read them; this is usually much faster than reading the files one by one. Of course, this should be done in advance as a preprocessing step.
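
As an illustration of the reading side, a sketch using the single-file miniz library (the library choice and the helper name are my assumptions; any zip library with an index-based reader works similarly):

```cpp
// Sketch: read every entry of a pre-built archive through one open zip file,
// instead of opening hundreds of thousands of individual files.
#include "miniz.h"   // single-file zip library; assumed here for the example

void ReadAllEntries(const char* archivePath)
{
    mz_zip_archive zip = {};
    if (!mz_zip_reader_init_file(&zip, archivePath, 0))
        return;

    mz_uint count = mz_zip_reader_get_num_files(&zip);
    for (mz_uint i = 0; i < count; ++i) {
        size_t size = 0;
        void* data = mz_zip_reader_extract_to_heap(&zip, i, &size, 0);
        if (data) {
            // ... process the file contents here ...
            mz_free(data);
        }
    }
    mz_zip_reader_end(&zip);
}
```

Creating the archive with no compression (the STORE method) keeps extraction close to a plain copy, as discussed in the comments below.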

Felice Pollano
  • Of course, the zip process itself will have to enumerate and open and close each of the files, so unless Amy needs to process the same files multiple times, I don't see how this will be any faster--you're still paying the cost. – Adrian McCarthy Jan 08 '15 at 19:08
  • @AdrianMcCarthy With a zip file there is only one "OS file" to open, and the individual extraction is entirely in user space, bypassing any related kernel open/close handle overhead or directory enumeration. Thus, if the zip file can itself be efficiently listed/seeked (and using STORE for the data), then it might pay off in the given scenario. But I'd like to see tests either way :) – user2864740 Jan 08 '15 at 19:21
  • @user2864740: My point is that to create the zip file, you have to enumerate and open and close all the same files that Amy is trying to access directly today. Unless the zip creator knows a trick that Amy doesn't, this is--at best--going to be equivalent to her solution. – Adrian McCarthy Jan 08 '15 at 19:31
  • @AdrianMcCarthy Supposedly the zip would be generated ahead of time and this process would be done multiple times (or the zip generated as some background / nightly / off-time process), but if not .. – user2864740 Jan 08 '15 at 19:32
  • @user2864740: Amy described the application as a backup application, so it seems likely that every file must be visited exactly once, thus a preprocessing step doesn't seem to be a win. – Adrian McCarthy Jan 08 '15 at 20:53
  • You might try to have fewer files (but bigger ones). Did you consider storing the data in some [sqlite](http://sqlite.org/) database instead? Or using some indexed file like [GDBM](http://www.gnu.org/s/gdbm)? – Basile Starynkevitch Jan 08 '15 at 20:55

You might try doing one pass to enumerate the files to a data structure and then open and close them in a second pass, to see whether interleaving the operations is causing contention.

As I posted in the comments, there are lots of performance concerns about having huge numbers of entries in a single NTFS directory. So if you have control over how those files are distributed across directories, you might want to take advantage of that.

Also check for anti-malware on your system. Some will slow down every file access by scanning the entire file each time you try to access it. Using Sysinternals Procmon can help you spot this kind of problem.

When trying to improve performance, it's a good idea to set a goal. How fast is fast enough?

EDIT: This part of the original answer doesn't apply unless you're using Windows XP or earlier:

Opening and closing each file will, by default, update the last-access time in the index. You could try an experiment where you turn that feature off via registry or command line and see how big of a difference it makes. I'm not sure if it's a feasible thing to do in your actual product, since it's a global setting.
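
For reference, the command-line way to inspect and change that setting is fsutil, run from an elevated prompt:

```
fsutil behavior query disablelastaccess
fsutil behavior set disablelastaccess 1
```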

Adrian McCarthy
  • I added some clarifications in the original post. As for "how fast is fast enough", I'd say taking the time that it takes now down to one fifth (1ms or less per file) would be acceptable. As I mentioned, I could use the MFT directly.. I just want to avoid that if possible – Amy Jan 08 '15 at 20:09
  • Last-access is off by default in modern versions of Windows. (Since Vista, I think.) – Harry Johnston Jan 08 '15 at 21:01
  • @HarryJohnston: You're right. I thought disabling it by default started in Windows 8, but it was actually Vista. – Adrian McCarthy Jan 08 '15 at 21:08
  • I think XP was the first version which provided an option to disable last access updating. By default it caches in such a way that it won't write last access timestamps more than once per hour (which can be changed to update immediately). – wallyk Jan 10 '15 at 05:49

NTFS is slow with large numbers of files, especially if they are in the same directory. When they are divided into separate dirs and subdirs, access is faster. I have experience with many files stored by a video camera board (4 cameras), and it was too slow even to see the number of files and total size (Properties on the root folder). Interestingly, when the disk is FAT32, the same operation is much faster. And all sources say that NTFS is faster... Maybe it is faster for reading a single file, but directory operations are slower.

Why do you need so many files? I hope the directory indexing service is enabled.

i486