I am writing a program that needs to process many small files, say thousands or even millions. I've been testing that part on 500k files: the first step is just to iterate a directory tree that contains around 45k directories (including nested subdirectories) and 500k small files. Traversing all the directories and files, getting each file's size, and computing the total size takes about 6 seconds.

However, if I open each file during the traversal and close it immediately, the run seems to never finish; in practice it takes far too long (hours). Since I am doing this on Windows, I tried opening the files with CreateFileW, _wfopen and _wopen. I didn't read or write anything to the files, although in the final implementation I will only need to read them. None of these attempts showed a noticeable improvement.
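Roughly, the test does the equivalent of the sketch below (simplified, so the exact names and structure here are not my real code; the CreateFileW call can be swapped for _wfopen or _wopen with no noticeable difference):

```cpp
#include <windows.h>
#include <cstdint>
#include <cwchar>
#include <string>

static uint64_t totalSize = 0;

void Walk(const std::wstring& dir)
{
    WIN32_FIND_DATAW fd;
    HANDLE hFind = FindFirstFileW((dir + L"\\*").c_str(), &fd);
    if (hFind == INVALID_HANDLE_VALUE)
        return;

    do {
        if (wcscmp(fd.cFileName, L".") == 0 || wcscmp(fd.cFileName, L"..") == 0)
            continue;

        std::wstring path = dir + L"\\" + fd.cFileName;

        if (fd.dwFileAttributes & FILE_ATTRIBUTE_DIRECTORY) {
            Walk(path);  // recurse into subdirectory
        } else {
            // The size comes straight from the find data; this part alone
            // finishes in ~6 seconds for the 500k files.
            totalSize += (uint64_t(fd.nFileSizeHigh) << 32) | fd.nFileSizeLow;

            // Opening and immediately closing each file is the step that
            // blows the runtime up to hours.
            HANDLE h = CreateFileW(path.c_str(), GENERIC_READ, FILE_SHARE_READ,
                                   nullptr, OPEN_EXISTING,
                                   FILE_ATTRIBUTE_NORMAL, nullptr);
            if (h != INVALID_HANDLE_VALUE)
                CloseHandle(h);
        }
    } while (FindNextFileW(hFind, &fd));

    FindClose(hFind);
}
```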
Is there a more efficient way to open the files with any of the available functions, whether in C, C++ or the Windows API? Or is the only faster option to read the MFT and raw disk blocks directly, which I am trying to avoid?
Update: The application I am working on takes backup snapshots with versioning, so it also does incremental backups. The 500k-file test is run on a huge source code repository in order to version it, something like an SCM. So the files are not all in one directory; there are around 45k directories as well (mentioned above).
So the proposed solution of zipping the files doesn't help, because the backup is exactly when all the files are accessed. I'd see no benefit from that, and it would even incur some performance cost.