
I have to read many (up to 5 million) small (9 KB) files. At the moment they are all in one directory. I fear the lookups will take quadratic time overall, or even n^2 log n, is this right? Is this significant (will the lookups take more time than the actual reading)? Is there a difference in the asymptotic behavior of the running time when the files are cached by the OS?

I use C++ streams for reading the files. At the moment I'm using Windows 7 with NTFS, but I will later run the program on a Linux cluster (not sure which file system).
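For concreteness, a minimal sketch of the stream-based read, assuming each file is slurped in one go (read_file is a hypothetical helper name, not code from my program):

    #include <fstream>
    #include <sstream>
    #include <string>

    // Hypothetical helper: read one small file into a string in a
    // single bulk operation; a 9 KB file fits in one read.
    std::string read_file(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        std::ostringstream buf;
        buf << in.rdbuf();   // one bulk read, no per-line overhead
        return buf.str();
    }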

VPfB
  • Use memory-mapped I/O (see the sketch after these comments). Based on my testing, this is the single biggest file I/O performance improvement you can make. – Violet Giraffe Sep 12 '16 at 12:35
  • Can you change the one-directory limit? You may find some useful information here: http://stackoverflow.com/questions/8238860/maximum-number-of-files-folders-on-linux – VPfB Sep 12 '16 at 12:37
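A minimal sketch of the memory-mapped approach from the first comment, assuming a POSIX system such as the Linux cluster (Windows would need CreateFileMapping instead); read_mapped is a hypothetical helper name:

    #include <fcntl.h>      // open
    #include <sys/mman.h>   // mmap, munmap
    #include <sys/stat.h>   // fstat
    #include <unistd.h>     // close
    #include <string>

    // Hypothetical helper: map a file read-only and copy its bytes out.
    // Returns an empty string on failure.
    std::string read_mapped(const char* path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return {};
        struct stat st;
        if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return {}; }
        void* p = mmap(nullptr, static_cast<size_t>(st.st_size),
                       PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);   // the mapping stays valid after the fd is closed
        if (p == MAP_FAILED) return {};
        std::string data(static_cast<const char*>(p),
                         static_cast<size_t>(st.st_size));
        munmap(p, static_cast<size_t>(st.st_size));
        return data;
    }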

1 Answer


It might not be that bad: if you enumerate the files and process each filename as you encounter it, your OS is quite likely to have the directory entry in its disk cache. And for practical purposes, a disk cache is O(1).
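As a sketch of that enumerate-and-process-immediately pattern, assuming C++17's std::filesystem (Boost.Filesystem offers an equivalent interface on older toolchains); the directory name "data" is a placeholder:

    #include <filesystem>
    #include <fstream>
    #include <sstream>

    namespace fs = std::filesystem;

    int main() {
        // Walk the directory once and read each file as soon as its
        // entry is returned, while that entry is still hot in the
        // OS's cache.
        for (const fs::directory_entry& e : fs::directory_iterator("data")) {
            if (!e.is_regular_file()) continue;
            std::ifstream in(e.path(), std::ios::binary);
            std::ostringstream buf;
            buf << in.rdbuf();          // one bulk read per 9 KB file
            // ... process buf.str() here ...
        }
    }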

What will kill you is a mechanical HDD. You'll have 5 million disk seeks, each of which takes roughly 1/100th of a second. That is 50,000 seconds, more than half a day. This is a task that screams for an SSD.

MSalters