
I have to read many (up to 5 million) small (9 KB) files. At the moment they are all in one directory. I fear the lookups will take quadratic time overall, or even n^2 log n, is this right? Is this significant (will the lookups take more time than the actual reading)? Is there a difference in the asymptotic behavior of the running time when the files are cached by the OS?

I use C++ streams for reading the files. At the moment I'm using Windows 7 with NTFS, but I will later run the program on a Linux cluster (not sure which file system).
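For concreteness, a minimal sketch of the stream-based read, assuming each file is slurped in one go (read_file is a hypothetical helper name, not code from my program):

    #include <fstream>
    #include <sstream>
    #include <string>

    // Hypothetical helper: read one small file into a string in a
    // single bulk operation; a 9 KB file fits in one read.
    std::string read_file(const std::string& path) {
        std::ifstream in(path, std::ios::binary);
        std::ostringstream buf;
        buf << in.rdbuf();   // one bulk read, no per-line overhead
        return buf.str();
    }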

VPfB
  • Use memory-mapped I/O (see the sketch after these comments). Based on my testing, this is the single biggest file I/O performance improvement you can make. – Violet Giraffe Sep 12 '16 at 12:35
  • Can you change the one-directory limit? You may find some useful information here: http://stackoverflow.com/questions/8238860/maximum-number-of-files-folders-on-linux – VPfB Sep 12 '16 at 12:37
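A minimal sketch of the memory-mapped approach from the first comment, assuming a POSIX system such as the Linux cluster (Windows would need CreateFileMapping instead); read_mapped is a hypothetical helper name:

    #include <fcntl.h>      // open
    #include <sys/mman.h>   // mmap, munmap
    #include <sys/stat.h>   // fstat
    #include <unistd.h>     // close
    #include <string>

    // Hypothetical helper: map a file read-only and copy its bytes out.
    // Returns an empty string on failure.
    std::string read_mapped(const char* path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) return {};
        struct stat st;
        if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return {}; }
        void* p = mmap(nullptr, static_cast<size_t>(st.st_size),
                       PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);   // the mapping stays valid after the fd is closed
        if (p == MAP_FAILED) return {};
        std::string data(static_cast<const char*>(p),
                         static_cast<size_t>(st.st_size));
        munmap(p, static_cast<size_t>(st.st_size));
        return data;
    }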

1 Answer


It might not be that bad: if you enumerate the files and process each filename as you encounter it, your OS is quite likely to have the directory entry in its disk cache. And for practical purposes, a disk cache is O(1).
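As a sketch of that enumerate-and-process-immediately pattern, assuming C++17's std::filesystem (Boost.Filesystem offers an equivalent interface on older toolchains); the directory name "data" is a placeholder:

    #include <filesystem>
    #include <fstream>
    #include <sstream>

    namespace fs = std::filesystem;

    int main() {
        // Walk the directory once and read each file as soon as its
        // entry is returned, while that entry is still hot in the
        // OS's cache.
        for (const fs::directory_entry& e : fs::directory_iterator("data")) {
            if (!e.is_regular_file()) continue;
            std::ifstream in(e.path(), std::ios::binary);
            std::ostringstream buf;
            buf << in.rdbuf();          // one bulk read per 9 KB file
            // ... process buf.str() here ...
        }
    }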

What will kill you is a mechanical HDD. You'll have 5 million disk seeks, each of which takes roughly 1/100th of a second. That is 50,000 seconds, more than half a day. This is a task that screams for an SSD.

MSalters