
I have a program that produces a large number of small files (say, 10,000). After they are created, another script accesses them and processes them one by one.

Questions:

  • Does it matter, in terms of performance, how the files are organized (all in one directory or spread across multiple directories)?
  • If so, what is the optimal number of directories and files per directory?

I run Debian with an ext4 file system.
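If it turns out that splitting into subdirectories is worthwhile, this is roughly the kind of bucketing I have in mind (a rough sketch only; the `input/`/`output/` paths, the `*.dat` pattern and the md5-based bucketing are just placeholders):

```bash
#!/bin/bash
# Rough sketch: spread files across 256 subdirectories, bucketed by the
# first two hex digits of an md5 of each file name.
# input/, output/ and *.dat are placeholders, not my real layout.
set -eu
shopt -s nullglob

mkdir -p output/{{0..9},{a..f}}{{0..9},{a..f}}   # creates output/00 .. output/ff

for f in input/*.dat; do
    name=$(basename "$f")
    bucket=$(printf '%s' "$name" | md5sum | cut -c1-2)
    mv "$f" "output/$bucket/"
done
```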


Jakub M.
  • What do you mean by *large number* ? 10^3 ? 10^6 ? 10^9 ? Do you want the files to remain on disk after the script has read them or can you delete them ? – High Performance Mark Oct 24 '12 at 09:53
  • Good question, IMO; could be construed as subjective with no single definitive answer (a bit vague, considering hardware variance etc.) but I think that's being a bit intolerant to answering the problem, to which there will be valuable answers (and the experts will have an idea of the thresholds to work with to crunch numbers). – Grant Thomas Oct 24 '12 at 09:55
  • `10k`, as is the question. Why does it matter if I delete them later? – Jakub M. Oct 24 '12 at 09:56
  • As pointed out in the answers to this question -- http://stackoverflow.com/questions/8238860/maximum-number-of-files-folders-on-linux -- a directory containing a large number of files can be something of a challenge to many utilities such as `ls` and `rm`. This can be ignored if you delete the files as soon as you are done with them. – High Performance Mark Oct 24 '12 at 13:05
  • Do you only want answers for your case (10k files, which is actually not very large these days for modern UNIX filesystems) or do you want general answers about a 'large number'? Do you want to know the asymptotic big-O behavior as N passes 10^6, 10^9 ... (I came here hoping to find that). – smci Nov 20 '17 at 23:14

1 Answer


10k files inside a single folder is not a problem on ext4. It should have the dir_index option enabled by default, which indexes directory contents using a btree-like structure to prevent performance issues.
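If you want to verify that the feature is actually on, `tune2fs` can report it (a quick sketch; `/dev/sda1` stands in for whichever partition holds the files):

```bash
# Check whether dir_index (htree directory indexing) is enabled.
sudo tune2fs -l /dev/sda1 | grep dir_index

# If it is not listed among the filesystem features, it can be enabled,
# and existing directories re-indexed with an offline fsck:
sudo tune2fs -O dir_index /dev/sda1
sudo e2fsck -fD /dev/sda1    # run only on an unmounted filesystem
```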

To sum up, unless you create millions of files or use ext2/ext3, you shouldn't have to worry about system or FS performance issues.

That being said, shell tools and commands don't like to be called with a lot of files as parameters (`rm *`, for example) and may fail with an error like 'Argument list too long'. Look at this answer for what happens then.
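The error comes from the expanded `*` exceeding the kernel's argument-size limit, so the usual workarounds hand the file list to `find` or `xargs` instead of the shell (a sketch; the path and the `*.dat` pattern are placeholders):

```bash
# Delete without building a huge argument list: find unlinks the files itself.
find /path/to/files -maxdepth 1 -type f -name '*.dat' -delete

# Or stream the names to xargs, which batches them under the argv limit.
find /path/to/files -maxdepth 1 -type f -name '*.dat' -print0 | xargs -0 rm --
```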

mbarthelemy
  • What if there are some 40 million small files in a directory? What problems can that cause? – stackit Jun 13 '16 at 09:45
  • @stackit you can break the directory index and Linux will automatically make the filesystem read-only. See https://adammonsen.com/post/1555/ and https://access.redhat.com/solutions/29894 – Adam Monsen Dec 31 '18 at 01:51