3

I am looking for a fast way to find the number of files in a directory on Linux.

Any solution that takes linear time in the number of files in the directory is NOT acceptable (e.g. "ls | wc -l" and similar things) because it would take a prohibitively long amount of time (there are tens or maybe hundreds of millions of files in the directory).

I'm sure the number of files in the directory must be stored as a simple number somewhere in the filesystem structure (inode perhaps?), as part of the data structure used to store the directory entries - how can I get to this number?

Edit: The filesystem is ext3. If there is no portable way of doing this, I am willing to do something specific to ext3.

HighCommander4
  • Almost duplicate: http://stackoverflow.com/questions/1427032/fast-linux-file-count-for-a-large-number-of-files, which talks about how to speed up the standard `ls | wc -l` – Mark Jul 19 '10 at 17:53
  • I don't think that this is stored somewhere as a plain number. (I did NOT read the spec, though.) Simply because it would slow down the FS: you would need to synchronize touch/unlink/mv etc. to get a reliable result, and in case of a crash the number could be corrupted, so you would need to recount the files at some point. Also, at least on my Ubuntu, Nautilus caches the number of objects in a directory by itself; if there were a number in the underlying FS, I don't think it would do that. – Ivo Wetzel Jul 19 '10 at 17:57
  • I'm wondering... is the size of the directory entry (i.e. the size you see for the directory when you do ls -l in its parent directory) related to the number of entries? It does seem to be larger than usual for this directory. – HighCommander4 Jul 19 '10 at 19:53
  • The size of a directory correlates with the maximum number of files that was _ever_ stored in it. A directory is, in a way, a plain file containing a sparse array with pointers to the actual files. – Dummy00001 Jul 19 '10 at 20:07
  • "tens or maybe hundreds of millions of files" is a pathological case. A large number of files in a directory *does* affect performance; this is why `/usr/share/terminfo` has a subdirectory for each initial character used by an entry, so it can be traversed more like a tree to keep file counts down. There are filesystems that are more akin to a database, where the count boils down to a single fast query, but those aren't common (if they exist *at all*, IDK) in the Unix world. – Stephen P Jul 20 '10 at 00:59

5 Answers

6

Why should the data structure contain the number? A tree doesn't need to know its size in O(1) unless that's a requirement (and providing it could require more locking, and possibly create a performance bottleneck).

By tree I don't mean the contents of subdirectories, but the files at -maxdepth 1 -- supposing they are not actually stored as a list..

Edit: ext2 stored them as a linked list.

Modern ext3 implements hashed B-trees.

Having said that, /bin/ls does a lot more than counting, and actually scans all the inodes. Write your own C program or script using opendir() and readdir().

from here:

#include <stdio.h>
#include <sys/types.h>
#include <dirent.h>

int main(void)
{
        int count = 0;          /* initialize, in case opendir() fails */
        DIR *d;                 /* DIR, not "struct DIR" */
        if ((d = opendir(".")) != NULL)
        {
                for (count = 0; readdir(d) != NULL; count++)
                        ;
                closedir(d);
        }
        printf("%d\n", count);  /* note: count includes "." and ".." */
        return 0;
}
Marco Mariani
  • 2
    Actually `ls -a` doesn't read more data from the filesystem than your program, as long as you don't pass other options like `--color` or `-F`. Beware that the count returned by `ls -a` or your program includes the `.` and `..` entries (so an empty directory has two entries). On Linux, `ls -A` skips `.` and `..`. – Gilles 'SO- stop being evil' Jul 19 '10 at 20:01
  • And where does it get the file names? I seem to remember getting them requires reading the inode, but it's been a long time; you may be right. – Marco Mariani Jul 19 '10 at 20:10
  • 2
    @Gilles is right - the filenames are in the directory, not the file inode (after all, a single file inode can have many names). The filenames are available to the program you've written, in `d->d_name`. – caf Jul 20 '10 at 01:27
2

You can use inotify to track and record file create and unlink events in the monitored directory. It spreads the cost of maintaining the file count over time and lets you retrieve the current count instantaneously.

Amardeep AC9MF
1

The inode for the directory does not store the number of files in it, since usually the file count is not needed separately from the list of names in the directory. The directory inode's link count does indirectly give the number of sub-directories (st_nlink is number of sub-dirs plus two).

I think you have no choice except to read through the whole list of files in the directory. find might or might not be faster than ls.

This is an example of why large directories are a problem, even when the directory is implemented using a B-tree.

0

There's no portable way to do this. The low-level file primitives, i.e. readdir, work as if it's a linear list. Clearly, that's an abstraction, and some filesystems might store a count. However, accessing it is inherently filesystem-specific.

Matthew Flaschen
0

If you are willing to jump through hoops, you can put each directory on a separate filesystem, enable quotas, and get the inode count with the `repquota` command.

Mark Wagner