188

I'm trying to figure out the best way to find the number of files in a particular directory when there are a very large number of files (more than 100,000).

When there are that many files, performing ls | wc -l takes quite a long time to execute. I believe this is because it's returning the names of all the files. I'm trying to use as little disk I/O as possible.

I have experimented with some shell and Perl scripts to no avail. How can I do it?

    make sure that your "ls" is /usr/bin/ls and not an alias to something fancier. – glenn jackman Sep 15 '09 at 14:14
  • Similar question with interesting answers here: http://serverfault.com/questions/205071/fast-way-to-recursively-count-files-in-linux – aidan Nov 24 '10 at 17:28
  • Its worth pointing out that most if not all the solutions presented to this question are not specific to *Linux*, but are pretty general to all *NIX-like systems. Perhaps removing the "Linux" tag is appropriate. – Christopher Schultz Apr 08 '18 at 20:09

17 Answers

257

By default ls sorts the names, which can take a while if there are a lot of them. Also there will be no output until all of the names are read and sorted. Use the ls -f option to turn off sorting.

ls -f | wc -l

Note: This will also enable -a, so ., .., and other files starting with . will be counted.
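
If you want to exclude ., .., and other dot files from the count, one hedged option (with GNU coreutils ls, the -A flag overrides the -a implied by -f, as noted in the comments below):

ls -f -A | wc -l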

mark4o
    +1 And I thought I knew everything there was to know about `ls`. – mob Sep 15 '09 at 13:58
    ZOMG. Sorting of 100K lines is nothing - compared to the `stat()` call `ls` does on every file. `find` doesn't `stat()` thus it works faster. – Dummy00001 Jul 19 '10 at 20:03
    `ls -f` does not `stat()` either. But of course both `ls` and `find` call `stat()` when certain options are used, such as `ls -l` or `find -mtime`. – mark4o Jul 19 '10 at 23:46
  • I was happy today that I found the option -1 (one) for ls that boosted the time for the word count, but the -f option is phenomenal!!! Thanks! – sdmythos_gr Sep 19 '11 at 12:44
    For context, this took 1-2 minutes to count 2.5 million jpgs on a small-ish Slicehost box. – philfreo Dec 23 '11 at 18:18
  • @mark4o: don't you also need a -1 in the call to ls to ensure that files will be listed on separate lines? as in `ls -1f | wc -l` – Bryan P Apr 07 '12 at 21:06
    @BryanP: `-1` is the default when stdout is not a terminal (in this case ls stdout is a pipe) – mark4o Apr 07 '12 at 21:24
    If you want to add subdirectories to the count, do `ls -fR | wc -l` – Ryan Walls Dec 22 '12 at 18:20
    Easy to remember: `ls -f1 | wc -l` # F1 is very fast – Jan Wikholm Mar 10 '14 at 04:40
  • thank you! trying to count 500k+ files and ls -f is seconds compared to 10s of minutes. – pjreddie Oct 24 '14 at 06:06
    @Dummy00001 A bit of empirical testing shows me that, on my filesystem with a directory of ~40k files (i.e. relatively *small*), using `ls -f1` versus `ls -1` gives me an order of magnitude of improvement. And that's running the two of them, over and over again, back and forth, one after another (so it's not just a cache observation). So, while `stat()` is expensive, avoiding the sorting cannot be ignored. – Christopher Schultz Feb 06 '15 at 14:54
  • if dont want the . and .. and anything starting with . in the count `ls |wc -l` – Asanke May 29 '17 at 06:23
  • Worked in less than 5 minutes for a directory of 4.3 million files on a Google Compute Engine instance under normal production load. Seems to scale reasonably. – LP Papillon Jan 30 '20 at 14:31
  • for me within a or two minutes for 10 million -f worked great. Thanks – Dev May 06 '21 at 06:23
  • @RyanWalls `ls -fR | wc -l` counts more than just files and directories. `ls -fR` includes `.`, `..` and empty lines. But even when stripping those away with `ls -f1RA . | sed '/^$/d' | wc -l` it still counts more than the `find . -printf x | wc -c` suggested by [ives](https://stackoverflow.com/questions/1427032/fast-linux-file-count-for-a-large-number-of-files#comment112632131_1427098) – jakun Mar 08 '22 at 17:40
    Maybe bc I am a noob - but I feel it is important to note a complete example that includes a path: `ls -f "/yourpath/" | wc -l` and note that you do not want `ls -f "/yourpath/"* | wc -l` – spioter Oct 31 '22 at 14:38
86

The fastest way is a purpose-built program, like this:

#include <stdio.h>
#include <dirent.h>

int main(int argc, char *argv[]) {
    DIR *dir;
    struct dirent *ent;
    long count = 0;

    if (argc < 2) {
        fprintf(stderr, "usage: %s <directory>\n", argv[0]);
        return 1;
    }

    dir = opendir(argv[1]);
    if (NULL == dir) {
        perror(argv[1]);
        return 1;
    }

    /* Count every directory entry, including "." and ".." */
    while((ent = readdir(dir)))
            ++count;

    closedir(dir);

    printf("%s contains %ld files\n", argv[1], count);

    return 0;
}

For my testing, rather than trying to defeat the cache, I ran each of these about 50 times against the same directory, over and over, so that cache effects would not skew the comparison, and I got roughly the following performance numbers (in real clock time):

ls -1  | wc - 0:01.67
ls -f1 | wc - 0:00.14
find   | wc - 0:00.22
dircnt | wc - 0:00.04

That last one, dircnt, is the program compiled from the above source.
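
For reference, a minimal build-and-run sketch (the file name dircnt.c is just an assumption; see also the comments below):

gcc -o dircnt dircnt.c
./dircnt /path/to/big/dir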

EDIT 2016-09-26

Due to popular demand, I've re-written this program to be recursive, so it will drop into subdirectories and continue to count files and directories separately.

Since it's clear some folks want to know how to do all this, I have a lot of comments in the code to try to make it obvious what's going on. I wrote this and tested it on 64-bit Linux, but it should work on any POSIX-compliant system, including Microsoft Windows. Bug reports are welcome; I'm happy to update this if you can't get it working on your AIX or OS/400 or whatever.

As you can see, it's much more complicated than the original, and necessarily so: at least one function must exist to be called recursively, unless you want the code to become very complex (e.g., managing a subdirectory stack and processing it in a single loop). Because we have to check file types, differences between OSs, standard libraries, and so on come into play; I have therefore written a program that tries to be usable on any system where it will compile.

There is very little error checking, and the count function itself doesn't really report errors. The only calls that can really fail are opendir and stat (if you aren't lucky enough to have a system where dirent already contains the file type). I'm not paranoid about checking the total length of the subdir pathnames, but theoretically, the system shouldn't allow any path name that is longer than PATH_MAX. If there are concerns, I can fix that, but it's just more code that needs to be explained to someone learning to write C. This program is intended to be an example of how to dive into subdirectories recursively.

#include <stdio.h>
#include <dirent.h>
#include <string.h>
#include <stdlib.h>
#include <limits.h>
#include <sys/stat.h>

#if defined(WIN32) || defined(_WIN32) 
#define PATH_SEPARATOR '\\' 
#else
#define PATH_SEPARATOR '/' 
#endif

/* A custom structure to hold separate file and directory counts */
struct filecount {
  long dirs;
  long files;
};

/*
 * counts the number of files and directories in the specified directory.
 *
 * path - relative pathname of a directory whose files should be counted
 * counts - pointer to struct containing file/dir counts
 */
void count(char *path, struct filecount *counts) {
    DIR *dir;                /* dir structure we are reading */
    struct dirent *ent;      /* directory entry currently being processed */
    char subpath[PATH_MAX];  /* buffer for building complete subdir and file names */
    /* Some systems don't have dirent.d_type field; we'll have to use stat() instead */
#if !defined ( _DIRENT_HAVE_D_TYPE )
    struct stat statbuf;     /* buffer for stat() info */
#endif

/* fprintf(stderr, "Opening dir %s\n", path); */
    dir = opendir(path);

    /* opendir failed... file likely doesn't exist or isn't a directory */
    if(NULL == dir) {
        perror(path);
        return;
    }

    while((ent = readdir(dir))) {
      if (strlen(path) + 1 + strlen(ent->d_name) > PATH_MAX) {
          fprintf(stdout, "path too long (%ld) %s%c%s", (strlen(path) + 1 + strlen(ent->d_name)), path, PATH_SEPARATOR, ent->d_name);
          return;
      }

/* Use dirent.d_type if present, otherwise use stat() */
#if defined ( _DIRENT_HAVE_D_TYPE )
/* fprintf(stderr, "Using dirent.d_type\n"); */
      if(DT_DIR == ent->d_type) {
#else
/* fprintf(stderr, "Don't have dirent.d_type, falling back to using stat()\n"); */
      sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
      if(lstat(subpath, &statbuf)) {
          perror(subpath);
          return;
      }

      if(S_ISDIR(statbuf.st_mode)) {
#endif
          /* Skip "." and ".." directory entries... they are not "real" directories */
          if(0 == strcmp("..", ent->d_name) || 0 == strcmp(".", ent->d_name)) {
/*              fprintf(stderr, "This is %s, skipping\n", ent->d_name); */
          } else {
              sprintf(subpath, "%s%c%s", path, PATH_SEPARATOR, ent->d_name);
              counts->dirs++;
              count(subpath, counts);
          }
      } else {
          counts->files++;
      }
    }

/* fprintf(stderr, "Closing dir %s\n", path); */
    closedir(dir);
}

int main(int argc, char *argv[]) {
    struct filecount counts;
    counts.files = 0;
    counts.dirs = 0;
    count(argv[1], &counts);

    /* If we found nothing, this is probably an error which has already been printed */
    if(0 < counts.files || 0 < counts.dirs) {
        printf("%s contains %ld files and %ld directories\n", argv[1], counts.files, counts.dirs);
    }

    return 0;
}

EDIT 2017-01-17

I've incorporated two changes suggested by @FlyingCodeMonkey:

  1. Use lstat instead of stat. This will change the behavior of the program if you have symlinked directories in the directory you are scanning. The previous behavior was that the (linked) subdirectory would have its file count added to the overall count; the new behavior is that the linked directory will count as a single file, and its contents will not be counted.
  2. If the path of a file is too long, an error message will be emitted and the program will halt.

EDIT 2017-06-29

With any luck, this will be the last edit of this answer :)

I've copied this code into a GitHub repository to make it a bit easier to get the code (instead of copy/paste, you can just download the source), plus it makes it easier for anyone to suggest a modification by submitting a pull-request from GitHub.

The source is available under Apache License 2.0. Patches* welcome!


  • "patch" is what old people like me call a "pull request".
Christopher Schultz
    Just great! thanks! And for those unaware: you can complile the above code in the terminal: `gcc -o dircnt dircnt.c` and use is like this `./dircnt some_dir` – aesede Mar 19 '15 at 18:51
  • Is there an easy way to make this recursive? – ck_ Sep 28 '15 at 14:14
  • @ck_ Sure, this can easily be made recursive. Do you need help with the solution, or do you want me to write the whole thing? – Christopher Schultz Sep 28 '15 at 18:41
  • How to adjust this to count FILES in a directory recursively (meaning including ALL subdirectories, I have millions of files)? – Carmageddon Sep 19 '16 at 12:06
  • @ChristopherSchultz, I know it's been awhile, but it would be really helpful if you made the recursive. Thank you. – Jim Walker Sep 22 '16 at 16:43
  • @ChristopherSchultz, your effort is much appreciated. I am dealing with a filesystem containing over 75 million files each ~10kb in size. This will be very helpful. Thank you. – Jim Walker Sep 28 '16 at 05:23
    @ChristopherSchultz, the benchmarks you posted above - how big was the directory in question? – Dom Vinyard Oct 17 '16 at 14:00
  • @DomVinyard I can't remember, but you simply will not beat a custom-built program because it doesn't do any of the stuff that you don't want it to do (sorting, extra `stat` calls, etc.) regardless of the size of the directory. I'd appreciate some feedback if you are finding that another method is faster. Remember that disk caching and filesystem caching can really skew the results, so if you run `dircnt` first, then run `find`, you'll see that the second operation is much faster due to caching. I'd recommend running each of them several times to determine the correct timing. – Christopher Schultz Oct 18 '16 at 16:15
  • Quick comment regarding this:`/* Some systems don't have dirent.d_type field; we'll have to use stat() instead */`. That's true, but even if the system does have the `d_type` field in dirent, the filesystem may not support populating it with a meaningful value, and may always return `DT_UNKNOWN`. So, you always have to be prepared to do a `stat`. – Gary R. Van Sickle Dec 24 '16 at 15:36
  • @GaryR.VanSickle Aah. I'll happily add that to the code posted here, then. – Christopher Schultz Dec 24 '16 at 22:43
  • Although I can't use your great `c` code in my bash application, +1 for benchmark testing revealing `ls -f1 | wc - 0:00.14`. Although `ls` might count extra files for spaces in file names and `\n` (new lines) in file names, the count is only for a `yad` progress bar pre-setup and works for my purposes. – WinEunuuchs2Unix Apr 15 '17 at 15:24
  • This is fast but I think using sys call should be faster, please check my answer below. – Nikolay Dimitrov Oct 16 '17 at 08:08
    I really wanted to use this in Python so I packaged it up as the [ffcount](https://github.com/GjjvdBurg/ffcount) package. Thanks for making the code available @ChristopherSchultz! – GjjvdBurg Mar 23 '18 at 22:02
  • Thanks. I can count 36 million files in 4 minues with it. – Akos May 14 '20 at 18:22
  • To add to this response, I also recommend using a tool like `parallel` if you have a large number of directories or files that are spread out across multiple disks (like a RAID partition or networked filesystem, such as NFS, EBS/iSCSI, etc.). By splitting the entire directory tree into groups, and counting each one concurrently, you might be able to further reduce overall time. – ives Sep 01 '20 at 20:12
  • @ives If you'd like to submit a PR to allow dircnt to run under pthreads or something like that, I could put it in there as an option. I'm not motivated to do it myself for a number of reasons, including the fact that it will likely complicate the code quite a bit and make it harder for a beginner to understand. – Christopher Schultz Sep 02 '20 at 13:37
    I'm not saying this is fast, but I fired off a 'find', realised it would take ages but left it running anyway, googled for better tools, found this page, installed C tools & extensions to vscode so I could compile C programs, pasted this code into vscode, built and executed this code to get the result I needed, and was done *before* find had finished. – Matt Parkins Jun 06 '23 at 08:10
41

Use find. For example:

find . -name "*.ext" | wc -l
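
As noted in the comments, this searches recursively. Two hedged variants (assuming GNU find): one limited to the current directory, and one that avoids piping full pathnames into wc:

find . -maxdepth 1 -type f | wc -l   # regular files in the current directory only
find . -printf x | wc -c             # recursive; one character per entry, including directories and "." itself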
igustin
    This will *recursively* find files under the current directory. – mark4o Sep 15 '09 at 14:01
  • On my system, `find /usr/share | wc -l` (~137,000 files) is about 25% faster than `ls -R /usr/share | wc -l` (~160,000 lines including dir names, dir totals and blank lines) on the first run of each and at least twice as fast when comparing subsequent (cached) runs. – Dennis Williamson Sep 15 '09 at 14:01
    If he want only current directory, not the whole tree recursively, he can add -maxdepth 1 option to find. – igustin Sep 15 '09 at 14:47
    It seems the reason `find` is faster than `ls` is because of how you are using `ls`. If you stop sorting, `ls` and `find` have similar performance. – Christopher Schultz Feb 06 '15 at 14:56
    you can speed up find + wc by printing only a single character: `find . -printf x | wc -c`. otherwise you're creating strings from the entire path and passing that to wc (extra I/O). – ives Sep 01 '20 at 19:07
    You should be using `-printf` as @ives shows anyway, so the count is correct when some joker writes filenames with newlines in them. – Toby Speight May 16 '21 at 15:47
20

find, ls, and perl tested against 40,000 files all have about the same speed (though I didn't try to clear the cache):

[user@server logs]$ time find . | wc -l
42917

real    0m0.054s
user    0m0.018s
sys     0m0.040s

[user@server logs]$ time /bin/ls -f | wc -l
42918

real    0m0.059s
user    0m0.027s
sys     0m0.037s

And with Perl's opendir and readdir, the same time:

[user@server logs]$ time perl -e 'opendir D, "."; @files = readdir D; closedir D; print scalar(@files)."\n"'
42918

real    0m0.057s
user    0m0.024s
sys     0m0.033s

Note: I used /bin/ls -f to make sure to bypass any alias, which might slow things down a little bit, and -f to avoid file ordering. ls without -f is twice as slow as find/perl, whereas ls with -f seems to take about the same time:

[user@server logs]$ time /bin/ls . | wc -l
42916

real    0m0.109s
user    0m0.070s
sys     0m0.044s

I would also like to have some script that asks the file system directly, without all the unnecessary information.

The tests were based on the answers of Peter van der Heijden, glenn jackman, and mark4o.
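
These timings were taken with a warm cache. To repeat the measurements with a cold cache (as suggested in the comments), you can drop the Linux page/dentry/inode caches between runs; a sketch, requires root:

sync; echo 3 | sudo tee /proc/sys/vm/drop_caches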

Thomas BDX
    You should definitely clear the cache between tests. The first time I run `ls -l | wc -l` on a folder on an external 2.5" HDD with 1M files, it takes about 3 mins for the operation to finish. The second time it takes 12 seconds IIRC. Also this could potentially depend on your file system too. I was using `Btrfs`. – Behrang May 02 '16 at 04:58
  • Thank you, perl snippet is solution for me. `$ time perl -e 'opendir D, "."; @files = readdir D; closedir D; print scalar(@files)."\n"' 1315029 real 0m0.580s user 0m0.302s sys 0m0.275s` – Pažout Jun 15 '18 at 13:04
  • you can speed up find + wc by printing only a single character: `find . -printf x | wc -c`. otherwise you're creating strings from the entire path and passing that to wc (extra I/O). – ives Sep 01 '20 at 19:07
8

Surprisingly for me, a bare-bones find is very much comparable to ls -f

> time ls -f my_dir | wc -l
17626

real    0m0.015s
user    0m0.011s
sys     0m0.009s

versus

> time find my_dir -maxdepth 1 | wc -l
17625

real    0m0.014s
user    0m0.008s
sys     0m0.010s

Of course, the values on the third decimal place shift around a bit every time you execute any of these, so they're basically identical. Notice however that find returns one extra unit, because it counts the actual directory itself (and, as mentioned before, ls -f returns two extra units, since it also counts . and ..).
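
If you want find to report only the entries inside the directory (excluding the directory itself), a hedged variant (GNU and BSD find support -mindepth):

find my_dir -mindepth 1 -maxdepth 1 | wc -l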

Bogdan Stăncescu
7

Fast Linux file count

The fastest Linux file count I know is

locate -c -r '/home'

There is no need to invoke grep! But as mentioned, you should have a fresh database (updated daily by a cron job, or manually by running sudo updatedb).

From man locate

-c, --count
    Instead  of  writing  file  names on standard output, write the number of matching
    entries only.

Additionally, you should know that it also counts directories as files!
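
If you only want to count entries below a particular directory, anchoring the regular expression should work (a sketch; assumes an mlocate/plocate-style -r regex option and an up-to-date database):

locate -c -r '^/home/'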


BTW: If you want an overview of your files and directories on your system, type

locate -S

It outputs the number of directories, files, etc.

abu_bua
4

You can change the output based on your requirements, but here is a Bash one-liner I wrote to recursively count and report the number of files in a series of numerically named directories.

dir=/tmp/count_these/ ; for i in $(ls -1 ${dir} | sort -n) ; { echo "$i => $(find ${dir}${i} -type f | wc -l),"; }

This looks recursively for all files (not directories) in the given directory and returns the results in a hash-like format. Simple tweaks to the find command can make the kinds of files you're looking to count more specific.

It results in something like this:

1 => 38,
65 => 95052,
66 => 12823,
67 => 10572,
69 => 67275,
70 => 8105,
71 => 42052,
72 => 1184,
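
As pointed out in the comments below, parsing the output of ls is fragile. A hedged sketch of the same idea using a glob instead (note: plain glob order rather than the numeric sort of the original):

dir=/tmp/count_these/ ; for d in "${dir}"*/ ; do echo "$(basename "$d") => $(find "$d" -type f | wc -l),"; done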
mightybs
    I found the example a little bit confusing. I was wondering why there were numbers on the left, instead of directory names. Thank you for this though, I ended up using it with a few minor tweaks. (counting directories and dropping the base folder name. for i in $(ls -1 . | sort -n) ; { echo "$i => $(find ${i} | wc -l)"; } – TheJacobTaylor Apr 27 '12 at 00:00
    The numbers on the left are my directory names from my example data. Sorry that was confusing. – mightybs Mar 19 '14 at 09:58
    `ls -1 ${dir}` won't work properly without more spaces. Also, there's no guarantee that the name returned by `ls` can be passed to `find`, as `ls` escapes non-printable characters for human consumption. (`mkdir $'oddly\nnamed\ndirectory'` if you want a particularly interesting test case). See [Why you shouldn't parse the output of ls(1)](http://mywiki.wooledge.org/ParsingLs) – Charles Duffy Feb 10 '17 at 16:37
3

You can get a count of files and directories with the tree program.

Run the command tree | tail -n 1 to get the last line, which will say something like "763 directories, 9290 files". This counts files and folders recursively, excluding hidden files, which can be added with the flag -a. For reference, it took tree 4.8 seconds on my computer to count my whole home directory, which was 24,777 directories and 238,680 files. find -type f | wc -l took 5.3 seconds, half a second longer, so I think tree is pretty competitive speed-wise.

As long as you don't have any subfolders, tree is a quick and easy way to count the files.

Also, and purely for the fun of it, you can use tree | grep '^├' to only show the files/folders in the current directory - this is basically a much slower version of ls.
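
For example, to include hidden files in the totals, as mentioned above:

tree -a | tail -n 1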

Benubird
3

ls spends more time sorting the file names. Use -f to disable the sorting, which will save some time:

ls -f | wc -l

Or you can use find:

find . -type f | wc -l
Mohammad Anini
2

I came here when trying to count the files in a data set of approximately 10,000 folders with approximately 10,000 files each. The problem with many of the approaches is that they implicitly stat 100 million files, which takes ages.

I took the liberty of extending the approach by Christopher Schultz so that it supports passing directories via arguments (his recursive approach uses stat as well).

Put the following into file dircnt_args.c:

#include <stdio.h>
#include <dirent.h>

int main(int argc, char *argv[]) {
    DIR *dir;
    struct dirent *ent;
    long count;
    long countsum = 0;
    int i;

    for(i=1; i < argc; i++) {
        dir = opendir(argv[i]);

        /* Skip arguments that cannot be opened as directories
           (e.g. plain files matched by the shell glob) */
        if (NULL == dir) {
            perror(argv[i]);
            continue;
        }

        count = 0;
        while((ent = readdir(dir)))
            ++count;

        closedir(dir);

        printf("%s contains %ld files\n", argv[i], count);
        countsum += count;
    }
    printf("sum: %ld\n", countsum);

    return 0;
}

After a gcc -o dircnt_args dircnt_args.c you can invoke it like this:

dircnt_args /your/directory/*

On 100 million files in 10,000 folders, the above completes quite quickly (approximately 5 minutes for the first run; a follow-up run on cache took approximately 23 seconds).

The only other approach that finished in less than an hour was ls with about 1 min on cache: ls -f /your/directory/* | wc -l. The count is off by a couple of newlines per directory though...

Contrary to my expectations, none of my attempts with find returned within an hour :-/

Jörn Hees
  • For somebody that's not a C programmer, can you explain why this would be faster, and how it is able to get the same answer without doing the same thing? – mlissner May 01 '18 at 07:45
  • you don't need to be a C programmer, just understand what it means to stat a file and how directories are represented: directories are essentially lists of filenames and inodes. If you stat a file you access the inode which is somewhere on the drive to for example get info like file-size, permissions, ... . If you're just interested in the counts per dir you do not need to access the inode info, which might save you a lot of time. – Jörn Hees May 24 '18 at 18:58
  • This segfaults on Oracle linux, gcc version 4.8.5 20150623 (Red Hat 4.8.5-28.0.1) (GCC)... relative paths and remote fs's seem to be the cause – Rondo Oct 23 '18 at 00:01
  • Re *"The count is off by a couple of newlines per directory though"*: This can be fixed by combining `-f` with `-A` (uppercase 'a'): `ls -f -A`. The option `-f` enables `-a` (lowercase 'a'), but it can be overridden with `-A`. This was tested with `ls` version 8.30. – Peter Mortensen Nov 09 '20 at 02:36
2

The fastest way on Linux (the question is tagged as Linux) is to use a direct system call. Here's a little program that counts files (only, not directories) in a directory. You can count millions of files, and it is around 2.5 times faster than "ls -f" and around 1.3-1.5 times faster than Christopher Schultz's answer.

#define _GNU_SOURCE
#include <dirent.h>
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/syscall.h>

#define BUF_SIZE 4096

struct linux_dirent {
    long d_ino;
    off_t d_off;
    unsigned short d_reclen;
    char d_name[];
};

int countDir(char *dir) {

    int fd, nread, bpos, numFiles = 0;
    char d_type, buf[BUF_SIZE];
    struct linux_dirent *dirEntry;

    fd = open(dir, O_RDONLY | O_DIRECTORY);
    if (fd == -1) {
        puts("open directory error");
        exit(3);
    }
    while (1) {
        nread = syscall(SYS_getdents, fd, buf, BUF_SIZE);
        if (nread == -1) {
            puts("getdents error");
            exit(1);
        }
        if (nread == 0) {
            break;
        }

        for (bpos = 0; bpos < nread;) {
            dirEntry = (struct linux_dirent *) (buf + bpos);
            d_type = *(buf + bpos + dirEntry->d_reclen - 1);
            if (d_type == DT_REG) {
                // Increase counter
                numFiles++;
            }
            bpos += dirEntry->d_reclen;
        }
    }
    close(fd);

    return numFiles;
}

int main(int argc, char **argv) {

    if (argc != 2) {
        puts("Pass directory as parameter");
        return 2;
    }
    printf("Number of files in %s: %d\n", argv[1], countDir(argv[1]));
    return 0;
}

PS: It is not recursive, but you could modify it to achieve that.
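
A minimal build-and-run sketch (the file name dircnt_syscall.c is just an assumption):

gcc -O2 -o dircnt_syscall dircnt_syscall.c
./dircnt_syscall /path/to/big/dir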

Nikolay Dimitrov
    I'm not sure I agree that this is faster. I haven't traced-through everything that the compiler does with `opendir`/`readdir`, but I suspect it boils down to almost the same code in the end. Making system calls that way is also not portable and, as the Linux ABI is not stable, a program compiled on one system is not guaranteed to work properly on another (though it's fairly good advice to compile anything from source on any *NIX system IMO). If speed is key, this is a good solution if it actually improves speed -- I haven't benchmarked the programs separately. – Christopher Schultz Oct 19 '17 at 13:07
2

You should use getdents in place of ls/find.

Here is a very good article which describes the getdents approach.

http://be-n.com/spw/you-can-list-a-million-files-in-a-directory-but-not-with-ls.html

Here is the extract:

ls and practically every other method of listing a directory (including Python's os.listdir and find .) rely on libc readdir(). However, readdir() only reads 32K of directory entries at a time, which means that if you have a lot of files in the same directory (e.g., 500 million directory entries) it is going to take an insanely long time to read all the directory entries, especially on a slow disk. For directories containing a large number of files, you'll need to dig deeper than tools that rely on readdir(). You will need to use the getdents() system call directly, rather than helper methods from the C standard library.

We can find the C code to list the files using getdents() from here:

There are two modifications you will need to make in order to quickly list all the files in a directory.

First, increase the buffer size from X to something like 5 megabytes.

#define BUF_SIZE 1024*1024*5

Then modify the main loop where it prints out the information about each file in the directory to skip entries with inode == 0. I did this by adding

if (dp->d_ino != 0) printf(...);

In my case I also really only cared about the file names in the directory so I also rewrote the printf() statement to only print the filename.

if (d->d_ino) printf("%s\n", (char *) d->d_name);

Compile it (it doesn't need any external libraries, so it's super simple to do)

gcc listdir.c -o listdir

Now just run

./listdir [directory with an insane number of files]
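
If you only need the count rather than the listing, you can pipe the output through wc (a sketch; assumes the modified printf above prints one name per line):

./listdir [directory with an insane number of files] | wc -l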
Dev123
    Note that Linux does a read-ahead, so `readdir()` is not actually slow. I need solid figure before I believe that it's worth to throw away portability for this performance gain. – fuz May 12 '18 at 19:03
  • Can you add some benchmarks, comparing the two methods? Incl. under which conditions, e.g. number of files, cold/warm filesystem cache, hardware, disk type (HDD vs. SSD), file system type (e.g., ext4 or NTFS), disk fragmentation state, computer system, and operating system (e.g. Ubuntu 16.04), with version information))? You can [edit your answer](https://stackoverflow.com/posts/49398863/edit) (but without "Edit:", "Update:", or similar). – Peter Mortensen Nov 09 '20 at 03:04
  • What is the scope of getdents()? Only for Linux? – Peter Mortensen Nov 09 '20 at 03:21
1

You could try whether using opendir() and readdir() in Perl is faster. For an example of those functions, look here.

heijp06
    usage: perl -e 'opendir D, "."; @files = readdir D; closedir D; print scalar(@files)' – glenn jackman Sep 15 '09 at 14:11
  • In a script: `#!/usr/bin/env perl5 use strict; use warnings; eval 'exec perl5 -S $0 ${1+"$@"}' if 0; # not running under some shell foreach (@ARGV) { opendir D, "$_"; my @files = readdir D; closedir D; print "$_: "; print scalar(@files); print "\n"; } ` – RJVB Aug 31 '23 at 13:40
1

This answer is faster than almost everything else on this page for very large, deeply nested directories:

https://serverfault.com/a/691372/84703

locate -r '.' | grep -c "^$PWD"
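
Combining this with the -c option shown in another answer avoids the grep entirely (a sketch; assumes $PWD contains no regex metacharacters):

locate -c -r "^$PWD"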

ck_
    Nice. Since you already have an up-to-date db of all files, no need to go at it again. But unfortunately , you must make sure the updatedb command has already run and completed for this method. – Chris Reid Mar 18 '18 at 22:00
    you don't need to grep. Use `locate -c -r '/path'` like in [abu_bua's solution](https://stackoverflow.com/a/49991437/995714) – phuclv Sep 25 '18 at 02:42
-2

I realized that, when you have a huge amount of data, not doing the processing in memory is faster than "piping" the commands. So I saved the result to a file and analyzed it afterwards:

ls -1 /path/to/dir > count.txt && wc -l count.txt
Mat M
    this is not the fastest solution because hard disks are extremely slow. There are other more efficient ways that were posted years before you – phuclv Sep 25 '18 at 02:43
  • Can you add actual measurements for the two ways (piping and intermediate file) to your answer (incl. under which conditions, e.g. number of files, hardware, disk type (HDD vs. SSD), file system type (e.g., [ext4](https://en.wikipedia.org/wiki/Ext4) or [NTFS](https://en.wikipedia.org/wiki/NTFS)), disk fragmentation state, computer system, and operating system (e.g. [Ubuntu 16.04](https://en.wikipedia.org/wiki/Ubuntu_version_history)), with version information))? You can [edit your answer](https://stackoverflow.com/posts/35363784/edit) (but ***without*** "Edit:", "Update:", or similar). – Peter Mortensen Nov 09 '20 at 02:06
-5

The first 10 directories with the highest number of files.

dir=/ ; for i in $(ls -1 ${dir} | sort -n) ; { echo "$(find ${dir}${i} \
    -type f | wc -l) => $i,"; } | sort -nr | head -10
    This certainly looks astonishingly similar to the answer (with the same bugs) [written by mightybs](http://stackoverflow.com/a/9266059/14122). If you're going to extend or modify code written by someone else, crediting them is appropriate. Understanding the code you're using in your answers enough to identify and fix its bugs is even *more* appropriate. – Charles Duffy Feb 10 '17 at 16:39
-7

I prefer the following command to keep track of the changes in the number of files in a directory.

watch -d -n 0.01 'ls | wc -l'

The command keeps a window open that tracks the number of files in the directory, with a refresh interval of 0.01 seconds.

Anoop Toffy
    are you sure that `ls | wc -l` will finish for a folder with thousands or millions of files in 0.01s? even your `ls` is hugely inefficient compared to other solutions. And the OP just want to get the count, not sitting there looking at the output changing – phuclv Sep 25 '18 at 02:39
  • Well. Well. I found an elegant solution which works for me. I would like to share the same, hence did. I don't know 'ls' command in linux is highly inefficient. What are you using instead of that ? And 0.01s is the refresh rate. Not the time. if you havn't used watch please refer man pages. – Anoop Toffy Sep 25 '18 at 05:15
  • well I did read the `watch` manual after that comment and see that 0.01s (not 0.1s) is an unrealistic number because the refresh rate of most PC screens is only 60Hz, and this doesn't answer the question in any way. The OP asked about "Fast Linux File Count for a large number of files". You also didn't read any available answers before posting – phuclv Sep 25 '18 at 07:41
  • I did read the answers. But what I posted is a way of keeping track of changing number of file in a directory. for eg: while copying file from one location to another the number of file keeps changes. with the method I poster one can keep track of that. I agree that the post I made no where modify or improve any previous posts. – Anoop Toffy Sep 25 '18 at 10:08
  • The question specifically wants something that is _faster_ than `ls | wc -l`, which this clearly is not. – Toby Speight May 16 '21 at 15:52