
Suppose you have a very simple program that lists all the subdirectories of a given directory. Sounds simple enough? Except the only way to list all subdirectories in Java is to use a FilenameFilter combined with File.list().
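For reference, that approach looks something like this (a minimal sketch; the class and method names are mine):

```java
import java.io.File;
import java.io.FilenameFilter;
import java.util.Arrays;

public class SubdirLister {
    // Returns the names of the immediate subdirectories of dir.
    // list() hands the filter every entry, and the filter constructs a
    // File and calls isDirectory() for each one -- that per-entry test
    // is what gets expensive in a huge directory.
    public static String[] listSubdirs(File dir) {
        String[] subdirs = dir.list(new FilenameFilter() {
            @Override
            public boolean accept(File current, String name) {
                return new File(current, name).isDirectory();
            }
        });
        return subdirs == null ? new String[0] : subdirs;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(listSubdirs(new File("."))));
    }
}
```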

This works for the trivial case, but when the folder has, say, 150,000 files and 2 subfolders, it's silly to wait 45 seconds while it iterates through all the files testing file.isDirectory(). Is there a better way to list subdirectories?


PS. Sorry, please save the lectures on having too many files in the same directory. Our live environment has this as part of the requirement.

zmarties
erotsppa
  • I would try to avoid getting into that situation in the first place. Having a vast number of files in a directory is likely to make any number of file system operations slow. – Jon Skeet Jun 23 '09 at 20:29
  • Sorry, our live environment has this as part of the requirement. – erotsppa Jun 23 '09 at 20:30
  • Most other code would have to do something similar anyway. Btw, a loop of 150,000 iterations doesn't take 45 seconds. It's the IO that slows things down. – nos Jun 23 '09 at 20:40
  • NIO2 (Java 7) will address this problem using a lazy iterator for directory listing – dfa Jun 23 '09 at 20:47
  • @dfa You'll still have to go through the entire list of files to make sure you found all the subdirectories, so is this really going to help except a few less strings being created all at once? – Hardwareguy Jun 23 '09 at 21:03
  • What system is this based on? And even more important: What file system? – Emil H Jun 23 '09 at 21:05
  • @Hardwareguy no it's lazy. You don't wait 45 seconds to access the first entry... – dfa Jun 23 '09 at 21:10
  • @dfa but, since you eventually have to access them all won't you still be doing the same amount of work on the disk? – Hardwareguy Jun 23 '09 at 21:12
  • @Hardwareguy you don't waste 150,000 array elements, but you do the same amount of work on the disk (in the worst case) – dfa Jun 23 '09 at 21:20
  • @dfa there is only a worst case because he won't know if he's gotten all the subdirectories until he has checked every file – Hardwareguy Jun 23 '09 at 21:21
  • @erotsppa - were any of the answers helpful? It is considered good form to accept one of them that was most helpful. – DVK Oct 08 '09 at 03:14
  • You might be interested in the solution I've found for my own problem which was quite similar to yours: http://stackoverflow.com/questions/13378382/most-efficient-fastest-way-to-get-a-single-file-from-a-directory – gaborous Oct 20 '13 at 22:10

14 Answers

11

As has already been mentioned, this is basically a hardware problem. Disk access is always slow, and most file systems aren't really designed to handle directories with that many files.

If you for some reason have to store all the files in the same directory, I think you'll have to maintain your own cache. This could be done using a local database such as SQLite or HSQLDB. If you want extreme performance, use a Java TreeSet and cache it in memory. This means, at the very least, that you'll have to read the directory less often, and it could possibly be done in the background. You can reduce the need to refresh the list even further by using your system's native file update notification API (inotify on Linux) to subscribe to changes to the directory.

This doesn't seem to be possible for you, but I once solved a similar problem by "hashing" the files into subdirectories. In my case, the challenge was to store a couple of million images with numeric ids. I constructed the directory structure as follows:

images/[id - (id % 1000000)]/[id - (id % 1000)]/[id].jpg

This has worked well for us, and it's the solution I would recommend. You could do something similar with alphanumeric filenames by simply taking the first two letters of the filename, and then the next two letters, for the directory levels. I've done that as well once, and it did the job too.
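A sketch of that bucketing formula in code (the class and method names are mine; the arithmetic follows the path template above):

```java
public class ImagePaths {
    // Maps a numeric id to the bucketed path described above:
    // images/[id - (id % 1000000)]/[id - (id % 1000)]/[id].jpg
    public static String imagePath(long id) {
        long millions = id - (id % 1000000);  // first-level bucket
        long thousands = id - (id % 1000);    // second-level bucket
        return "images/" + millions + "/" + thousands + "/" + id + ".jpg";
    }
}
```

Each first-level directory then holds at most 1,000 subdirectories, and each of those holds at most 1,000 images.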

Emil H
8

Do you know the finite list of possible subdirectory names in advance? If so, loop over all the possible names and check each one for existence as a directory.
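A minimal sketch of that probe-by-name idea (assuming the candidate list is known up front; names are mine):

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class KnownSubdirs {
    // Probes a known list of candidate names instead of scanning the
    // whole directory: one existence/type check per candidate, rather
    // than one per entry in a 150,000-file listing.
    public static List<String> existing(File parent, List<String> candidates) {
        List<String> found = new ArrayList<String>();
        for (String name : candidates) {
            if (new File(parent, name).isDirectory()) {
                found.add(name);
            }
        }
        return found;
    }
}
```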

Otherwise, you cannot get ONLY directory names in most underlying OSs (e.g. in Unix, a directory listing is merely a read of the contents of the "directory" file, so there's no way to find "just the directories" quickly without listing all the files).

However, NIO.2 in Java 7 (see http://java.sun.com/developer/technicalArticles/javase/nio/#3) offers a way to get a streaming directory listing, so you don't end up with a full array of file elements cluttering your memory/network.
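With NIO.2 that looks roughly like this (a sketch using Files.newDirectoryStream, which yields entries lazily; the class name is mine):

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class StreamedSubdirs {
    // Entries are produced lazily by the stream; the full listing is
    // never materialized as one big array in memory.
    public static List<String> subdirs(Path dir) throws IOException {
        List<String> result = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                if (Files.isDirectory(entry)) {
                    result.add(entry.getFileName().toString());
                }
            }
        }
        return result;
    }
}
```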

DVK
  • Even if 1.7 were out, wouldn't you still have to go through the entire stream to see if you got all the subdirectories, so this is only a tiny memory optimization? – Hardwareguy Jun 23 '09 at 21:00
  • I assume (for lack of precise documentation) that streaming would avoid keeping already-iterated stuff in memory. – DVK Jun 23 '09 at 21:42
  • Using the nio streaming helps performance see this answer with examples https://stackoverflow.com/questions/31706058/get-large-directory-content-faster-java-io-file-alternatives – Luke May 13 '19 at 23:54
7

There's actually a reason why you got the lectures: it's the correct answer to your problem. Here's the background, so that perhaps you can make some changes in your live environment.

First: directories are stored on the filesystem. Think of them as files, because that's exactly what they are. When you iterate through a directory, you have to read its blocks from the disk. Each directory entry requires enough space to hold the filename, the permissions, and information on where that file is found on disk.

Second: directories aren't stored with any internal ordering (at least, not in the filesystems where I've worked with directory files). If you have 150,000 entries and 2 sub-directories, those 2 sub-directory references could be anywhere within the 150,000. You have to iterate to find them; there's no way around that.

So, let's say that you can't avoid the big directory. Your only real option is to try to keep the blocks comprising the directory file in the in-memory cache, so that you're not hitting the disk every time you access them. You can achieve this by regularly iterating over the directory in a background thread -- but this is going to cause undue load on your disks, and interfere with other processes. Alternatively, you can scan once and keep track of the results.

The alternative is to create a tiered directory structure. If you look at commercial websites, you'll see URLs like /1/150/15023.html; this is meant to keep the number of files per directory small. Think of it as a B-tree index in a database.

Of course, you can hide that structure: you can create a filesystem abstraction layer that takes filenames and automatically generates the directory tree where those filenames can be found.
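A hypothetical helper in that spirit (the bucket sizes are illustrative, inferred from the /1/150/15023.html example above; the names are mine):

```java
public class TieredPaths {
    // Maps a numeric id to a two-level bucketed path, e.g.
    // id 15023 -> /1/150/15023.html, so that no single directory
    // accumulates more than a few hundred entries.
    public static String tieredPath(int id) {
        return "/" + (id / 10000) + "/" + (id / 100) + "/" + id + ".html";
    }
}
```

An abstraction layer like the one described would call something like this whenever it translates a logical filename into an on-disk location.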

kdgregory
  • So did I get downvoted because I gave an invalid answer (and if yes, please correct me) or because I gave an answer that you didn't like? – kdgregory Jun 23 '09 at 20:59
  • I didn't downvote you but it's pretty clear in the question what he's asking; help finding a directory quickly, not help organizing his files. – Hardwareguy Jun 23 '09 at 21:02
  • I'm not the one that downvoted it, but you're making a lot of claims about the internal workings of filesystems without any references whatsoever, and without even knowing which file system is actually being used. That makes me a bit skeptical about the correctness of your post, although I'd love to be proved wrong. – Emil H Jun 23 '09 at 21:05
  • @hardwareguy - The point of my posting was that you're paying IO penalties by having large directory structures – kdgregory Jun 23 '09 at 21:21
  • @emil - all directory structures share similar characteristics: they maintain a list of files. Some OS's may have a special "retrieve all sub-directories" syscall; I've never seen one for Unix or Windows, but the system calls for both systems have grown past where I even try to keep track. As I noted in my comment, I'd prefer a real example rather than an anonymous downvote. Regardless, it's not exposed through the Java API. – kdgregory Jun 23 '09 at 21:25
  • Regarding references, here's one for EXT2, which (with its sibling EXT3) is one of the more common filesystems on Linux: http://www.nongnu.org/ext2-doc/ext2.html#DIRECTORY -- it was the top result from Google, but I can't speak to the credentials of the author. Personally, I haven't worked with directory data since Linux SVR3. – kdgregory Jun 23 '09 at 21:26
  • I think that this is still useful despite not answering the particular question. It takes a step back and provides some background about why the directory querying is slow and some solutions. Sometimes the answer to a question is to ask whether you're asking the *right* question (sorry if that sounds a little 'meta') – Brian Agnew Jun 23 '09 at 21:26
  • s/Linux/Unix/ (and some more characters so I can enter this comment) – kdgregory Jun 23 '09 at 21:28
6

The key problem could be the File.isDirectory() call made in a loop.

File.isDirectory() can be extremely slow. I've seen NFS take 10 seconds to process a 200-file directory.

If you can avoid File.isDirectory() calls altogether (e.g. by testing for an extension, where no extension == directory), you can improve performance drastically.
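A sketch of that heuristic (only safe when you control the naming convention, as the answer stresses; the names are mine):

```java
public class NameHeuristic {
    // Treats any entry without an extension as a directory, skipping
    // the expensive per-entry isDirectory() call entirely. This is a
    // convention-dependent shortcut, not a general-purpose test.
    public static boolean looksLikeDirectory(String name) {
        return name.indexOf('.') < 0;
    }
}
```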

Otherwise I would suggest JNA/JNI, or writing a native script that does this for you.

The jCIFS library lets you manipulate Windows network shares more efficiently. I am not aware of a library that does the same for other network file systems.

Roman Zenka
  • Directories can have an extension. Files can omit extensions. So your answer doesn't hold up. – BalusC Nov 06 '09 at 16:03
  • @BalusC Yes. But sometimes you have the naming under control - for instance you know files are images with an extension from a given set, and directories are always produced without a dot. If that is the case, you can speed things up a lot. – Roman Zenka Aug 22 '11 at 21:42
6

I came across a similar question when debugging performance in a Java application that enumerates a large number of files. It was using the old approach:

for (File f : new File("C:\\").listFiles()) {
    if (f.isDirectory()) {
        continue;
    }
}

And it appears that each f.isDirectory() call goes into the native filesystem, which, at least on NTFS, is very slow. Java 7 NIO has additional APIs, but not all of their methods perform well. I'll just provide the JMH benchmark results here:

Benchmark                  Mode  Cnt  Score    Error  Units
MyBenchmark.dir_listFiles  avgt    5  0.437 ±  0.064   s/op
MyBenchmark.path_find      avgt    5  0.046 ±  0.001   s/op
MyBenchmark.path_walkTree  avgt    5  1.702 ±  0.047   s/op

The numbers come from executing this code:

java -jar target/benchmarks.jar -bm avgt -f 1 -wi 5 -i 5 -t 1

static final String testDir = "C:/Sdk/Ide/NetBeans/src/dev/src/";
static final int nCycles = 50;

public static class Counter {
    int countOfFiles;
    int countOfFolders;
}

@Benchmark
public List<File> dir_listFiles() {
    List<File> files = new ArrayList<>(1000);

    for( int i = 0; i < nCycles; i++ ) {
        File dir = new File(testDir);

        files.clear();
        for (File f : dir.listFiles()) {
            if (f.isDirectory()) {
                continue;
            }
            files.add(f);
        }
    }
    return files;
}

@Benchmark
public List<Path> path_walkTree() throws Exception {
    final List<Path> files = new ArrayList<>(1000);

    for( int i = 0; i < nCycles; i++ ) {
        Path dir = Paths.get(testDir);

        files.clear();
        Files.walkFileTree(dir, new SimpleFileVisitor<Path> () {
            @Override
            public FileVisitResult visitFile(Path path, BasicFileAttributes arg1) throws IOException {
                files.add(path);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult preVisitDirectory(Path path, BasicFileAttributes arg1) 
                    throws IOException {
                return path.equals(dir) ? FileVisitResult.CONTINUE : FileVisitResult.SKIP_SUBTREE;
            }
        });
    }

    return files;
}

@Benchmark
public List<Path> path_find() throws Exception {
    final List<Path> files = new ArrayList<>(1000);

    for( int i = 0; i < nCycles; i++ ) {
        Path dir = Paths.get(testDir);

        files.clear();
        files.addAll(Files.find(dir, 1, (path, attrs) 
                -> true /*!attrs.isDirectory()*/).collect(Collectors.toList()));
    }

    return files;
}
Xtra Coder
5

You could hack it if the 150k files all (or a significant number of them) follow a similar naming convention like:

*.jpg
*Out.txt

and only actually create File objects for the ones you are unsure about being a folder.

Hardwareguy
  • This wouldn't help would it? Instead of testing each file in the FilenameFilter for isDirectory(), I would be testing for isNameSimilarTo("*.jpg")? – erotsppa Jun 23 '09 at 20:40
  • You would be doing some string operations which, while not fast, should be faster than creating 150k File objects and calling .isDirectory(). You'd have to take some timings to see where the real slowdown is. – Hardwareguy Jun 23 '09 at 20:47
  • This actually improved the performance in my case from 10 seconds to 1 second. In my case, I had 1000 files and 10 directories, on a network drive. Using a FilenameFilter and skipping `new File().isDirectory` for all files ending with ".jpg" did the trick! – Thomas Jacob Feb 23 '20 at 13:52
5

I don't know if the overhead of shelling out to cmd.exe would eat up the gains, but one possibility would be something like this:

...
Runtime r = Runtime.getRuntime();
// /c makes cmd.exe exit when the command finishes, so the
// output stream closes and readLine() eventually returns null
Process p = r.exec("cmd.exe /c dir /s /b /ad C:\\folder");
BufferedReader br = new BufferedReader(new InputStreamReader(p.getInputStream()));
String d;
while ((d = br.readLine()) != null) {
    System.out.println(d);
}
...
  • /s means recurse into subdirectories
  • /ad means list only directories
  • /b means bare format; combined with /s, each line is a full pathname
lavinio
  • You could even keep a `cmd.exe` process alive and pipe a command to it for each directory you want to search. – finnw Nov 06 '09 at 16:12
2

If your OS is 'stable', give JNA a try:

These are all streaming APIs. They don't force you to allocate a 150k list/array before starting the search. IMHO this is a great advantage in your scenario.

dfa
1

There is also a recursive parallel scanning approach described at http://blogs.oracle.com/adventures/entry/fast_directory_scanning. Essentially, siblings are processed in parallel. There are also encouraging performance tests there.

Bill the Lizard
dfa
1

Here's an off-the-wall solution, devoid of any testing at all. It also depends on having a filesystem that supports symbolic links. This isn't a Java solution; I suspect your problem is filesystem/OS-related, not Java-related.

Is it possible to create a parallel directory structure, with subdirectories based on the initial letters of the filenames, which then symbolically link to the real files? An illustration:

/symlinks/a/b/cde

would link to

/realfiles/abcde

(where /realfiles is where your 150,000 files reside)

You'd have to create and maintain this directory structure, and I don't have enough info to determine if that's practical. But the above would create a fast(er) index into your non-hierarchical (and slow) directory.
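A sketch of building such an index with Java 7's Files.createSymbolicLink (assumes filenames of at least two characters and a filesystem with symlink support; the names are mine):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class SymlinkIndex {
    // Places a symlink to realFile under
    // indexRoot/<first letter>/<second letter>/, mirroring the
    // /symlinks/a/b/cde -> /realfiles/abcde scheme described above.
    public static Path indexFile(Path realFile, Path indexRoot) throws IOException {
        String name = realFile.getFileName().toString();
        Path bucket = indexRoot.resolve(name.substring(0, 1))
                               .resolve(name.substring(1, 2));
        Files.createDirectories(bucket);
        Path link = bucket.resolve(name);
        Files.createSymbolicLink(link, realFile);
        return link;
    }
}
```

The maintenance job would call this once per real file, and lookups would then walk the small bucket directories instead of the flat 150,000-entry one.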

Brian Agnew
0

Maybe you could write a directory-searching program in C#/C/C++ and call it from Java through JNI. I don't know whether this would improve performance or not.

Nick
  • This is not a java problem, disk access is slow no matter what language you're using. – Hardwareguy Jun 23 '09 at 20:42
  • @Nick: It doesn't help. Java already uses native libraries to access host files. – OscarRyz Jun 23 '09 at 20:49
  • @Hardwareguy: Yes, the disk is slow. But you can make life much worse by doing more I/O than needed. In a C/UNIX environment, I can do a sequential read of the entire directory and scan the results to find the directories. A less efficient solution will do a separate I/O for each entry to find out if it's a directory. So the efficiency here depends on what Java is actually doing. – Keith Smith Jun 23 '09 at 21:06
  • If you write the directory traversal entirely in native code, you win back the time lost on the constant managed-to-native transitions of the Java solution. – akarnokd Jun 23 '09 at 21:30
0

In that case you might try a JNA solution: a platform-dependent directory traverser (FindFirst/FindNext on Windows) that exposes an iteration pattern. Also, Java 7 will have much better file system support; it's worth checking out the specs (I don't remember any specifics).

Edit: An idea: one option is to hide the slowness of the directory listing from the user's eyes. In a client-side app, you could show an animation while the listing runs to distract the user. It really depends on what else your application does besides the listing.

akarnokd
  • This won't help the underlying problem of filesystem access being slow for folders with that many files in them. This isn't a java problem. – Hardwareguy Jun 23 '09 at 20:39
  • FindFirst allows you to filter on directories explicitly. I don't know about readdir. I think most modern filesystems can leverage this. – akarnokd Jun 23 '09 at 21:28
0

Well, either JNI, or, if your deployment is constant, just run "dir" on Windows or "ls" on *nixes with the appropriate flags to list only directories (Runtime.exec()).

Yoni Roit
0

As of 2020, DirectoryStream does seem to be faster than using File.listFiles() and checking each file with isDirectory().

I learned the answer from here:

https://www.baeldung.com/java-list-directory-files

I'm using Java 1.8 on Windows 10.

user1324109
  • and here: https://stackoverflow.com/questions/5125242/java-list-only-subdirectories-from-a-directory-not-files/5125258 I think that other StackOverflow page is more current and more inclusive. – user1324109 May 27 '20 at 01:16