5

I have a network-attached storage with around 5 million .txt files relating to around 3 million transactions; the total data size is around 3.5 TB. I have to search that location to find whether the file related to each transaction is available or not, and then produce two separate CSV reports: "available files" and "not available files". We are still on Java 6. The challenge I am facing is that, because I have to search the location recursively, each search takes around 2 minutes on average due to the huge size. I am using the Java I/O API to search recursively, like below. Is there any way I can improve the performance?

File searchFile(File location, String fileName) {
    if (location.isDirectory()) {
        File[] arr = location.listFiles();
        if (arr != null) {                         // listFiles() returns null on I/O error
            for (File f : arr) {
                File found = searchFile(f, fileName);
                if (found != null)
                    return found;
            }
        }
    } else if (location.getName().equals(fileName)) {
        return location;
    }
    return null;
}
ETO
  • When you are looping through such a big number of files, recursion is very bad... it adds loads of overhead for the JVM... but then again, it depends on how deep your directory structure is as well... – Ketan Nov 19 '18 at 05:00
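For reference, here is a minimal iterative variant of the search that replaces the recursion with an explicit stack, illustrating the comment's point (this sketch is not from the original post; ArrayDeque has been available since Java 6). It avoids deep call stacks, though each lookup still visits every file:

import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;

File searchFileIteratively(File root, String fileName) {
    Deque<File> stack = new ArrayDeque<File>();
    stack.push(root);
    while (!stack.isEmpty()) {
        File current = stack.pop();
        if (current.isDirectory()) {
            File[] children = current.listFiles();
            if (children != null) {                // listFiles() returns null on I/O error
                for (File child : children) {
                    stack.push(child);
                }
            }
        } else if (current.getName().equals(fileName)) {
            return current;
        }
    }
    return null;
}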

4 Answers

3

You should take a different approach: rather than walking the entire directory tree every time you search for a file, build an index once, i.e. a mapping from filename to file location.

Essentially:

void buildIndex(Map<String, File> index, File location) {
    if (location.isDirectory()) {
        File[] arr = location.listFiles();
        if (arr != null) {                         // listFiles() returns null on I/O error
            for (File f : arr) {
                buildIndex(index, f);
            }
        }
    } else {
        index.put(location.getName(), location);
    }
}

Now that you've got the index, searching for the files becomes trivial.
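For example, with the index built above, checking a single file is just a map access instead of a full directory walk (a sketch; fileName is whatever transaction file you are looking for):

File found = index.get(fileName);   // average O(1); null means "not available"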

Now that the files are in a Map, you can even use set operations to find the intersection with your transaction list:

Map<String, File> index = new HashMap<String, File>();
buildIndex(index, ...);
Set<String> fileSet = index.keySet();
Set<String> transactionSet = ...;
// Copy first: calling retainAll on index.keySet() directly would remove entries from the map.
Set<String> intersection = new HashSet<String>(fileSet);
intersection.retainAll(transactionSet);   // names of transactions whose files exist

Optionally, if the index itself is too big to keep in memory, you may want to create the index in an SQLite database.
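For illustration, a minimal sketch of such an on-disk index over JDBC (assuming an SQLite JDBC driver such as xerial's sqlite-jdbc is on the classpath; the class name and database file name here are made up):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class SqliteFileIndex {
    private final Connection conn;

    public SqliteFileIndex(String dbPath) throws SQLException, ClassNotFoundException {
        Class.forName("org.sqlite.JDBC");          // register the driver (needed on older JVMs)
        conn = DriverManager.getConnection("jdbc:sqlite:" + dbPath);
        Statement stmt = conn.createStatement();
        // PRIMARY KEY on name gives an index for fast lookups.
        stmt.execute("CREATE TABLE IF NOT EXISTS files (name TEXT PRIMARY KEY, path TEXT)");
        stmt.close();
    }

    public void add(String name, String path) throws SQLException {
        PreparedStatement ps = conn.prepareStatement(
                "INSERT OR REPLACE INTO files (name, path) VALUES (?, ?)");
        ps.setString(1, name);
        ps.setString(2, path);
        ps.executeUpdate();
        ps.close();
    }

    public String lookup(String name) throws SQLException {
        PreparedStatement ps = conn.prepareStatement("SELECT path FROM files WHERE name = ?");
        ps.setString(1, name);
        ResultSet rs = ps.executeQuery();
        String path = rs.next() ? rs.getString(1) : null;   // null means "not available"
        rs.close();
        ps.close();
        return path;
    }
}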

Lie Ryan
  • You are right, and in this case the index itself will grow too big... so it is better to divide the index on various criteria, the simplest being by the first character of the file name. Long back I asked a question about removing duplicates, but the core logic suggested by various people might be applicable... https://stackoverflow.com/questions/12501112/how-to-remove-duplicate-words-using-java-when-words-are-more-than-200-million – Ketan Nov 19 '18 at 04:57
  • @Ketan: I'd suggest not bothering with tricks like splitting the index by first character or the like; just let SQLite deal with managing the data. That said, 5 million filenames is a tiny amount of data: at most around a gigabyte (likely much less), which is still fairly small on modern machines. You'll likely never need anything other than a Map unless your data is a couple of orders of magnitude larger than this. – Lie Ryan Nov 19 '18 at 07:35
  • SQLite is not an option on my plate. I was able to reduce the time by creating an index, but since the large volume of data is stored on a shared drive, it still takes time to list all the files to build the index Map. Thanks a lot for the hints on indexing! – Samarjit Baruah Nov 19 '18 at 21:01
1
Searching a directory on network-attached storage is a nightmare; it takes a lot of time when the directory is too big or too deep. As you are on Java 6, you can follow an old-fashioned approach: list all the files into a CSV file like below, e.g.

    find . -type f -name '*.txt' >> test.csv (if Unix)

    dir /b/s *.txt > test.csv (if Windows)

Now load this CSV file into a Map keyed by filename to serve as your index. Loading the file will take some time, as it will be huge, but once loaded, searching the map (by file name) will be much quicker and will reduce your search time drastically; a sketch of the loading step follows.
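For illustration, a minimal sketch of that loading step (assuming one absolute path per line in test.csv; the class and method names are made up, Java 6 syntax):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CsvIndexLoader {

    // Reads one path per line and keys the map by the bare file name.
    static Map<String, String> loadIndex(File csv) throws IOException {
        Map<String, String> index = new HashMap<String, String>();
        BufferedReader reader = new BufferedReader(new FileReader(csv));
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                String path = line.trim();
                if (path.length() > 0) {
                    index.put(new File(path).getName(), path);
                }
            }
        } finally {
            reader.close();
        }
        return index;
    }
}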
utpal416
  • Thanks for the excellent idea. Getting the list of files as a CSV using the above command and then processing it locally was really helpful. After that, I generated a HashMap from the CSV and did all my report processing. Now I am able to generate the reports in far less time. You saved my day. Cheers!! – Samarjit Baruah Nov 20 '18 at 15:10
0

You can use NIO's FileVisitor, available in Java 6.

import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

Path findTransactionFile(Path root) throws IOException {
    // An anonymous class cannot write to a local variable,
    // so use a one-element array as a mutable result holder.
    final Path[] transactionFile = new Path[1];
    Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
        @Override
        public FileVisitResult preVisitDirectory(Path dir, BasicFileAttributes attrs) throws IOException {
            if (/* todo dir predicate*/ false) {
                return FileVisitResult.SKIP_SUBTREE; // optimization: prune subtrees that cannot match
            }
            return FileVisitResult.CONTINUE;
        }

        @Override
        public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
            if (/* todo file predicate*/ true) {
                transactionFile[0] = file;
                return FileVisitResult.TERMINATE; // found
            }
            return FileVisitResult.CONTINUE;
        }
    });

    return transactionFile[0];
}
Jazzwave06
  • I think I'm wrong and it's available since Java 7; you can use Apache VFS for similar behaviour on older Java versions. – Jazzwave06 Nov 18 '18 at 23:58
0

I don't know the answer, but from an algorithmic perspective, your program has the worst possible complexity: each lookup for a single transaction iterates over all 5 million files, and you have 3 million transactions, which works out to roughly 15 trillion file-name comparisons in the worst case.

My suggestion is to iterate the files once (all 5 million) and build up an index keyed by file name, then iterate the transactions and search the index instead of doing a full scan each time. Alternatively, there might be free third-party tools that can index a large file system and expose that index to an external application (in this case, your Java app). If you cannot find such a tool, it is better to build it yourself; then you can structure the index in the optimal way for your requirement. Once the index exists, generating the two reports is a single pass, as sketched below.
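For illustration, a minimal sketch of that final pass (the index here is a Map from file name to path, built by any of the approaches above; the class, method, and output file names are made up):

import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.Collection;
import java.util.Map;

public class ReportWriter {

    // Writes one CSV line per transaction file: found ones with their path, missing ones by name only.
    static void writeReports(Map<String, String> index,
                             Collection<String> transactionFiles) throws IOException {
        PrintWriter available = new PrintWriter(new FileWriter("available_files.csv"));
        PrintWriter missing = new PrintWriter(new FileWriter("not_available_files.csv"));
        try {
            for (String name : transactionFiles) {
                String path = index.get(name);
                if (path != null) {
                    available.println(name + "," + path);
                } else {
                    missing.println(name);
                }
            }
        } finally {
            available.close();
            missing.close();
        }
    }
}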

hunter