If you know the target OS set in advance (which is usually the case), the fastest way to list that many files is to go through the shell, invoking a process e.g. with Runtime.exec.
On Windows you can use
dir /s /b
On Linux
ls -R -1
You can detect the OS at runtime and use the appropriate command (raising an error, or falling back to a directory stream, if the OS is not supported).
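A minimal sketch of that OS dispatch might look like this (the class and helper names are illustrative, not from the original):

```java
import java.util.Arrays;

public class ListCommand {
    // Returns the shell command line for a recursive bare listing,
    // chosen from the os.name system property.
    static String[] listCommand(String osName) {
        String os = osName.toLowerCase();
        if (os.contains("win"))
            return new String[] { "cmd", "/c", "dir /s /b" };
        if (os.contains("nux") || os.contains("nix") || os.contains("mac"))
            return new String[] { "sh", "-c", "ls -R -1" };
        throw new UnsupportedOperationException("unsupported OS: " + osName);
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(
                listCommand(System.getProperty("os.name"))));
    }
}
```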
If you want simplicity and don't need to report progress, you can avoid dealing with the process I/O and store the listing in a temporary file, e.g. ls -R -1 > /tmp/filelist.txt. Alternatively, you can read from the process output directly, using a buffered stream or reader with a large enough buffer.
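Reading the process output directly could look roughly like this (a sketch assuming a POSIX ls; substitute the dir command on Windows):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class StreamListing {
    // Streams the recursive listing straight from the process stdout,
    // without going through a temporary file.
    static List<String> listFiles(String dir)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder("ls", "-R", "-1", dir).start();
        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(p.getInputStream(), StandardCharsets.UTF_8),
                1024 * 1024)) {
            for (String line = reader.readLine(); line != null;
                    line = reader.readLine())
                lines.add(line);
        }
        if (p.waitFor() != 0)
            throw new IOException("listing command failed");
        return lines;
    }

    public static void main(String[] args) throws Exception {
        // demo on a throwaway directory with two files
        Path tmp = Files.createTempDirectory("listdemo");
        Files.createFile(tmp.resolve("a.txt"));
        Files.createFile(tmp.resolve("b.txt"));
        List<String> lines = listFiles(tmp.toString());
        System.out.println(lines.size() + " lines (including the directory header)");
    }
}
```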
On an SSD it will complete in the blink of an eye, and on a modern HDD in seconds (half a million files is not a problem with this approach).
Once you have the list, you can approach it differently depending on the maximum file count and memory requirements. If the requirements are loose, e.g. for a desktop program, very simple code will do: pre-load the complete file list into a HashSet and check existence when needed. Shortening the paths by removing the common root requires much less memory. You can also reduce memory by keeping only a hash of each name instead of the full name (removing the common root will probably save more).
Or you can optimize further if you wish; the question now reduces to checking the existence of a string in a set of strings held in memory or in a file, which has many well-known efficient solutions.
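For instance, the common-root trick above can be sketched like this (the class and method names are illustrative; the root value is an assumption):

```java
import java.util.HashSet;
import java.util.Set;

public class PathSet {
    private final String root;                        // common root, e.g. "T:\\data\\"
    private final Set<String> relative = new HashSet<>();

    PathSet(String root) {
        this.root = root;
    }

    // Stores only the part after the common root, saving per-entry memory.
    void add(String absolutePath) {
        relative.add(strip(absolutePath));
    }

    boolean contains(String absolutePath) {
        return relative.contains(strip(absolutePath));
    }

    private String strip(String path) {
        return path.startsWith(root) ? path.substring(root.length()) : path;
    }

    public static void main(String[] args) {
        PathSet set = new PathSet("T:\\data\\");
        set.add("T:\\data\\reports\\2014\\jan.csv");
        System.out.println(set.contains("T:\\data\\reports\\2014\\jan.csv")); // true
        System.out.println(set.contains("T:\\data\\missing.csv"));            // false
    }
}
```

Note that the memory saving from substring relies on Java 7u6 or later, where String.substring copies the characters instead of sharing the original backing array.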
Below is a very loose, simplistic sample for Windows. It executes dir on an HDD (not SSD) drive root with ~400K files, reads the list, and benchmarks (well, kind of) time and memory for the string-set and MD5-set approaches:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashSet;
import java.util.Set;

public static void main(String args[]) throws Exception {
    final Runtime rt = Runtime.getRuntime();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
            / (1024 * 1024) + " Mb");
    long time = System.currentTimeMillis();
    // windows command: cd to t:\ and run recursive dir
    Process p = rt.exec("cmd /c \"t: & dir /s /b > filelist.txt\"");
    if (p.waitFor() != 0)
        throw new Exception("command has failed");
    System.out.println("done executing shell, took "
            + (System.currentTimeMillis() - time) + "ms");
    System.out.println();

    File f = new File("T:/filelist.txt");
    // load the complete listing into a hash set
    time = System.currentTimeMillis();
    Set<String> fileNames = new HashSet<String>(500000);
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
            new FileInputStream(f), StandardCharsets.UTF_8), 50 * 1024 * 1024)) {
        for (String line = reader.readLine(); line != null; line = reader.readLine()) {
            fileNames.add(line);
        }
    }
    System.out.println(fileNames.size() + " file names loaded took "
            + (System.currentTimeMillis() - time) + "ms");
    System.gc();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
            / (1024 * 1024) + " Mb");

    time = System.currentTimeMillis();
    // check random names that are (almost surely) absent
    for (int i = 0; i < 70_000; i++) {
        StringBuilder fileToCheck = new StringBuilder();
        while (fileToCheck.length() < 256)
            fileToCheck.append(Double.toString(Math.random()));
        if (fileNames.contains(fileToCheck.toString()))
            System.out.println("to prevent optimization, never executes");
    }
    System.out.println();
    System.out.println("hash set 70K checks took "
            + (System.currentTimeMillis() - time) + "ms");
    System.gc();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
            / (1024 * 1024) + " Mb");

    // test memory/performance with an MD5 hash set instead of full names
    time = System.currentTimeMillis();
    Set<String> nameHashes = new HashSet<String>(500000);
    MessageDigest md5 = MessageDigest.getInstance("MD5");
    for (String name : fileNames) {
        // ISO-8859-1 maps bytes to chars losslessly; decoding arbitrary
        // digest bytes as UTF-8 would corrupt them and could cause collisions
        String nameMd5 = new String(md5.digest(name
                .getBytes(StandardCharsets.UTF_8)), StandardCharsets.ISO_8859_1);
        nameHashes.add(nameMd5);
    }
    System.out.println();
    System.out.println(fileNames.size() + " md5 hashes created, took "
            + (System.currentTimeMillis() - time) + "ms");
    fileNames.clear();
    fileNames = null;
    System.gc();
    Thread.sleep(100);
    System.gc();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
            / (1024 * 1024) + " Mb");

    time = System.currentTimeMillis();
    // check random names against the md5 set
    for (int i = 0; i < 70_000; i++) {
        StringBuilder fileToCheck = new StringBuilder();
        while (fileToCheck.length() < 256)
            fileToCheck.append(Double.toString(Math.random()));
        String md5ToCheck = new String(md5.digest(fileToCheck.toString()
                .getBytes(StandardCharsets.UTF_8)), StandardCharsets.ISO_8859_1);
        if (nameHashes.contains(md5ToCheck))
            System.out.println("to prevent optimization, never executes");
    }
    System.out.println("md5 hash set 70K checks took "
            + (System.currentTimeMillis() - time) + "ms");
    System.gc();
    System.out.println("mem " + (rt.totalMemory() - rt.freeMemory())
            / (1024 * 1024) + " Mb");
}
Output:
mem 3 Mb
done executing shell, took 5686ms
403108 file names loaded took 382ms
mem 117 Mb
hash set 70K checks took 283ms
mem 117 Mb
403108 md5 hashes created, took 486ms
mem 52 Mb
md5 hash set 70K checks took 366ms
mem 48 Mb