
My current way of generating MD5 hashes of all files under a root directory, up to a given depth, is shown below.

As of now, it takes about 10 seconds (on an old Intel Core i3 CPU) to process approximately 300 images, each 5-10 MB in size on average. The parallel option on the stream does not help: with or without it, the time remains more or less the same. How can I make this faster?

Files.walk(Path.of(rootDir), depth)
            .parallel() // doesn't help, time approximately the same as without parallel
            .filter(path -> !Files.isDirectory(path)) // skip directories
            .map(FileHash::getHash)
            .collect(Collectors.toList());

The getHash method used above returns a comma-separated hash,<full file path> output line for each file being processed in the stream.

public static String getHash(Path path) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      md5.update(Files.readAllBytes(path));
      byte[] digest = md5.digest();
      String hash = DatatypeConverter.printHexBinary(digest); // printHexBinary already returns upper-case hex
      return String.format("%s,%s", hash, path.toAbsolutePath());
    } catch (Exception e) {
      e.printStackTrace();
      return String.format(",%s", path.toAbsolutePath()); // empty hash on failure, instead of an NPE at digest()
    }
  }
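
As an aside, in case reading each whole file into memory with Files.readAllBytes is part of the problem, a chunked variant would look roughly like the sketch below (the method name getHashStreaming and the 64 KiB buffer size are illustrative choices, not part of my code above; it additionally needs java.io.InputStream imported):

public static String getHashStreaming(Path path) {
    try {
      MessageDigest md5 = MessageDigest.getInstance("MD5");
      // feed the digest in fixed-size chunks instead of one big byte[]
      try (InputStream in = Files.newInputStream(path)) {
        byte[] buffer = new byte[64 * 1024];
        int n;
        while ((n = in.read(buffer)) != -1) {
          md5.update(buffer, 0, n);
        }
      }
      String hash = DatatypeConverter.printHexBinary(md5.digest());
      return String.format("%s,%s", hash, path.toAbsolutePath());
    } catch (Exception e) {
      e.printStackTrace();
      return String.format(",%s", path.toAbsolutePath());
    }
  }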
  • Is the time consumed I/O-bound, perhaps? – Curiosa Globunznik Nov 15 '20 at 15:32
  • @NikolaiDmitriev I was thinking that since the hashing function is pure, there should be a way to run the processes in parallel? But yeah, being I/O-bound on the hard disk is a possibility, hence asking – Somjit Nov 15 '20 at 15:38

1 Answer


The stream returned by Files.walk(Path.of(rootDir), depth) cannot be parallelized efficiently: it has no known size, so it is difficult to determine slices to parallelize. In your case, to improve performance you need to first collect the result of Files.walk(...) into a list, then run the hashing on a parallel stream over that list.

So you have to do:

Files.walk(Path.of(rootDir), depth)
        .filter(path -> !Files.isDirectory(path)) // skip directories
        .collect(Collectors.toList())
        .stream()
        .parallel() // in my computer divide the time needed by 5 (8 core cpu and SSD disk)
        .map(FileHash::getHash)
        .collect(Collectors.toList());
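
Collecting into a list first gives the parallel stream a spliterator with a known, exact size, which the fork/join machinery can split evenly across threads; the stream straight out of Files.walk is built from an iterator of unknown size and splits poorly. A quick way to see the difference (illustrative snippet, assuming it runs inside a method that declares IOException, with java.util.Spliterator and java.util.List imported):

// the walk stream reports an unknown size...
Spliterator<Path> walked = Files.walk(Path.of(rootDir), depth).spliterator();
System.out.println(walked.estimateSize()); // Long.MAX_VALUE, i.e. "unknown"

// ...while the collected list reports an exact count, so it splits evenly
List<Path> paths = Files.walk(Path.of(rootDir), depth).collect(Collectors.toList());
System.out.println(paths.spliterator().estimateSize()); // the actual number of paths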
  • Fantastic! Got a 3x speed-up! Can you please give some links to help understand what happened? – Somjit Nov 15 '20 at 19:14
  • Sure, read the accepted answer in this post: https://stackoverflow.com/questions/34341656/why-is-files-list-parallel-stream-performing-so-much-slower-than-using-collect – Olivier Pellier-Cuit Nov 15 '20 at 19:15