
I've written the following program as a quick experiment to deduplicate files using their MD5 hash:

import java.nio.file.{Files, Paths}
import java.security.MessageDigest

object Test {

  def main(args: Array[String]): Unit = {

    val startTime = System.currentTimeMillis()
    val byteArray = Files.readAllBytes(Paths.get("/Users/amir/pgns/bigPGN.pgn"))
    val endTime = System.currentTimeMillis()
    println("Read " + byteArray.length + " bytes in " + (endTime - startTime) + " ms")

    val startTimeHash = System.currentTimeMillis()
    val hash = MessageDigest.getInstance("MD5").digest(byteArray)
    val endTimeHash = System.currentTimeMillis()
    println("Hashed file into " + hash.map("%02x".format(_)).mkString + " in " + (endTimeHash - startTimeHash) + " ms")
  }
}

and I'm noticing that when my PGN file is about 1.5 GB of text data, it takes about 2.5 seconds to read the file and about 2.5 seconds to hash it.

My question is, is there a faster way to do this if I have a large number of files?

Amir Afghani

1 Answer


Yes, there is: don't read the whole file into memory! Here is something that should, in theory, be faster, although I don't have any giant files to test it on:

import java.security.{MessageDigest, DigestInputStream}
import java.io.{File, FileInputStream}

// Compute the MD5 hash of a file.
// The output of this function should match the output of running "md5 -q <file>"
def computeHash(path: String): String = {
  val buffer = new Array[Byte](8192)
  val md5 = MessageDigest.getInstance("MD5")

  val dis = new DigestInputStream(new FileInputStream(new File(path)), md5)
  // Each read pushes the bytes through the digest, so the loop body stays empty
  try { while (dis.read(buffer) != -1) {} } finally { dis.close() }

  // Render the 16-byte digest as lowercase hex
  md5.digest.map("%02x".format(_)).mkString
}

If everything behaves as I think it should, this avoids holding all of the bytes in memory: as chunks are read, they are fed straight into the hash. Note that you can increase the buffer size to make things go faster.
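
Since the question was about deduplicating a large number of files, here is a minimal sketch of how `computeHash` might be applied across a directory. The `findDuplicates` helper and the directory-walking code are illustrative additions, not part of the answer above:

import java.io.File

// Hypothetical helper: hash every regular file in a directory and group the
// paths by digest. Any group with more than one path holds files whose
// contents are byte-for-byte identical.
def findDuplicates(dir: String): Map[String, List[String]] = {
  // listFiles returns null for a nonexistent or unreadable directory
  val files = Option(new File(dir).listFiles).map(_.toList).getOrElse(Nil).filter(_.isFile)
  files.map(_.getPath).groupBy(computeHash).filter { case (_, paths) => paths.size > 1 }
}

// Example usage: findDuplicates("/Users/amir/pgns").values.foreach(println)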

Alec
  • This approach seems to cut down the total time by 50% on my box! Cool, thanks. – Amir Afghani Jan 13 '17 at 22:13
  • Is there a reason for `8192`? Can I improve it by matching disk cluster size? Or maybe to match some MD5 internal "data structures"? – icl7126 Jun 15 '18 at 08:22
  • @icl7126 you can set the buffer size to whatever you want; the sketch after these comments shows one way to measure the effect. – Alec Jun 15 '18 at 08:24
  • Hi @Alec, is there a chance the while loop could run indefinitely? – Subodh Bisht Sep 28 '20 at 15:04
  • @Subodh I don't think so if your file is finite, but I can imagine it blocking and hanging, since `dis.read(buffer)` can block (see the Javadoc for details and https://stackoverflow.com/questions/15218750/when-does-fileinputstream-read-block for concrete examples). The best way to find out might be to actually `println` something in the `while` loop: you'll see whether it is blocking, as I hypothesize, or actually looping around indefinitely. – Alec Sep 28 '20 at 15:09
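
Regarding the buffer-size question in the comments, here is a minimal, hypothetical harness for measuring the effect yourself; the candidate sizes and the file path are assumptions, not recommendations:

import java.security.{MessageDigest, DigestInputStream}
import java.io.{File, FileInputStream}

// Hash the same file with different buffer sizes and report the elapsed
// wall-clock time for each. Results vary with disk, OS cache state, and JVM
// warmup, so treat a single run as a rough indication only.
def timeHash(path: String, bufferSize: Int): Long = {
  val buffer = new Array[Byte](bufferSize)
  val md5 = MessageDigest.getInstance("MD5")
  val dis = new DigestInputStream(new FileInputStream(new File(path)), md5)
  val start = System.currentTimeMillis()
  try { while (dis.read(buffer) != -1) {} } finally { dis.close() }
  System.currentTimeMillis() - start
}

for (size <- List(4096, 8192, 65536, 1 << 20))
  println(s"buffer size $size: ${timeHash("/Users/amir/pgns/bigPGN.pgn", size)} ms")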