I've written the following program as a quick experiment to deduplicate files using their MD5 hash:
import java.nio.file.{Files, Paths}
import java.security.MessageDigest

object Test {
  def main(args: Array[String]): Unit = {
    val startTime = System.currentTimeMillis()
    val byteArray = Files.readAllBytes(Paths.get("/Users/amir/pgns/bigPGN.pgn"))
    val endTime = System.currentTimeMillis()
    // print the length rather than the array's toString
    println("Read " + byteArray.length + " bytes in " + (endTime - startTime) + " ms")

    val startTimeHash = System.currentTimeMillis()
    val hash = MessageDigest.getInstance("MD5").digest(byteArray)
    val endTimeHash = System.currentTimeMillis()
    // render the digest as hex; previously this line also reused (endTime - startTime)
    println("Hashed file into " + hash.map("%02x".format(_)).mkString + " in " + (endTimeHash - startTimeHash) + " ms")
  }
}
and I'm noticing that when my PGN file is about 1.5 GB of text data, it takes about 2.5 seconds to read the file and another 2.5 seconds to hash it.
My question is: is there a faster way to do this if I have a large number of files?
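One variation I'm considering (a sketch, not something I've benchmarked on the big file yet) is to stream the bytes through `java.security.DigestInputStream` instead of calling `Files.readAllBytes`, so a 1.5 GB file never has to sit in memory all at once:

```scala
import java.io.{ByteArrayInputStream, InputStream}
import java.security.{DigestInputStream, MessageDigest}

object StreamHash {
  // Hash an InputStream in 1 MB chunks; the digest is updated as a side
  // effect of reading, so the whole file is never buffered in memory.
  def md5Hex(in: InputStream): String = {
    val md  = MessageDigest.getInstance("MD5")
    val dis = new DigestInputStream(in, md)
    val buf = new Array[Byte](1 << 20)
    try {
      while (dis.read(buf) != -1) {} // reading drives the digest forward
    } finally {
      dis.close()
    }
    md.digest().map("%02x".format(_)).mkString
  }
}
```

For a file this would be called with something like `StreamHash.md5Hex(Files.newInputStream(Paths.get(path)))`. I don't expect the hashing itself to get faster this way, but it caps memory use per file, which seems relevant once many files are processed (possibly concurrently).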