
I'm working on a program that manages backups.

For this, I wrote a method that hashes (MD5) every file on the disk being inspected, so that copies can be detected and the user informed about them. I used the Apache library as described here.
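
Roughly, the hashing helper behind the h.hash(...) call further down looks like this (a sketch rather than my exact code, assuming Commons Codec's DigestUtils; the Hasher class name is just for illustration):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

import org.apache.commons.codec.digest.DigestUtils;

public class Hasher {

    // Stream the file through an MD5 digest and return it as a hex string.
    public String hash(String path) {
        try (InputStream in = new FileInputStream(path)) {
            return DigestUtils.md5Hex(in); // reads the file in chunks, not all at once
        } catch (IOException e) {
            throw new RuntimeException("Could not hash " + path, e);
        }
    }
}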

The problem is that the program has to handle large amounts of data of many different types (videos, music, letters, everything you may want to back up), so the hashing time can become very long (I timed the hash of a large 1.6 GB video: it took nearly 25 seconds).

So you can imagine the time it would take to hash hundreds of Gigs...

I already tried to split the work across threads, hashing many files at the "same" time. Here is my run() method:

public void run() {
    running = true;
    while (running) { // the body runs exactly once: running is set back to false at the end
        System.out.println("Thread: " + this.getName() + " starting");
        files = path.listFiles();
        if (files != null) {
            for (File file : files) {
                if (file.isDirectory()) {
                    // subfolder: hand it off to a brand new thread
                    System.out.println(file.getName());
                    dtm.countDirectory();
                    DetectorThread dt = new DetectorThread(dtm, file);
                    dt.start();
                    dtm.resetTimer();
                } else if (file.isFile()) {
                    // regular file: hash it in this thread and register the result
                    String hash = h.hash(file.getAbsolutePath());
                    System.out.println(file.getName() + "\t" + hash);
                    dtm.countFile();
                    dtm.addFile(file, hash);
                    dtm.resetTimer();
                }
            }
        }
        dtm.resetTimer();
        running = false;
        System.out.println("Thread: " + this.getName() + " terminated");
    }
}

You give the thread a path and it will start another thread for each subfolder.

With this code I ended up with 35 minutes of work for less than 100 GB, so I wonder whether there is a simpler way to find a unique ID for a file in order to detect copies, a faster way to hash, or whether I did something wrong with the threads.

Any idea that would help speed up this processing is welcome.

Thank you in advance.

PS: My computer isn't bad, so it's not a matter of raw machine performance.

Axel S.
  • I didn't know a video *file* had a weight, and how heavy is a "Go"? Did you mean "a *large* video of 1.6 Gb"? – Andreas Sep 12 '15 at 19:21
  • Sorry for my bad English, I edited the question. – Axel S. Sep 12 '15 at 19:26
  • rsync uses a combination of a fast rolling checksum (Adler32 afaik) and a slower MD4 block hash - the technical report is at https://rsync.samba.org/tech_report. – tonys Sep 12 '15 at 19:29

2 Answers


There is really no need to hash everything.

Start by looking at the file size. If no other file has the same size, your check is complete and you don't waste time scanning the entire file to hash it.

Large files are very likely unique in size, so you likely end up only hashing some of the smaller files.
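
A minimal sketch of that idea (the Hasher / h.hash call stands in for your existing MD5 code; class and method names here are just for illustration):

import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SizeFilter {

    // Group every file by its length; only groups with more than one file can contain duplicates.
    public static Map<Long, List<File>> groupBySize(List<File> allFiles) {
        Map<Long, List<File>> bySize = new HashMap<>();
        for (File f : allFiles) {
            bySize.computeIfAbsent(f.length(), k -> new ArrayList<>()).add(f);
        }
        return bySize;
    }

    // Hash only the files whose size occurs more than once.
    public static void hashCandidates(Map<Long, List<File>> bySize, Hasher h) {
        for (List<File> sameSize : bySize.values()) {
            if (sameSize.size() < 2) {
                continue; // unique size: this file cannot be a copy, skip the expensive hash
            }
            for (File f : sameSize) {
                String hash = h.hash(f.getAbsolutePath());
                System.out.println(f.getName() + "\t" + hash);
                // ...collect hashes within this group and report files that share one
            }
        }
    }
}

Only the groups that survive the size filter pay the cost of a full read.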


FYI: Your performance is very likely entirely disk bound, meaning the multi-threaded code is spending most of its time waiting for the hard disk to return data.

You can confirm this by monitoring the system. The hard disk light will stay on (it won't blink like it usually does), and the CPU will be idle.

The only way to be faster is to read less.

Andreas
  • Ok, I will try to modify the method, but there are some situations where it may not work. If the user wants to manage series, the files will have nearly the same size. Is that really a problem, or did I misunderstand your solution? – Axel S. Sep 12 '15 at 19:30
  • *Nearly* same size is not *exactly* same size. If they differ in length by a single byte, *they cannot be the same file*. – Andreas Sep 12 '15 at 19:31

It seems to me that this code will create too many threads. Each thread creation has a relatively high cost.

In addition, too many threads reading files at the same time will result in inefficient I/O: when one thread reads a chunk of data, the system usually loads a full block into the cache to speed up upcoming accesses. When many threads read big blocks simultaneously, the system keeps discarding these caches, forcing extra disk accesses.

A quick and easy fix would be to use a thread pool, limiting the number of running threads to a fixed number. The ideal number will probably be around your number of CPU cores. Your DetectorThread would then implement Callable.

You'll face another issue if most big files are stored in a limited number of directories: a single thread would have to process them all sequentially. It's probably best to have one single thread recursively scan the directories, creating a Callable for each file, as in the sketch below.
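
A sketch of what that could look like, assuming a Hasher class wrapping your existing MD5 call (everything here is illustrative, not a drop-in replacement for your DetectorThread):

import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PooledDetector {

    private final Hasher h = new Hasher(); // stands in for your existing MD5 helper

    public void detect(File root) throws InterruptedException, ExecutionException {
        // One fixed-size pool, roughly one worker per core, instead of one thread per directory.
        int workers = Runtime.getRuntime().availableProcessors();
        ExecutorService pool = Executors.newFixedThreadPool(workers);

        List<Future<String>> results = new ArrayList<>();
        scan(root, pool, results); // a single thread walks the tree; the pool does the hashing

        for (Future<String> result : results) {
            System.out.println(result.get()); // blocks until that file's hash is ready
        }
        pool.shutdown();
    }

    // Recursively scan directories in the calling thread, submitting one Callable per file.
    private void scan(File dir, ExecutorService pool, List<Future<String>> results) {
        File[] files = dir.listFiles();
        if (files == null) {
            return;
        }
        for (File file : files) {
            if (file.isDirectory()) {
                scan(file, pool, results);
            } else if (file.isFile()) {
                results.add(pool.submit(
                        () -> file.getName() + "\t" + h.hash(file.getAbsolutePath())));
            }
        }
    }
}

This way the disk only ever sees at most "workers" concurrent reads, and no thread is created per directory.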

Benoît