I'm working on a program that manages backups.
For this I wrote a method that hashes (MD5) every file on the disk being inspected, so that I can detect copies and inform the user about them. I used the Apache library as described here.
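The hash method boils down to roughly this (a simplified sketch using Commons Codec's DigestUtils; my real class and error handling differ slightly):

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.commons.codec.digest.DigestUtils;

public class Hasher {
    // Streams the file through MD5 and returns the digest as a hex string,
    // so memory use stays constant regardless of file size.
    public String hash(String path) {
        try (InputStream in = new FileInputStream(path)) {
            return DigestUtils.md5Hex(in);
        } catch (IOException e) {
            return null; // error handling simplified for this sketch
        }
    }
}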
The problem is that the program has to handle large amounts of data of many different types (videos, music, letters, anything you might want to back up), so the hashing time can become very long (I timed the hash of a large 1.6 GB video: it takes nearly 25 seconds, i.e. about 65 MB/s).
So you can imagine how long it would take to hash hundreds of gigabytes...
I have already tried to split the work across threads, hashing several files at the "same" time. Here is my run() method:
public void run() {
    running = true;
    while (running) { // runs a single pass: running is cleared at the end
        System.out.println("Thread: " + this.getName() + " starting");
        files = path.listFiles();
        if (files != null) {
            for (File file : files) {
                if (file.isDirectory()) {
                    // Recurse by spawning a new thread per subdirectory.
                    System.out.println(file.getName());
                    dtm.countDirectory();
                    DetectorThread dt = new DetectorThread(dtm, file);
                    dt.start();
                    dtm.resetTimer();
                } else if (file.isFile()) {
                    // Hash regular files in this thread and register the result.
                    String hash = h.hash(file.getAbsolutePath());
                    System.out.println(file.getName() + "\t" + hash);
                    dtm.countFile();
                    dtm.addFile(file, hash);
                    dtm.resetTimer();
                }
            }
        }
        dtm.resetTimer();
        running = false;
        System.out.println("Thread: " + this.getName() + " terminated");
    }
}
You give the thread a path, and it starts another thread for each subfolder.
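For completeness, the scan is kicked off roughly like this (the manager class name and the path below are placeholders for my real ones):

import java.io.File;

public class Launcher {
    public static void main(String[] args) {
        // DetectorThreadManager and the path are illustrative placeholders.
        DetectorThreadManager dtm = new DetectorThreadManager();
        new DetectorThread(dtm, new File("/path/to/backup")).start();
    }
}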
With this code I ended up with 35 minutes of work for less than 100 GB, so I wonder whether there is a simpler way to find a unique ID for a file in order to detect copies (something like the size-based shortcut sketched below), or a faster way to hash, or whether I did something wrong with the threads.
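To make the "unique ID" idea concrete, this is the kind of shortcut I mean (a hypothetical sketch, not code I have written): two files of different sizes cannot be copies of each other, so you could group files by size first and only MD5-hash the groups where sizes actually collide.

import java.io.File;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SizePreFilter {
    // Hypothetical pre-filter: a file with a unique size cannot have a copy,
    // so only size-colliding groups would need the expensive MD5 pass.
    public static Map<Long, List<File>> groupBySize(List<File> allFiles) {
        Map<Long, List<File>> bySize = new HashMap<>();
        for (File f : allFiles) {
            List<File> group = bySize.get(f.length());
            if (group == null) {
                group = new ArrayList<>();
                bySize.put(f.length(), group);
            }
            group.add(f);
        }
        // Keep only the groups where a size collision actually occurred.
        bySize.values().removeIf(group -> group.size() < 2);
        return bySize;
    }
}

With real-world data, most files would then fall into a size group of one and never need to be hashed at all.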
Any idea that would help speed up this process is welcome.
Thank you in advance.
PS: My computer isn't bad, so it's not a matter of hardware performance.