I am about to embark on a programming journey, which undoubtedly will end in failure and/or throwing my mouse through my Mac, but it's an interesting problem.
I want to build an app that starts at some base directory, recursively walks every file, and whenever it finds an exact duplicate, deletes it and puts a symbolic link to the original in its place. Basically a poor man's deduplication. This solves a real problem for me: I have a bunch of duplicate files on my Mac and need to free up disk space.
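To make that concrete, the directory walk I'm picturing looks roughly like this (just a sketch using Node's built-in fs module; `walk` is my own name, and it skips symlinks so the tool doesn't follow links it has already created):

```js
// Sketch only: recursively yield every regular file under baseDir.
// Symlinks are skipped so already-deduped files aren't followed.
const fs = require('fs/promises');
const path = require('path');

async function* walk(baseDir) {
  for (const entry of await fs.readdir(baseDir, { withFileTypes: true })) {
    const fullPath = path.join(baseDir, entry.name);
    if (entry.isDirectory()) {
      yield* walk(fullPath);          // recurse into subdirectories
    } else if (entry.isFile()) {      // isFile() is false for symlinks here
      yield fullPath;
    }
  }
}
```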
From what I have read, this is the strategy:
1. Loop through the tree recursively and generate a hash of each file's contents. The hash needs to be effectively collision-free, since a collision would mean deleting a file that isn't actually a duplicate. This is the first problem: which hash should I use, and how do I run the entire binary contents of each file through it? (A rough attempt is sketched after this list.)
2. Store each file's hash and full path in a key/value store. I'm thinking redis is a good fit because of its speed.
3. Iterate through the key/value store, find duplicate hashes, delete each duplicate file, create the symbolic link in its place, and flag the entry in the key/value store as a copy. (Also sketched below.)
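For step 1, here is roughly what I'm imagining for running a file's entire binary contents through a hash (just a sketch: SHA-256 is picked only as an example, `hashFile` is my own name, and it uses Node's built-in crypto and fs modules):

```js
// Sketch only: stream a file's full binary contents through SHA-256
// so large files never have to be held in memory at once.
const crypto = require('crypto');
const fs = require('fs');

function hashFile(filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    const stream = fs.createReadStream(filePath);
    stream.on('error', reject);
    stream.on('data', (chunk) => hash.update(chunk));
    stream.on('end', () => resolve(hash.digest('hex')));
  });
}
```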
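For steps 2 and 3, something of this shape (again only a sketch, and it collapses the two steps into a single pass; it assumes the node-redis v4 client, the `hashFile` and `walk` sketches above, and a `dedupe` name of my own; error handling omitted):

```js
// Sketch only: the first path seen for a hash is kept as the original;
// any later path with the same hash is deleted and replaced by a
// symlink pointing at the original.
const fs = require('fs/promises');
const { createClient } = require('redis');

async function dedupe(client, filePath, fileHash) {
  // NX: only store if this hash hasn't been seen yet (returns null if it has).
  const stored = await client.set(fileHash, filePath, { NX: true });
  if (stored) return; // first occurrence, keep the real file

  const originalPath = await client.get(fileHash);
  await fs.unlink(filePath);                // delete the duplicate
  await fs.symlink(originalPath, filePath); // symlink back to the original
}

async function main() {
  const client = createClient();
  await client.connect();
  // walk() and hashFile() as sketched above, e.g.:
  // for await (const p of walk('/base/dir')) await dedupe(client, p, await hashFile(p));
  await client.quit();
}

main().catch(console.error);
```

One design question I can already see: an absolute symlink target keeps this simple, but it breaks if the original file ever moves.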
My questions therefore are:
- What hashing algorithm should I use for each file? How is this done?
- I'm planning to use node.js because it's generally fast at I/O. The problem is that node is weak at CPU-intensive work, so the hashing will probably be the bottleneck. Is that a deal-breaker, or is there a reasonable way around it (e.g. worker threads, sketched after this list)?
- What other gotchas am I missing here?
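On the CPU point from the second bullet, one mitigation I'm considering is pushing the hashing into a worker thread so the main thread keeps doing I/O. A rough single-file sketch, assuming Node's built-in worker_threads module (`hashInWorker` and the example path are mine):

```js
// Sketch only: hash files in a worker thread so the main event loop
// stays free for directory walking and other I/O.
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
  // Main thread: hand a file path to a worker and await the digest.
  function hashInWorker(filePath) {
    return new Promise((resolve, reject) => {
      const worker = new Worker(__filename, { workerData: filePath });
      worker.on('message', resolve);
      worker.on('error', reject);
    });
  }

  hashInWorker('/tmp/example.bin').then(console.log).catch(console.error);
} else {
  // Worker thread: do the CPU-heavy hashing here.
  const crypto = require('crypto');
  const fs = require('fs');
  const hash = crypto.createHash('sha256');
  fs.createReadStream(workerData)
    .on('data', (chunk) => hash.update(chunk))
    .on('end', () => parentPort.postMessage(hash.digest('hex')));
}
```

In practice I'd presumably want a small pool of workers rather than one worker per file, since spawning a thread per file has its own overhead.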