I am about to embark on a programming journey, which undoubtedly will end in failure and/or throwing my mouse through my Mac, but it's an interesting problem.
I want to build an app that starts at some base directory, recursively walks every file, and whenever it finds an exact duplicate, deletes it and puts a symbolic link to the original in its place. Basically a poor man's deduplication. This solves a real problem for me: I have a bunch of duplicate files on my Mac and need to free up disk space.
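To make that concrete, the directory walk I'm picturing looks roughly like this (just a sketch using Node's built-in fs module; `walk` is my own name, and it skips symlinks so the tool doesn't follow links it has already created):

```js
// Sketch only: recursively yield every regular file under baseDir.
// Symlinks are skipped so already-deduped files aren't followed.
const fs = require('fs/promises');
const path = require('path');

async function* walk(baseDir) {
  for (const entry of await fs.readdir(baseDir, { withFileTypes: true })) {
    const fullPath = path.join(baseDir, entry.name);
    if (entry.isDirectory()) {
      yield* walk(fullPath);          // recurse into subdirectories
    } else if (entry.isFile()) {      // isFile() is false for symlinks here
      yield fullPath;
    }
  }
}
```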
From what I have read, this is the strategy:
1. Loop through the tree recursively and generate a hash of each file's contents. The hash needs to be effectively collision-free, since a collision would mean deleting a file that isn't actually a duplicate. This is the first problem: which hash should I use, and how do I run the entire binary contents of each file through it? (A rough attempt is sketched after this list.)
2. Store each file's hash and full path in a key/value store. I'm thinking redis is a good fit because of its speed.
3. Iterate through the key/value store, find duplicate hashes, delete each duplicate file, create the symbolic link in its place, and flag the entry in the key/value store as a copy. (Also sketched below.)
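For step 1, here is roughly what I'm imagining for running a file's entire binary contents through a hash (just a sketch: SHA-256 is picked only as an example, `hashFile` is my own name, and it uses Node's built-in crypto and fs modules):

```js
// Sketch only: stream a file's full binary contents through SHA-256
// so large files never have to be held in memory at once.
const crypto = require('crypto');
const fs = require('fs');

function hashFile(filePath) {
  return new Promise((resolve, reject) => {
    const hash = crypto.createHash('sha256');
    const stream = fs.createReadStream(filePath);
    stream.on('error', reject);
    stream.on('data', (chunk) => hash.update(chunk));
    stream.on('end', () => resolve(hash.digest('hex')));
  });
}
```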
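For steps 2 and 3, something of this shape (again only a sketch, and it collapses the two steps into a single pass; it assumes the node-redis v4 client, the `hashFile` and `walk` sketches above, and a `dedupe` name of my own; error handling omitted):

```js
// Sketch only: the first path seen for a hash is kept as the original;
// any later path with the same hash is deleted and replaced by a
// symlink pointing at the original.
const fs = require('fs/promises');
const { createClient } = require('redis');

async function dedupe(client, filePath, fileHash) {
  // NX: only store if this hash hasn't been seen yet (returns null if it has).
  const stored = await client.set(fileHash, filePath, { NX: true });
  if (stored) return; // first occurrence, keep the real file

  const originalPath = await client.get(fileHash);
  await fs.unlink(filePath);                // delete the duplicate
  await fs.symlink(originalPath, filePath); // symlink back to the original
}

async function main() {
  const client = createClient();
  await client.connect();
  // walk() and hashFile() as sketched above, e.g.:
  // for await (const p of walk('/base/dir')) await dedupe(client, p, await hashFile(p));
  await client.quit();
}

main().catch(console.error);
```

One design question I can already see: an absolute symlink target keeps this simple, but it breaks if the original file ever moves.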
My questions therefore are:
- What hashing algorithm should I use for each file? How is this done?
- I'm planning to use node.js because it's generally fast at I/O. The problem is that node is weak at CPU-intensive work, so the hashing will probably be the bottleneck. Is that a deal-breaker, or is there a reasonable way around it (e.g. worker threads, sketched after this list)?
- What other gotchas am I missing here?
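On the CPU point from the second bullet, one mitigation I'm considering is pushing the hashing into a worker thread so the main thread keeps doing I/O. A rough single-file sketch, assuming Node's built-in worker_threads module (`hashInWorker` and the example path are mine):

```js
// Sketch only: hash files in a worker thread so the main event loop
// stays free for directory walking and other I/O.
const { Worker, isMainThread, parentPort, workerData } = require('worker_threads');

if (isMainThread) {
  // Main thread: hand a file path to a worker and await the digest.
  function hashInWorker(filePath) {
    return new Promise((resolve, reject) => {
      const worker = new Worker(__filename, { workerData: filePath });
      worker.on('message', resolve);
      worker.on('error', reject);
    });
  }

  hashInWorker('/tmp/example.bin').then(console.log).catch(console.error);
} else {
  // Worker thread: do the CPU-heavy hashing here.
  const crypto = require('crypto');
  const fs = require('fs');
  const hash = crypto.createHash('sha256');
  fs.createReadStream(workerData)
    .on('data', (chunk) => hash.update(chunk))
    .on('end', () => parentPort.postMessage(hash.digest('hex')));
}
```

In practice I'd presumably want a small pool of workers rather than one worker per file, since spawning a thread per file has its own overhead.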