What's the easiest way to compute a hash of a directory in Linux (preferably using shell scripting or Python)?

What I'm trying to do is find duplicate subtrees within a large tree of directories.

Tools such as fdupes and meld tend to expect the two trees to be largely isomorphic, i.e. given

A
└─ B

and

A
└─ C
   └─ B

they won't alert me that B is the same in both trees, because in the second tree it is under C.

So I'm guessing I need to write my own script to recurse down both trees and find hashes of all subtrees and then compare them.

AlastairG
interstar
  • By "same", do you mean the same inode or the same directory name? – hansaplast Jan 24 '17 at 16:14
  • Do you just want to find directories with the same names? – Cyrbil Jan 24 '17 at 16:16
  • I mean "the same" as in all the files within them (including, recursively, all files in all subdirectories) are identical. This is for deduping purposes. – interstar Jan 24 '17 at 16:43
  • Duplicate of: [Linux: compute a single hash for a given folder & contents?](http://stackoverflow.com/q/545387/950485) – rld. Jan 25 '17 at 05:31
  • @rld I need to hash the contents of the files as well as their names. – interstar Jan 25 '17 at 12:37
  • Could you comment on the accepted answer to point out the version you ended up using, for future reference? – rld. Mar 08 '17 at 06:10
  • @rld. Basically I accepted it for the idea of doing a hash of the result of running "find" on each directory. I'm not using any of the suggestions exactly, but I'm constructing a solution (still tweaking it) based on that idea. – interstar Mar 09 '17 at 21:45

1 Answer

Hash a directory's structure using filenames only

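List all file paths under the directory recursively, relative to the directory itself (so that identical subtrees at different locations produce the same hash), sort them (find does not guarantee any particular output order), and hash the sorted list with sha1sum:

(cd /my/dir && find . -mindepth 1 -type f -print0 | sort -z | sha1sum)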

You can put that in a script, like:

#!/bin/bash
# hashtree-names.sh - hash a dir's structure by filenames
# (files with same names are considered identical)
# Usage: hashtree-names.sh <dirname>
DIR=$1
# cd first so the hashed paths are relative to DIR, not to the caller's location
cd "$DIR" || exit 1
find . -mindepth 1 -type f -print0 | sort -z | sha1sum

And execute it on every dir under a large tree like so:

find /my/tree -mindepth 1 -type d -exec hashtree-names.sh {} \; | sort

Which will produce output similar to:

3cd8fea391f3055d9de3d6e05a422b6e97ce4204 *-
8cd93d83e9baeea479785fe0cc03c8b58aa293a3 *-
8cd93d83e9baeea479785fe0cc03c8b58aa293a3 *-
fe7dd981bb0d978608ba648eb3d38bb41f6cd956 *-
afc483808be60fbd48e716a7b916b5deaa9c78b5 *-
a518cfa27e7e9afbab2ba2209c80dbab0631736b *-
251f3cfc11eeccdfaf28142dadc5aa3aa4e2aec1 *-
251f3cfc11eeccdfaf28142dadc5aa3aa4e2aec1 *-
4a689e7c27733498c4ac5730f172c844cb6b21d1 *-
600a61b8c1a973aa6322ab4a7d57f7c07174e0ec *-
a401f27520252ae334625ca1b452396f0287f42d *-
e0b2d5f825f062d40f0f2490673888b5eb6c66fd *-
85a533625c5a38892d392f2ae9e7974e3eceaf6a *-
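
The listing above contains only digests, so two matching lines cannot be traced back to the directories that produced them. As a rough sketch of the same names-only idea in Python, the block below hashes the sorted relative file paths of each directory and groups directories by digest. The function names `hash_names` and `find_duplicate_dirs` are made up for this illustration, and the digests are not meant to match the sha1sum output of the shell pipeline.

```python
import hashlib
import os
from collections import defaultdict


def hash_names(directory):
    """SHA-1 over the sorted paths (relative to directory) of all files below it."""
    rel_paths = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            full = os.path.join(root, name)
            rel_paths.append(os.path.relpath(full, directory))
    digest = hashlib.sha1()
    for rel in sorted(rel_paths):
        # NUL-terminate each path, similar in spirit to find -print0
        digest.update(os.fsencode(rel) + b"\0")
    return digest.hexdigest()


def find_duplicate_dirs(tree):
    """Return lists of directories under tree whose file-name layout is identical."""
    groups = defaultdict(list)
    for root, dirs, _files in os.walk(tree):
        for name in dirs:
            path = os.path.join(root, name)
            groups[hash_names(path)].append(path)
    # Keep only digests shared by at least two directories
    return [sorted(paths) for paths in groups.values() if len(paths) > 1]
```

Calling `find_duplicate_dirs("/my/tree")` returns one list per group of matching directories. Two caveats of this sketch: every directory that contains no files gets the same digest and is therefore reported as a duplicate, and each subtree is walked once per ancestor, which may be slow on very large trees.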

Hash a directory's structure, complete with file contents

See Vatine's and David Schmitt's answers to "Linux: compute a single hash for a given folder & contents?".
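
The linked answers are not reproduced here. As one possible way to cover file contents from Python, the sketch below computes a digest for every directory bottom-up: a file's digest is the SHA-1 of its bytes, and a directory's digest is the SHA-1 of its sorted entry names paired with their child digests. This is an illustrative approach rather than the method from those answers; the names `hash_file`, `hash_dir` and `duplicate_subtrees` are hypothetical, and skipping symlinks and special files is an assumption of the sketch.

```python
import hashlib
import os
from collections import defaultdict


def hash_file(path, chunk_size=1 << 20):
    """SHA-1 of a file's contents, read in chunks to bound memory use."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_dir(path, seen):
    """Return a digest for path and record every visited directory in seen."""
    digest = hashlib.sha1()
    for entry in sorted(os.scandir(path), key=lambda e: e.name):
        if entry.is_dir(follow_symlinks=False):
            child = "d:" + hash_dir(entry.path, seen)
        elif entry.is_file(follow_symlinks=False):
            child = "f:" + hash_file(entry.path)
        else:
            continue  # assumption: symlinks and special files are ignored
        # Mix in the entry name and its digest; the directory's own name is not used
        digest.update(os.fsencode(entry.name) + b"\0" + child.encode() + b"\0")
    result = digest.hexdigest()
    seen[result].append(path)
    return result


def duplicate_subtrees(tree):
    """Return lists of directories under tree with identical names and contents."""
    seen = defaultdict(list)
    hash_dir(tree, seen)
    return [sorted(paths) for paths in seen.values() if len(paths) > 1]
```

Because a directory's own name is left out of its digest, B under A and B under A/C compare equal whenever their contents match, and each file is read only once. Empty directories all share one digest, so they show up as a group of duplicates unless filtered out.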

EDIT 2017-01-27

  • Code improvements: Added -mindepth 1 to find
Community
rld.