How to diff md5 sums of two filesystem states?

Question

I'm collecting md5sum snapshots of the same filesystem at two different points in time (i.e., before and after an infection). I need to diff these two states to see which files changed between them.

To collect these states I might do the following (on macOS with SIP turned off):

sudo gfind / ! -path '*/dev/*' ! -path '*/Network/*' ! -path '*/Volumes/*' ! -path '*/.fseventsd/*' ! -path '*/.Spotlight-V100/*' -type f -exec md5sum {} \; > $(date "+%y%m%d%H%M%S").system_listing

The problem I'm having is that the resulting files are around 100MB apiece, and diff by itself seems to compare chunks of lines rather than each individual file's md5sum.

Is there an efficient way of using diff tools to do this or is it necessary to write a script to somehow compare the two files based upon filename paths, effectively recreating diff to compare lines with path as the unique comparator value and then return info based on the associated md5sum?

Shot in the dark: try sorting both state files by the path/filename column. — James Li, Aug 31 '19 at 23:49
It would be more helpful if you updated your question to show how you collected the md5sum snapshots. — James Li, Aug 31 '19 at 23:51

score 1 · Accepted Answer · answered Sep 01 '19 at 00:43

The order in which directories appear in the listing can produce a lot of diff noise.
For example, I ran the following two commands over two directories full of PDFs,
one with 1 file and the other with tens of files. Merely swapping the directory order produces 2 diff lines,
whereas we want diff to report that nothing changed.

find books/ docs-pdf/ -type f -exec md5sum {} \; > snapshot1
find docs-pdf/ books/ -type f -exec md5sum {} \; > snapshot2

diff -u snapshot1 snapshot2
--- snapshot1
+++ snapshot2
@@ -1,4 +1,3 @@
-83322cb1aaa94f9c8e87925f9d2a695e  books/ModSimPy.pdf
 192e5d38e59d8295ec9ca715e784a6d0  docs-pdf/c-api.pdf
 76c5bfb41bc6e5f9c8da1ab1f915e622  docs-pdf/distributing.pdf
 0a630ec314653c68153f5bbc4446660c  docs-pdf/extending.pdf
@@ -25,3 +24,4 @@
 31e3dc3f78a12c59cdc0426d8e75ec99  docs-pdf/tutorial.pdf
 4c59e969009b6c3372804efdfc99e2d9  docs-pdf/using.pdf
 cf5330f4ed5ca5f63f300ccfa3057825  docs-pdf/whatsnew.pdf
+83322cb1aaa94f9c8e87925f9d2a695e  books/ModSimPy.pdf

After sorting both files by the 2nd column (the path), diff correctly reports no differences:

sort -k2 snapshot1 > sorted.snapshot1
sort -k2 snapshot2 > sorted.snapshot2
diff sorted.snapshot1 sorted.snapshot2

If this doesn't eliminate all the noisy diff output, please post samples of the output you don't want.

I'll take a look and report back. One problem I'm having right now is the speed component. It takes so darn long to gather this information! :) — ylluminate, Sep 01 '19 at 00:56
Well, speed is hard to improve when taking a system snapshot. You have to check almost EVERYTHING, and file access times are NOT reliable. — James Li, Sep 01 '19 at 09:24
Right, so this took over 100 minutes each, but this seems to have worked very well so far. The only problem is that by doing this I've found a couple problems - but the biggest one is that I'm realizing I need a huge exclusion list, which isn't workable directly and needs an exclusion file. I've gone ahead and [asked another question to keep this clean](https://stackoverflow.com/questions/57747393/how-to-convert-a-find-command-to-instead-use-grep-to-filter-and-then-exec-comm) if you're interested. — ylluminate, Sep 01 '19 at 15:47
