Many ideas come up, ranging from the naive (I will check out each rev, run diff -rUN, diffstat it, condense it to a number...), which is not workable when you have thousands of files and thousands of commits to cover, through the insane (I will run "Which commit has this blob?" over every file and commit, put it in some database and write some query...), to an actually workable one loosely based on the linked answer.
The idea is to first store the hashes of the current files, then list the hashes of every blob in a given commit, compare the two lists and score the match.
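This works because a Git blob ID depends only on the file's contents, not on its path or on the commit it appears in. For repositories using the default SHA-1 object format, the ID is the SHA-1 of a short header followed by the contents, which can be checked even without git. The sketch below assumes sha1sum from coreutils is available; blob_id is a hypothetical helper name, not a git command.

```shell
#!/bin/sh
# blob_id: hypothetical helper, prints the SHA-1 blob ID of a file.
# Assumes a SHA-1 repository format and sha1sum from GNU coreutils.
blob_id() {
    size=$(wc -c < "$1" | tr -d ' ')
    # A blob is hashed as: "blob <size in bytes>" + NUL byte + contents.
    { printf 'blob %s\000' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

tmp=$(mktemp -d)
printf 'hello\n' > "$tmp/a.txt"
# Same value as: git hash-object "$tmp/a.txt"
blob_id "$tmp/a.txt"   # prints ce013625030ba8dba906f756967f9e9ca394464a
rm -rf "$tmp"
```

Two files with identical bytes therefore get the same ID wherever they live, which is what makes a plain string comparison of hashes sufficient.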
- The scoring program is simply grep: it can read a list of strings (patterns even, but we have strings) and count how many input lines contain any of them.
- git ls-tree -r dumps the blob hashes in a commit (and more, but we do not care about that).
- git hash-object produces, for a file on disk, the same hash that git ls-tree shows for a blob with identical contents.
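To make the scoring step concrete, here is a toy run on made-up data; it needs no repository. The hashes, paths and temporary file names are invented for illustration, and the input lines only imitate the shape of git ls-tree -r output.

```shell
#!/bin/sh
# Toy run of the scoring step. All hashes and paths below are made up.
tmp=$(mktemp -d)

# Hashes of the files we are trying to place (stand-in for hashes.txt).
cat > "$tmp/hashes.txt" <<'EOF'
1111111111111111111111111111111111111111
2222222222222222222222222222222222222222
3333333333333333333333333333333333333333
EOF

# Lines in the shape printed by git ls-tree -r: mode, type, hash, TAB, path.
printf '100644 blob %s\t%s\n' \
    1111111111111111111111111111111111111111 README \
    2222222222222222222222222222222222222222 src/main.c \
    4444444444444444444444444444444444444444 src/util.c \
    > "$tmp/tree.txt"

# -F: fixed strings, -f: read the strings from a file, -c: count matching lines.
score=$(grep -c -F -f "$tmp/hashes.txt" "$tmp/tree.txt")
echo "$score"   # prints 2: two of the three listed blobs are in hashes.txt
rm -rf "$tmp"
```

Because -c counts lines, a blob that appears under several paths in the same commit is counted once per path, so the score is a close proxy for "matching files" rather than an exact set intersection.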
I used a tmpfs -- while premature optimization might be the root of all evil, this optimization costs so little effort that I found it worthwhile. I had this script, script.sh, in the root of the tmpfs:
#!/bin/sh
# Print "<number of matching blobs> <commit>" for the commit given as $1.
echo "$(git ls-tree -r "$1" | grep -c -F -f ../hashes.txt) $1"
and put the problematic codebase under mess and the pristine git clone under base.
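cd mess
# Hash every file of the problematic codebase.
find . -type f -print0 | xargs -0 -P8 git hash-object >> ../hashes.txt
cd ../base
# Score every commit in parallel; the best matches end up last.
git log --all --format=%H | xargs -n1 -P8 ../script.sh | sort -n | tail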
This finished in a few minutes (I cheated a little because I had some date limits on git log, but given the use case it's likely you will have those too). My output looks like this:
9548 0ceb441a75cd4cd11427da2b37efd49c99f9e562
9549 8f2c0537da72bb7ca866e6847bf887811ab3c72e
9550 5cd36afbe23310c17caf4075d29c70a4b2252295
9550 8da13e6c60255d2b8008d8de3d3e64de91d2bf7a
9551 2be39c73876f9d22f8cea40777d082e3fba4cbd4
Clearly 2be39c7 has 9551 matching files, and it's not some broken outlier, as the "neighbouring" commits have very similar but lower numbers.
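The date limits mentioned earlier can be expressed with git log's --since and --until options, which filter on commit date. A minimal sketch follows; list_candidates is a hypothetical helper name, and the dates in the usage comment are placeholders, since the limits actually used are not given here.

```shell
#!/bin/sh
# list_candidates: hypothetical helper; prints the IDs of commits reachable
# from any ref whose commit date falls between the two given dates.
list_candidates() {
    git log --all --since="$1" --until="$2" --format=%H
}

# Usage inside the clone (placeholder dates, not the ones from the run above):
#   list_candidates 2015-01-01 2016-01-01 | xargs -n1 -P8 ../script.sh | sort -n | tail
```

Narrowing the range this way cuts the number of git ls-tree calls, which is where nearly all of the time goes.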