Which Git commit is this codebase derived from?

Question

As a consultant I often find myself in the situation where I inherit a mess of a codebase, started from a git clone or dev tarball at an unknown point in time. How do I find which commit it started from?

The codebase is not an exact Git checkout: files have been edited, added, and so on.

To clarify, imagine the codebase you inherited contains a subdirectory called bootstrap. It clearly contains the Bootstrap project, and you'd like to update it. All you know is that at some point twbs/bootstrap was downloaded either by git clone git@github.com:twbs/bootstrap.git or by downloading https://github.com/twbs/bootstrap/archive/v4-dev.zip.

After this initial action, some indiscriminate hacking occurred in this subdirectory where files were changed, deleted and added. I would like to update it to the latest version. To do so, I would like to find out which Git hash the initial download corresponds to.

From the consultant's point of view, is the "mess" something that needs to be found due to legal matters, or is this a matter of attributing blame? — Makoto, Nov 25 '16 at 03:55
I wonder if it's worth the time to dig through Git to find the commit which made this particular file better than to simply rewrite it, but I won't disagree with your rationale any further. — Makoto, Nov 25 '16 at 03:59
Possible duplicate of [How to find a commit that corresponds to a project revision that is not under git control?](http://stackoverflow.com/questions/36268756/how-to-find-a-commit-that-corresponds-to-a-project-revision-that-is-not-under-gi) — Andrew C, Nov 25 '16 at 16:42
The other question will only find the *exact* checkout! My problem is much more difficult. — chx, Nov 25 '16 at 20:19

score 2 · Answer 1 · edited May 23 '17 at 12:17

Many ideas come to mind. The naive one (check out each rev, run diff -rUN, diffstat it, condense it to a number...) is not workable when you have thousands of files and thousands of commits to cover. At the other extreme is the insane one (run Which commit has this blob? over every file and commit, put the results in some database and write some query...). In between is an actually workable one, loosely based on the linked answer.

The idea is to first store the hashes of the current files, then list the hashes of every blob in a given commit, compare the two lists and score the match.
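To make the scoring step concrete, here is a toy run on made-up data. The 40-character hashes, the file paths and the temporary file names are all placeholders invented for illustration; only the grep invocation mirrors the approach described here.

```shell
#!/bin/sh
# Toy data only: every hash and path below is made up for illustration.
workdir=$(mktemp -d)

# Hashes of the files in the inherited tree, one per line.
cat > "$workdir/hashes.txt" <<'EOF'
1111111111111111111111111111111111111111
2222222222222222222222222222222222222222
3333333333333333333333333333333333333333
EOF

# Stand-in for the listing of one commit: mode, type, hash, TAB, path.
{
    printf '100644 blob 1111111111111111111111111111111111111111\tREADME.md\n'
    printf '100644 blob 2222222222222222222222222222222222222222\tcss/a.css\n'
    printf '100644 blob 4444444444444444444444444444444444444444\tjs/b.js\n'
    printf '100644 blob 5555555555555555555555555555555555555555\tjs/c.js\n'
} > "$workdir/tree.txt"

# -F: fixed strings, -f: read them from a file, -c: count matching lines.
score=$(grep -c -F -f "$workdir/hashes.txt" "$workdir/tree.txt")
echo "$score matching blobs"   # prints: 2 matching blobs

rm -rf "$workdir"
```

Two of the four blobs in the fake commit also exist in the inherited tree, so the score is 2. The third hash in hashes.txt stands for a file that was added or edited locally and therefore matches nothing. One thing to watch for: an empty line in the pattern file matches every input line, which would inflate every score equally.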

The scoring program is simply grep: it can read a list of strings from a file (patterns even, but we only have fixed strings) and count how many input lines contain any of them.
git ls-tree -r dumps the blob hashes in a commit (and more, but we do not care about that).
git hash-object produces the same hash as git ls-tree for existing files.
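The reason the two commands agree is that a blob ID depends only on the file's contents, not on its path or on the commit it belongs to. As a rough sketch, the ID can be reproduced without Git at all. This assumes a SHA-1 repository (the Git default) and that sha1sum is installed; git_blob_hash is a hypothetical helper name, not a Git command.

```shell
#!/bin/sh
# Sketch: recompute a Git blob ID by hand.
# Assumes SHA-1 object IDs and the sha1sum utility from coreutils.
git_blob_hash() {
    # Git hashes the header "blob <size in bytes>" plus a NUL byte,
    # followed by the raw file contents.
    size=$(wc -c < "$1" | tr -d ' ')
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}

tmpfile=$(mktemp)
: > "$tmpfile"                      # make sure the file is empty
empty_hash=$(git_blob_hash "$tmpfile")
echo "$empty_hash"                  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
rm -f "$tmpfile"
```

The printed value is the well-known ID of the empty blob. Because the path never enters the hash, a file that was moved or renamed in the inherited tree but left unedited still counts as a match.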

I used a tmpfs. Premature optimization might be the root of all evil, but this optimization costs so little effort that it was worth doing. I had this script (script.sh) in the root:

#!/bin/sh
# Print "<number of matching blobs> <commit>" for the commit passed as $1.
echo "$(git ls-tree -r "$1" | grep -c -F -f ../hashes.txt) $1"

and put the problematic codebase under mess and the pristine git clone under base.

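# Hash every file of the inherited tree, 8 processes in parallel.
cd mess
find . -type f -print0 | xargs -0 -P8 git hash-object >> ../hashes.txt
# Score every commit of the pristine clone and show the ten best.
cd ../base
git log --all --format=%H | xargs -n1 -P8 ../script.sh | sort -n | tail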

This finished in a few minutes. I cheated a little by limiting git log to a date range, but given the use case it's likely you will be able to do the same. My output looks like this:

9548 0ceb441a75cd4cd11427da2b37efd49c99f9e562
9549 8f2c0537da72bb7ca866e6847bf887811ab3c72e
9550 5cd36afbe23310c17caf4075d29c70a4b2252295
9550 8da13e6c60255d2b8008d8de3d3e64de91d2bf7a
9551 2be39c73876f9d22f8cea40777d082e3fba4cbd4

Clearly 2be39c7 has 9551 matching files, and it's not some broken outlier, as the "neighbouring" commits have very similar but lower numbers.
