
In my Go project, I have a local copy of https://github.com/HouzuoGuo/tiedot. It was probably made manually (or with go get) a couple of years ago.

I cannot tell which version/tag was checked out, since that was not recorded anywhere.

Is there any way for me to find the commit hash from hash of individual files? For example, some of the hashes are below:

github.com/HouzuoGuo/tiedot/db> shasum *.go
79b42b7af9784255b39b4307950709880df4a86f  col.go
b5f5a127c990229e8ac085eb8e7c72d0e6617e1c  col_test.go
be45a7eae65803df2dc31e23db7eb27bcffa17cc  db.go
290c32d11498aacb0456117f2bffa8e7ab74ccd8  db_test.go
3d0e0dc06fbd8191b5d68b32b4ac4200444e98f2  doc.go
f15745867ccfcb8609194b617cc6e8911174dad9  doc_test.go
40fcd698a680b39bd8405b9bc62d0f4b99411cbf  idx_test.go
d1c481d7d75140b229440819bb21eb64095a7b35  query.go
c83114227dc59100de953ffceb4398e4d8a6075b  query_test.go

Once I have the commit hash, I can add it to my go.mod file using something like go get github.com/HouzuoGuo/tiedot@<hash>

Based on suggestions from @torek below, I cloned the code from GitHub and wrote a sample script that reads all the commits and checks whether the hash of one of the files matches. This does not work, though. What am I missing?

COMMITS=$(git rev-list --all)

for COMMIT_HASH in $COMMITS
do
    TREE_HASH=$(git cat-file -p $COMMIT_HASH | grep tree | cut -d' ' -f2)
    if [[ -z "$TREE_HASH" ]]; then
        echo "Tree hash is empty"
        continue
    fi

    DB_DIR_HASH=$(git cat-file -p $TREE_HASH | grep '[[:space:]]db$' | awk '{print $3}')
    if [[ -z "$DB_DIR_HASH" ]]; then
        echo "db dir hash is empty"
        continue
    fi

    DBGO_HASH=$(git cat-file -p $DB_DIR_HASH | grep db.go | awk '{print $3}')
    if [[ -z "$DBGO_HASH" ]]; then
        echo "db.go hash is empty"
        continue
    fi

    if [[ "$DBGO_HASH" == "be45a7eae65803df2dc31e23db7eb27bcffa17cc" ]]; then
        echo "db.go hash matched!!!   Commit $COMMIT_HASH"
    fi
done
Amol

1 Answer


Is there any way for me to find the commit hash from hash of individual files?

The bad news: no, because the commit hash depends not only on the files themselves but also on the commit's metadata (author, committer, timestamps, message, and parent hashes).

The good news: you don't need to do that, as you can simply go the other direction, from commit hash to files. That is, with a clone of the repository, walk the commit graph. For each commit you find in the process, compare the saved source snapshot to the set of files you care about.

Edit 2: Make sure the checksum you're using is the one Git would use, not the one produced by running shasum or any similar command. That is, use the git hash-object command to compute the hash IDs of the objects for which you will search. (The default is to compute a blob hash ID so you can just run git hash-object db/db.go for instance.)
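To make the difference concrete, here is a minimal sketch of what git hash-object computes for a blob: the SHA-1 of the header "blob <size>" plus a NUL byte, followed by the file contents. The helper names git_blob_sha1 and sha1_stdin are made up for illustration, and the sketch assumes that sha1sum or shasum is installed and that no line-ending conversion or other filters apply to the file.

```shell
#!/bin/sh
# Hypothetical helpers (not part of Git): reproduce the blob ID that
# `git hash-object FILE` prints, using only a plain SHA-1 tool.

# Hash stdin with whichever SHA-1 command is installed; print only the digest.
sha1_stdin() {
    if command -v sha1sum >/dev/null 2>&1; then
        sha1sum
    else
        shasum
    fi | cut -d' ' -f1
}

# Git hashes "blob <size-in-bytes>\0" followed by the raw file contents.
git_blob_sha1() {
    size=$(wc -c < "$1" | tr -d ' ')
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1_stdin
}
```

Running git_blob_sha1 db/db.go should print the same ID as git hash-object db/db.go, whereas shasum db/db.go hashes the raw bytes without the header and therefore never matches anything stored in the repository.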

You may find more than one match (which is why this is not invertible): for instance, perhaps v2.4.2 and v2.4.4 both match because v2.4.3 was broken and the bug was reverted to make v2.4.4. But that's not important, as long as the result works for you.

To compare the hashes of the sources you care about, use git ls-tree -r on the commit in question. Use git rev-list to enumerate commit hash IDs. If you have a full tree, you can speed things up by computing the tree hash and comparing the result of git rev-parse $commit^{tree} for each $commit value, rather than comparing all the file hashes of some known subset of files, but either way this should go pretty fast.
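The git ls-tree -r approach can be sketched as follows. This is only an illustration under some assumptions: the function name commits_with_all_blobs and the "wanted" file are hypothetical, the wanted file lists each path once, paths are relative to the repository root, and none of them contain characters that git ls-tree would quote.

```shell
#!/bin/sh
# Hypothetical helper: list commits whose snapshot contains every wanted blob.
# Usage: commits_with_all_blobs WANTED_FILE [git rev-list options...]
# WANTED_FILE has one "<blob-id> <path>" line per file, e.g. produced with
#   for f in db/*.go; do echo "$(git hash-object "$f") $f"; done > wanted.txt
commits_with_all_blobs() {
    wanted=$1
    shift
    # Default to every reachable commit when no rev-list options are given.
    if [ $# -eq 0 ]; then set -- --all; fi
    total=$(grep -c . "$wanted")
    git rev-list "$@" | while read -r commit; do
        # ls-tree -r prints "<mode> <type> <id><TAB><path>" for every file;
        # rewrite each line as "<id> <path>" and count exact matches.
        found=$(git ls-tree -r "$commit" |
            awk -F'\t' '{ split($1, meta, " "); print meta[3] " " $2 }' |
            grep -c -x -F -f "$wanted")
        if [ "$found" -eq "$total" ]; then
            echo "$commit"
        fi
    done
}
```

Passing --no-walk --tags instead of --all restricts the search to tagged commits, which is usually much faster and gives results that map directly onto a version for go.mod.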

Edit: I'm not sure what is going wrong with your script, but here is a much simpler variant:

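git rev-list --branches |
while read -r commit; do
    # Look up the blob ID of db/db.go in this commit; skip commits without the file.
    h=$(git rev-parse --quiet --verify "$commit:db/db.go") || continue
    # Substitute the ID printed by `git hash-object db/db.go`; the shasum value
    # from the question (shown here) will not match.
    if [ "$h" = be45a7eae65803df2dc31e23db7eb27bcffa17cc ]; then
        echo "db/db.go hash matched in commit $commit"
    fi
done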

Note that the file may be in many commits! When I ran a variant of this on the Git repository for Git, looking for hash ID d2632690d5107b53ee8a7ac4832cd85eb8c7bfc1 of levenshtein.c, I got 18132 commits matched (which took about ten minutes, scanning through just over 60000 commits). But, it's possible that the hash ID is in no commit: a fast way to check is to use the option in jthill's comment: git log --find-object=hash (with --all or --branches or whatever). If this turns up at least one match, then at least one commit has the object; the script will find all commits that have the object.
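The quick existence check could look like the sketch below. The wrapper name commits_touching_blob is hypothetical; --find-object itself is a real git log option (it needs a reasonably recent Git), and it reports commits where the number of occurrences of the object changes, that is, commits that add or remove it, not every commit that contains it.

```shell
#!/bin/sh
# Hypothetical wrapper around `git log --find-object`.
# Usage: commits_touching_blob BLOB_ID [git log revision options...]
# Prints the hash of each commit that adds or removes the given blob.
commits_touching_blob() {
    blob=$1
    shift
    # Default to searching every ref when no revision options are given.
    if [ $# -eq 0 ]; then set -- --all; fi
    git log --format=%H --find-object="$blob" "$@"
}
```

For example, commits_touching_blob "$(git hash-object db/db.go)" prints nothing when no commit ever introduced that exact version of the file, in which case the slower per-commit scan is not worth running.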

Using git rev-list --tags --no-walk found 181 commits in about 8 seconds:

$ time git rev-list --tags --no-walk | while read commit; do h=$(git rev-parse --quiet --verify $commit:levenshtein.c) || continue; test $h = d2632690d5107b53ee8a7ac4832cd85eb8c7bfc1 && echo "found in $commit"; done | wc -l
     181

real    0m7.810s
user    0m2.449s
sys     0m3.434s

The same thing without the script finds 772 tagged commits in 0.046s, so this script fragment handles about 100 commits per second on my old Mac laptop. (I used this to back-estimate the 10 minutes: I know it was slow!)

torek
  • "Walk the commit graph and compare the saved source snapshot": this sounds like going through all commits, which is a huge number. I have the full tree; how do I get the tree hash of my local tree? And what is the $commit value in the git rev-parse command? – Amol Oct 27 '20 at 23:48
  • You can, if you believe that there's a specific tag that's correct, enumerate only the *tagged* commits. Even if you walk every commit (since there might not be a tag), processing a few tens of thousands of commits should take only a few minutes at most. The shell code for this is `git rev-list | while read commit; do ...; done` and `$commit` is the hash ID read by the `read`. – torek Oct 28 '20 at 02:15
  • The set of options to pass to rev-list depends on whether you want to examine only tagged commits (`--no-walk --tags`) or all reachable commits (`--all`) or whatever. The `...` section is the test you come up with, based on hash or hashes. – torek Oct 28 '20 at 02:17
  • torek, thanks for the help. I wrote a sample script based on your steps but this does not work. Any suggestions? Added the script to the question. – Amol Oct 29 '20 at 21:18
  • Aha, I just looked at your question again and know where you have gone wrong: `shasum *.go`. This computes the SHA-1 of each of those `*.go` files, but Git doesn't use the SHA-1 checksum of the *files*. It uses the SHA-1 checksum of the (Git) *objects*. Use `git hash-object` on each `*.go` file to compute the *Git* checksum of the *object* that results from saving the given file. – torek Oct 29 '20 at 21:26
  • Voilà! That does the trick. Now I am able to find multiple commits that match this hash. I will try to narrow it down from those. Is it possible to take the hash of my entire local tree to make this more accurate? – Amol Oct 29 '20 at 21:35
  • Getting a *tree* hash is a much more difficult proposition. It's not impossible (see, e.g., [my Python script that does this](https://github.com/chris3torek/scripts/blob/master/githash.py)), but any file that is present in the repository and not in the directory you have, or in the directory and not in the repository, will make the tree object hashes come out wrong. It could be worth doing for a fast check, but it's more likely that you'll have to find commits that contain *all* the desired hashes, and call those "good enough". – torek Oct 29 '20 at 21:47
  • This has been educational. Thanks for the help, torek! – Amol Oct 30 '20 at 01:11
  • By the way, see [this answer](https://stackoverflow.com/a/48588021/1290731) for a sped-up search that does all the rev-parses in a batch. – jthill Nov 04 '20 at 06:44
  • @jthill: That `cat-file --batch-check=` trick, with `%(rest)` as one of the formats, is pretty good. :-) – torek Nov 04 '20 at 09:02
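A rough sketch of the batched idea from the last two comments, not copied from the linked answer: feed "<commit>:<path> <commit>" lines to a single git cat-file --batch-check process, and use %(rest) to carry the commit hash through to the output. The function name batch_find_blob is hypothetical, and the sketch assumes the path contains no whitespace.

```shell
#!/bin/sh
# Hypothetical helper: find commits in which PATH has the given blob ID,
# using one `git cat-file` process instead of one `git rev-parse` per commit.
# Usage: batch_find_blob BLOB_ID PATH [git rev-list options...]
batch_find_blob() {
    blob=$1
    path=$2
    shift 2
    if [ $# -eq 0 ]; then set -- --all; fi
    git rev-list "$@" |
        # Emit "<commit>:<path> <commit>"; text after the first space is %(rest).
        awk -v p="$path" '{ print $0 ":" p " " $0 }' |
        git cat-file --batch-check='%(objectname) %(rest)' |
        # Commits lacking the path yield "<spec> missing", which never matches.
        awk -v want="$blob" '$1 == want { print $2 }'
}
```

For example, batch_find_blob "$(git hash-object db/db.go)" db/db.go --no-walk --tags checks only the tagged commits.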