So I managed to embed ident IDs in every source file used in the build. Now that I am actually trying to use this information, I have discovered that these IDs have nothing in common with what I get from git log.

My question is:

How can one go back from the pair of filename:ID to fileName:commitId?

  • when you say ID you mean a blob id for the content of said file on a given revision? – eftshift0 Jun 04 '21 at 20:11
  • I mean the ID I can find inside the binary created by having a line #ident "fileName$Id:$" inside the source –  Jun 04 '21 at 20:14
  • If the IDs are in the files themselves, you should play with something like `git log -S the-id -- filename` – eftshift0 Jun 04 '21 at 20:25
  • [Probable duplicate](https://stackoverflow.com/questions/51727566/how-to-make-git-commit-hash-available-in-c-code-without-needless-recompiling). – jthill Jun 04 '21 at 21:44

1 Answer

First, let me quote from the gitattributes documentation:

ident
      When the attribute ident is set for a path, Git replaces $Id$ in the blob object with $Id:, followed by the 40-character hexadecimal blob object name, followed by a dollar sign $ upon checkout. Any byte sequence that begins with $Id: and ends with $ in the worktree file is replaced with $Id$ upon check-in.

So the IDs in the checked-out files are blob hash IDs, not commit hash IDs. The blob hash ID is (currently) the SHA-1 checksum of the file's contents preceded by a header: the literal text blob, a space, the size of the data in bytes written as a decimal ASCII number, and a NUL byte '\0'.

(That's the data with $Id$ in it, not the data with the hash ID inserted, of course. So if the source file consists of $Id$\nhello\n, with \n representing newlines, we want to compute the SHA-1 of the output of:

printf 'blob 11\0$Id$\nhello\n'

since $Id$\nhello\n is 11 bytes long. This blob's hash ID is therefore 173cbef4e466bed5350cae075633cb81d1e01743.)
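The same computation can be reproduced outside Git. The following is a minimal sketch, assuming Python 3; `blob_hash` and `collapse_ident` are hypothetical helper names, and the regular expression only approximates how Git recognizes an expanded keyword.

```python
import hashlib
import re

# Approximation of an expanded keyword such as "$Id: 173cbef4... $".
# Assumption: the expansion sits on one line and contains no "$" inside.
_EXPANDED_ID = re.compile(rb"\$Id:[^$\n]*\$")


def collapse_ident(data: bytes) -> bytes:
    """Turn every expanded "$Id: ... $" back into the bare "$Id$" keyword."""
    return _EXPANDED_ID.sub(b"$Id$", data)


def blob_hash(data: bytes) -> str:
    """SHA-1 of "blob <size>\\0" followed by the data, as a hex string."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    # A worktree file as it looks after checkout, with some expanded ID.
    checked_out = b"$Id: 0123456789abcdef $\nhello\n"
    stored = collapse_ident(checked_out)  # the bytes Git actually stores
    print(len(stored), blob_hash(stored))
```

Feeding the collapsed content to `git hash-object --stdin` should give the same value, which is a quick way to check the sketch against a real Git installation.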

This mapping is not guaranteed to be invertible: the identity information you can extract from the binary may be insufficient to identify one particular commit. For a classic example, consider a program built from a single main.c with:

#ident "$Id$"

but where the Makefile itself has -D options that select something, and main.c has #ifdef FEATURE1 and so on.

Build #1 is made with a Makefile that says -DFEATURE1. Build #2 is made with a Makefile that does not have this -D. These two builds come from different commits, but main.c has the same blob hash ID in both. The two different binaries, each produced by linking the compiled main.o (with its ident line) against libc, therefore carry the same ident string.

The closest you can get is to:

  • collect all the IDs you can get;
  • examine each potential build's source tree to identify the blob hash IDs of the corresponding inputs; and
  • list out all matching commits.

If you're lucky, there's just one matching commit.

The remaining issue is how to do the above. Presumably you will use whatever program you already use to extract the ident info from the binary, for the first bullet point. For the second and third, you must write a script.

The script itself is pretty short: you just need to look through each potential build and extract the corresponding blob hashes. So, find a commit that could be a build, then run git ls-tree -r $commithash to list the blob hash ID of every file in that commit. (Run it once, on one commit, to see the output format; note the blob hash ID for each mode 100644 or mode 100755 file.)
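As a rough illustration of that step, here is a sketch that turns the output of git ls-tree -r into a path-to-blob-hash mapping. It assumes Python 3.7+ and the default output format (mode, type, and hash separated by spaces, then a tab and the path); `parse_ls_tree` and `tree_blobs` are hypothetical names, and paths that Git quotes because of unusual characters are not unquoted here.

```python
import subprocess
from typing import Dict


def parse_ls_tree(output: str) -> Dict[str, str]:
    """Map each regular file's path to its blob hash ID."""
    blobs = {}
    for line in output.splitlines():
        if not line:
            continue
        # Each line looks like: "<mode> <type> <hash>\t<path>"
        meta, _, path = line.partition("\t")
        mode, objtype, objhash = meta.split()
        # Keep ordinary and executable files; skip symlinks and submodules.
        if objtype == "blob" and mode in ("100644", "100755"):
            blobs[path] = objhash
    return blobs


def tree_blobs(commit: str) -> Dict[str, str]:
    """Run git ls-tree -r on one commit in the current repository."""
    result = subprocess.run(
        ["git", "ls-tree", "-r", commit],
        check=True, capture_output=True, text=True,
    )
    return parse_ls_tree(result.stdout)
```

Calling `tree_blobs("HEAD")` inside a repository would then give the blob hash of every tracked file in the current commit.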

Now, match up the known object file "ident"s against the corresponding source file blob hash IDs. How to do this mapping is up to you and depends on your tools and languages used. If all known ident values match all the right sources, $commithash is a candidate hash, so print it.

Repeat for all candidate commits and you will get the best answers you can here.
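The whole loop might be sketched as follows. This is an illustration under stated assumptions, not a finished tool: `matching_commits` is a hypothetical name, `idents` is whatever path-to-hash mapping you extracted from the binary, `commits` could come from something like git rev-list --all, and `tree_of` is any callable that returns the path-to-blob mapping of one commit (for instance, one built on git ls-tree -r).

```python
from typing import Callable, Dict, Iterable, List, Mapping


def matching_commits(
    idents: Mapping[str, str],
    commits: Iterable[str],
    tree_of: Callable[[str], Mapping[str, str]],
) -> List[str]:
    """Return every commit whose tree contains all the known ident blobs."""
    matches = []
    for commit in commits:
        tree = tree_of(commit)
        # A commit is a candidate only if every known path has the same blob.
        if all(tree.get(path) == blob for path, blob in idents.items()):
            matches.append(commit)
    return matches


if __name__ == "__main__":
    # In-memory stand-in for three commits: c1 and c2 differ only in Makefile.
    trees: Dict[str, Dict[str, str]] = {
        "c1": {"main.c": "aaa", "Makefile": "m1"},
        "c2": {"main.c": "aaa", "Makefile": "m2"},
        "c3": {"main.c": "bbb", "Makefile": "m2"},
    }
    # The binary only records main.c's ident, so c1 and c2 both match.
    print(matching_commits({"main.c": "aaa"}, trees, trees.__getitem__))
```

The in-memory example mirrors the main.c and Makefile case above: because the Makefile leaves no ident in the binary, two commits come back as candidates.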

(And, as you can see now, the ident filter is not really very useful: it's much better to use git describe to get a usable identity and stick it into the build output, during the build process.)
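One way the git describe approach could look in a build step is sketched below. The `--always` and `--dirty` options are real git describe options; the macro name `BUILD_VERSION` and both function names are made up for illustration, and how the generated line gets into your build is left to your build system.

```python
import subprocess


def describe_version() -> str:
    """Ask Git for a human-readable identity of the current checkout."""
    result = subprocess.run(
        ["git", "describe", "--always", "--dirty"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def render_version_header(version: str) -> str:
    """Render a one-line C header defining the hypothetical BUILD_VERSION."""
    # Escape backslashes and quotes so the value is a valid C string literal.
    escaped = version.replace("\\", "\\\\").replace('"', '\\"')
    return '#define BUILD_VERSION "%s"\n' % escaped


if __name__ == "__main__":
    # Must be run inside a Git repository.
    print(render_version_header(describe_version()), end="")
```

Unlike a blob hash, the string produced this way names a commit directly, so no reverse lookup is needed later.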

torek
  • You forgot to mention the command to be used to create the hashid. You only stated how to get the input for this command. Or not even that -- from what I read above I've to edit the files by removing the hashid before piping it over some to be specified command. This is ridiculous! –  Jun 07 '21 at 12:15
  • #!/bin/bash read size <<< $(sed -e 's/\$Id: [0-9a-z]* \$/\$Id\$/' $1|wc -c|sed -e 's/ .*//') sed -e 's/\$Id: [0-9a-z]* \$/\$Id\$/' $1|sed -e '1 s/^/blob '$size'\x0/'|sha1sum –  Jun 07 '21 at 13:29
  • Use `git ls-tree -r`, as I said above. Try it, e.g., `git ls-tree -r HEAD`. You're trying to match a set of known hash IDs (output from your ident command or whatever) to a set of known commits. You don't want to get hash IDs on files-as-seen-in-some-hypothetical-commit, you want to get hash IDs as seen in existing, actual commits. – torek Jun 07 '21 at 17:08