Why does git log --find-object get two file commits with different content for a given blob?

Question

I am using git log --find-object to identify commits by providing git file blobs (file content hashes).

This usually works fine; I obtain the blob hash for a file beforehand with git hash-object.

However, sometimes for a given blob hash of a file, git log --find-object=<blob> returns two commits for the same file, where the contents of the file in the returned commits definitely differ.

I would expect to get multiple commits when the corresponding file contents are the same, but getting commits where the content is not exactly the same seems odd to me (based on my current understanding of the --find-object option).

Why is that? How would I have to refine the command?

LeGEC · Accepted Answer · 2023-02-23T10:11:35.347

As stated in the documentation (also refer to the -S and -G options to make sense of it):
with this option, a commit will be mentioned if the number of occurrences of said object changes.

So, if you take the blobid of a file in your repo (say, the blobid of file Readme.md),

git log --find-object=<blobid> will:

1. report commits where this blobid appears as file Readme.md (that's what you expect);
2. report commits where that blob disappears as file Readme.md, e.g. a commit which changed the content of Readme.md from blobid to something else;
3. report commits where this blob appears or disappears at some other path, e.g. at some point, file doc/Doc.md had the exact same blobid;
4. not report commits where a file with that exact content has been renamed, e.g. file doc/Doc.md has been renamed to Readme.md, or Readme.md to doc/Doc.md.
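Case 2 is the one that matches the question, so here is a minimal sketch of it in a throwaway repository. The file names, commit messages and the user identity are made up for illustration; only the git commands themselves come from the discussion above.

```shell
set -eu
# Throwaway repository; the identity below is a placeholder, not a real user.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "unrelated" > other.txt
git add other.txt && git commit -q -m "initial"

echo "version A" > Readme.md
git add Readme.md && git commit -q -m "add Readme.md (version A)"
blobid=$(git rev-parse HEAD:Readme.md)   # blob of "version A"

echo "version B" > Readme.md
git commit -q -am "change Readme.md to version B"

# Lists the commit that introduced the blob AND the commit that replaced it,
# even though Readme.md in the second commit no longer has that content.
git log --oneline --find-object="$blobid"
reported=$(git log --format=%s --find-object="$blobid")
```

The second reported commit is where the blob went away, which is why the file content in that commit differs from the blob you searched for.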

You can run:

git ls-tree -r <commit> | grep <blobid>
# check parent commit too :
git ls-tree -r <commit>^ | grep <blobid>

to see which <commit> contained that blob, and at what path.
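As a rough sketch, the two ls-tree checks can be wrapped in a loop over every commit that --find-object reports. The helper name paths_with_blob and the demo repository are assumptions made for this example, and the loop assumes that none of the reported commits is a root commit (a root commit has no parent to inspect).

```shell
set -eu
# Throwaway demo repository with a placeholder identity.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "unrelated" > other.txt
git add other.txt && git commit -q -m "initial"
echo "shared text" > Readme.md
git add Readme.md && git commit -q -m "add Readme.md"
blobid=$(git rev-parse HEAD:Readme.md)
echo "new text" > Readme.md
git commit -q -am "rewrite Readme.md"

# Hypothetical helper: print every path in <tree-ish> ($1) whose blob is $2.
# ls-tree separates "mode type oid" from the path with a tab.
paths_with_blob() {
    git ls-tree -r "$1" |
        awk -F'\t' -v id="$2" '{ split($1, meta, " "); if (meta[3] == id) print $2 }'
}

# For each reported commit, show where the blob lives before and after it.
for commit in $(git log --format=%H --find-object="$blobid"); do
    echo "commit $(git log -1 --format='%h %s' "$commit")"
    echo "  in parent: $(paths_with_blob "$commit^" "$blobid" | tr '\n' ' ')"
    echo "  in commit: $(paths_with_blob "$commit" "$blobid" | tr '\n' ' ')"
done

before_rewrite=$(paths_with_blob HEAD^ "$blobid")
after_rewrite=$(paths_with_blob HEAD "$blobid")
```

An empty "in commit" line next to a non-empty "in parent" line marks a commit that removed or replaced the blob.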

If you want to check what modified the precise path Readme.md, you can add it as a filter to git log:

git log --find-object=<blobid> -- Readme.md

This will get rid of cases 3. and 4. above.
You would still see commits where the content you are looking for is in the parent commit (case 2. above).
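To illustrate the effect of the path filter, the sketch below puts the same content at two paths and compares the unfiltered and filtered output. The repository layout and commit messages are invented for the example.

```shell
set -eu
# Throwaway demo repository with a placeholder identity.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "unrelated" > other.txt
git add other.txt && git commit -q -m "initial"
echo "shared text" > Readme.md
git add Readme.md && git commit -q -m "add Readme.md"
blobid=$(git rev-parse HEAD:Readme.md)

# Same content at a second path, hence the same blob id.
mkdir doc && cp Readme.md doc/Doc.md
git add doc/Doc.md && git commit -q -m "add doc/Doc.md with the same content"

echo "new text" > Readme.md
git commit -q -am "rewrite Readme.md"

echo "--- without a path filter:"
git log --oneline --find-object="$blobid"
echo "--- limited to Readme.md:"
git log --oneline --find-object="$blobid" -- Readme.md

all=$(git log --format=%s --find-object="$blobid")
limited=$(git log --format=%s --find-object="$blobid" -- Readme.md)
```

The commit that only touched doc/Doc.md drops out of the filtered list, while the commit that rewrote Readme.md (case 2) is still there.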

A rename won't change the number of occurrences of some blob hash ID, so this doesn't find renames unless you limit the `git log` to looking for a particular pathspec (without `--follow`). — torek, Oct 07 '20 at 22:28
@LeGEC: Thanks, I was unaware of the 'disappearing' counting part of git log. Not super intuitive - but then again, it is git... :) — Rotax, Oct 09 '20 at 15:22
I reworded my answer, to take into account @torek's remark, and hopefully make it clearer for other people. — LeGEC, Oct 09 '20 at 15:46

score 1 · Answer 2 · answered Oct 07 '20 at 22:25

This is a little more precise than LeGEC's answer although that one covers the most common case. What git log --find-object does is find commits where, from parent to child, the commit changes the number of occurrences of that particular blob.

Suppose, for instance, we create a new empty repository with one initial commit with a README file:

$ mkdir tlog
$ cd tlog
$ git init
Initialized empty Git repository in [path]
$ echo test find-object stuff > README
$ git add README
$ git commit -m initial
[master (root-commit) 2177143] initial
 1 file changed, 1 insertion(+)
 create mode 100644 README

Now let's create a blob, commit it, and observe its hash ID:

$ echo file content > afile
$ git add afile
$ git commit -m 'add some content'
[master 45c4e39] add some content
 1 file changed, 1 insertion(+)
 create mode 100644 afile
$ git rev-parse HEAD:afile
dd59d098638313f5d00a7fa657379b33b191f2e2
$ blobid=$(git rev-parse HEAD:afile)

Now let's make a commit that doesn't change the number of files that have that blob hash ID, by adding a file with different content, then add a third file with the same content—hence same blob hash ID—as the first file:

$ echo different > bfile
$ git add bfile && git commit -m 'add different content'
[master c5a5306] add different content
 1 file changed, 1 insertion(+)
 create mode 100644 bfile
$ cp afile cfile && git add cfile 
$ git commit -m 're-add same content as afile, ie, same blob id'
[master 20c97e5] re-add same content as afile, ie, same blob id
 1 file changed, 1 insertion(+)
 create mode 100644 cfile
$ git rev-parse HEAD:cfile
dd59d098638313f5d00a7fa657379b33b191f2e2

As you can see, the same hash ID comes up again. (In fact, any repository with a file that matches my afile or cfile has that blob hash ID in it! The commits will have unique hash IDs, but any file whose content is the text "file content" plus a single newline will have blob hash ID dd59d098638313f5d00a7fa657379b33b191f2e2.)
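A quick way to check this claim without any repository at all is to feed the same bytes to git hash-object, the command mentioned in the question. This sketch assumes the default SHA-1 object format.

```shell
set -eu
# Hash the bytes "file content" + newline as a blob, without writing anything.
blobid=$(printf 'file content\n' | git hash-object --stdin)
echo "$blobid"
```

If the printed ID matches the one shown by git rev-parse HEAD:afile above, the blob ID depends only on the content, not on the file name or the repository.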

Now let's look at git log --oneline and git log --oneline --find-object=$blobid output:

$ git log --oneline
20c97e5 (HEAD -> master) re-add same content as afile, ie, same blob id
c5a5306 add different content
45c4e39 add some content
2177143 initial
$ git log --oneline --find-object=$blobid
20c97e5 (HEAD -> master) re-add same content as afile, ie, same blob id
45c4e39 add some content

We see commit 45c4e39 in both cases because comparing 2177143 initial to 45c4e39 add some content shows that the number of files that have $blobid as their object hash has gone from zero to one. We see 20c97e5 because comparing that commit to its parent, c5a5306, shows that the number of files has gone from 1 to 2. If we remove one copy, the count will change again and we'll see that commit. If we remove both copies, the count will change (to zero) and we'll see that commit.

What we're seeing, in other words, is every commit in which the count of blob objects with the given hash ID changes.
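The removal cases mentioned above can be tried out with a sketch like the following, which replays the same steps in a throwaway repository and then removes the two copies one at a time. The two "remove" commits and the placeholder identity are additions for illustration; the per-commit count is computed with a plain ls-tree and grep, not by anything git log does internally.

```shell
set -eu
# Throwaway demo repository with a placeholder identity.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo test find-object stuff > README
git add README && git commit -q -m initial
echo file content > afile
git add afile && git commit -q -m 'add some content'
blobid=$(git rev-parse HEAD:afile)
echo different > bfile
git add bfile && git commit -q -m 'add different content'
cp afile cfile && git add cfile
git commit -q -m 're-add same content as afile, ie, same blob id'

# Extra steps, not in the session above: drop the copies one by one.
git rm -q cfile && git commit -q -m 'remove cfile'
git rm -q afile && git commit -q -m 'remove afile'

# Number of paths holding the blob in each commit, oldest first.
git log --reverse --format='%H %s' | while read -r hash subject; do
    count=$(git ls-tree -r "$hash" | grep -c "$blobid" || true)
    echo "$count  $subject"
done

echo "--- commits reported by --find-object:"
git log --oneline --find-object="$blobid"
reported=$(git log --format=%s --find-object="$blobid" | wc -l | tr -d ' ')
```

Following the explanation above, the commits where the printed count differs from the previous line should be the ones that --find-object lists.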

There's a bug, of sorts, in this git log option: it relies on the fact that each of these commits has one single parent. If we have a merge commit—a commit with two or more parents—Git has to compare the blob hash IDs in the merge to both parents. Perhaps the count changes in one comparison but not in the other. What should Git do with this? Git's current answer is that it craps out completely here—hence "bug of sorts"—but with a fix that's in the queue, you get something that's better but still imperfect, as there's no obvious Right Answer for this case. (The bug is that Git is going through a special code path in git log that's meant to handle History Simplification, and that's the wrong thing to do here. The proposed fix makes Git go through a more suitable path, so that you'll at least see that the merge has some change in the count, which is clearly significantly better. But that leaves other cases for other options that don't always work right, too. Git needs a general solution for diffs-across-merges, and that requires a framework that currently does not exist.)

score 1 · Answer 3 · answered Oct 12 '20 at 13:21

Note that the result of that command might change with Git 2.29 (Q4 2020): "git log -c --find-object=X" did not work well to find a merge that involves a change to an object X from only one parent.

See commit 957876f (30 Sep 2020) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 7da656f, 05 Oct 2020)

combine-diff: handle --find-object in multitree code path

Signed-off-by: Jeff King

When doing combined diffs, we have two possible code paths:

- a slower one which independently diffs against each parent, applies any filters, and then intersects the resulting paths

- a faster one which walks all trees simultaneously

When the diff options specify that we must do certain filters, like pickaxe, then we always use the slow path, since the pickaxe code only knows how to handle filepairs, not the n-parent entries generated for combined diffs.

But there are two problems with the slow path:

1. It's slow. Running:

git rev-list HEAD | git diff-tree --stdin -r -c

in git.git takes ~3s on my machine.
But adding "--find-object" to that increases it to ~6s, even though find-object itself should incur only a few extra oid comparisons.
On linux.git, it's even worse: 35s versus 215s.

2. It doesn't catch all cases where a particular path is interesting.
Consider a merge with parent blobs X and Y for a particular path, and end result Z. That should be interesting according to "-c", because the result doesn't match either parent. And it should be interesting even with "--find-object=X", because "X" went away in the merge.

But because we perform each pairwise diff independently, this confuses the intersection code. The change from X to Z is still interesting according to --find-object. But in the other parent we went from Y to Z, so the diff appears empty! That causes the intersection code to think that parent didn't change the path, and thus it's not interesting for "-c".

This patch fixes both by implementing --find-object for the multitree code.

It's a bit unfortunate that we have to duplicate some logic from diffcore-pickaxe, but this is the best we can do for now. In an ideal world, all of the diffcore code would stop thinking about filepairs and start thinking about n-parent sets, and we could use the multitree walk with all of it.

Until then, there are some leftover warts:

- other pickaxe operations, like -S or -G, still suffer from both problems.
  These would be hard to adapt because they rely on having a diff_filespec() for each path to look at content. And we'd need to define what an n-way "change" means in each case (probably easy for "-S", which can compare counts, but not so clear for -G, which is about grepping diffs).

- other options besides --find-object may cause us to use the slow pairwise path, in which case we'll go back to producing a different (wrong) answer for the X/Y/Z case above.

We may be able to hack around these, but I think the ultimate solution will be a larger rewrite of the diffcore code.
For now, this patch improves one specific case but leaves the rest.
