Where does the common ancestor of merged commits reside in Git (when there's criss-cross merge)?

Question

As I understand it from @torek's thorough answers (for example, here), when Git merges two branches (commits) and runs into a conflict, there are three non-empty "slots" in the index: the 1st stores the common (base) version of each file, the 2nd the local (ours) version, and the 3rd the remote (theirs) version.

This is easy to picture because all these versions are actually stored as blob objects in the objects directory, so Git can extract the versions it needs into its index from these blob objects.

However, what if a criss-cross merge occurs? In this case there are multiple merge bases, and Git merges them all into one commit, treating the new commit as the single base (see this).

I found Git does not create any blob object for this new commit. Does it mean the base versions exist nowhere but in the working memory (the 1st slot of the index for Git)? And so, if my computer shuts down and I have an unresolved conflict, would the generated base commit be lost?

Why do you think that git doesn't create blobs for this new commit? The index always refers to items that exist in the object directory. — Edward Thomson, Nov 03 '20 at 16:15

Mark Adelsberger · Accepted Answer · 2020-11-04T01:18:17.103

Updated per comments (don't have time right now to research this, so I'm assuming what ET notes is accurate...)

First a quick aside that this is all premised on the recursive merge strategy, and on the rather specific circumstance that it will try to compute a merge base from multiple candidates.

With that, it's easiest to see what's happening if we contrive to have all these merges affecting a single file. "Easiest" is a relative term, but here we go...

O -- A - M2 <--(master)
 \     X
  ---B - M1 <--(dev)

In O we have a file named foo. A and B each change this file in a way that does not conflict. (The absence of a conflict may not be strictly necessary; apparently even a conflicted result can serve as the computed merge base in the recursive strategy. But it makes it easier to see what's going on if you want to reproduce these steps and look at the repo.)

M1 is an "evil merge" of A and B; although A and B would merge cleanly, an additional change to foo is made when creating M1.

Likewise, M2 is an "evil merge" of A and B; the added change to foo in M2 conflicts with the added change to foo in M1.

Now if we try to merge dev with master then foo is in conflict; and sure enough

git cat-file -p :1:foo

will show us a file that we never stored before - the result of a clean merge between the versions of foo in A and B.
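To see why this history yields two merge bases rather than one, here is a minimal sketch that models the diagram above as a plain parent map and computes the "best" common ancestors of M1 and M2. The `PARENTS` dictionary and both function names are made up for illustration; this is not how Git itself stores or walks history.

```python
# Toy model of the commit graph above: each commit maps to its parents.
# The names and the dictionary layout are illustrative assumptions only.
PARENTS = {
    "O": [],
    "A": ["O"],
    "B": ["O"],
    "M1": ["B", "A"],  # dev: merge of A into B
    "M2": ["A", "B"],  # master: merge of B into A
}


def ancestors(commit):
    """Return the commit itself plus everything reachable through parents."""
    seen = set()
    stack = [commit]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def merge_bases(x, y):
    """Common ancestors that are not themselves ancestors of another one."""
    common = ancestors(x) & ancestors(y)
    return sorted(
        c for c in common
        if not any(c != d and c in ancestors(d) for d in common)
    )


print(merge_bases("M1", "M2"))  # ['A', 'B']
```

O is also a common ancestor, but it is reachable from both A and B, so it drops out. That leaves A and B as two equally good candidates, which is exactly the situation where the recursive strategy has to compute a base of its own.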

So now to your question:

I found Git does not create any blob object for this new commit

That statement is a little confusing; commits are not represented by BLOB objects. It's true that there is no COMMIT object representing the calculated merge base.

But there is a new BLOB for :1:foo - just as there would be for any index entry[1]. There's also (in my test) a TREE object to contain the BLOB. In other words, the entire content of the calculated merge base is, in fact, stored in .git/objects.

Does it mean the base versions exist nowhere but in the working memory (the 1st slot of the index for Git)?

The base versions exist in the 1st slot of the index, but that is most definitely not "working memory". There generally isn't even a git process running during the time you're resolving conflicts - so no process whose working memory could be used to contain it.

As noted above, the index entries refer to BLOB objects that are stored on disk.
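As a rough illustration of why an index entry cannot live only in memory: each entry records an object ID, and that ID names content kept in the object store. The sketch below assumes the default SHA-1 object format and a loose (unpacked) object; the helper names are hypothetical.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Compute the ID Git gives a blob, assuming the SHA-1 object format."""
    # A blob is hashed as the header "blob <size>\0" followed by the content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """Path of a loose object: two hex digits for the directory, the rest for the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


# The empty blob has a well-known ID.
print(loose_object_path(blob_id(b"")))
# .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Objects can later be moved into pack files, so the loose path is only where a freshly written blob would normally be found.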

And so, if my computer shuts down and I have an unresolved conflict, would the generated base commit be lost?

No.

But would it matter if it were? If somehow the automatically generated merge base were lost and you had to start over, git would just automatically generate it again.

[1] - Ok... sigh... not quite "any" index entry. There are "content-less" index entries for when you declare the intent to add a file. But that's not really related to all this and it just confuses matters. Only mentioning it to get ahead of the pedants and trolls.

Interestingly, git will _always_ produce an ephemeral commit using recursive merge, even if there are conflicts. If file A conflicts in the two bases, then the result in the ephemeral merge base will... have the conflict markers inside of it. How's that for terrifying? (Apologies for the pedantry.) — Edward Thomson, Nov 03 '20 at 17:50
Yikes... I thought that was not the case, but yes, terrifying. — Mark Adelsberger, Nov 03 '20 at 18:38
The intent-to-add index entries are actual regular cache entries (at stage zero), they just have the I-T-A flag bit set internally and the hash ID of the associated blob is the null hash. This has caused numerous bugs in Git over the years... Also, the index doesn't have tree objects in the index entries (though it can have cached trees!—they're just not entries, in that they don't show up in `git ls-files --stage` for instance). — torek, Nov 03 '20 at 19:42
@torek - Hm. Well, that may be, but git definitely produced a `TREE` when I was testing this; I don't recall if it showed up as dangling, so it may just be created for the heck of it...? — Mark Adelsberger, Nov 03 '20 at 19:49
That's what I mean about cached trees. They appear under an extension code (`T`, `R`, `E`, `E`) and hold a cached result that was used by an earlier `git write-tree` internal operation. But they aren't in the staging part. It's kind of weird. You'll get a tree object due to the recursive merge having done the inner commit. This inner commit is abandoned once the merge is done. — torek, Nov 03 '20 at 19:59

torek · Answer 2 · 2020-11-03T20:52:10.830

I dug into this code in git merge-recursive at one point out of curiosity. Here are the terrifying details ("terrifying" per Edward Thomson's comment).

First, git merge-recursive or git merge-resolve, which share most of their code, gathers up all the merge base commit hash IDs. This forms a simple list (stored as a linked list when I looked, though the important thing is that it's an ordered list). If the length of the list is 1, we have a single merge base and we're done. Otherwise we have multiple merge bases:

- If we're git merge-resolve, we pick one "at random" (whatever is most convenient, probably head of list) and use it and we're done.
- Otherwise, when we're git merge-recursive, we use the following algorithm:
  - while there are at least two entries in the list:
    1. Merge the first two merge bases, with a recursive call. This produces a merge result (in the index). If there are merge conflicts, the conflicts appear in the index.¹
    2. Forcibly shove all the results into index slot zero. Make a new commit from this tree.
    3. Replace the pair of commit hash IDs with the single commit hash ID from step 2.

At the end of this loop, there is only one hash ID in the list. This is the merge base.
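The loop can be sketched as follows. This is only a model of the list handling described above, not Git's code: `reduce_merge_bases` and `label_merge` are hypothetical names, and the real "merge a pair" step is a full recursive merge that writes a tree and a commit rather than building a string.

```python
def reduce_merge_bases(bases, merge_pair):
    """Fold an ordered list of merge bases down to a single virtual base.

    `merge_pair` stands in for steps 1 and 2: merge two bases and return
    an identifier for the resulting (virtual) commit.
    """
    bases = list(bases)
    merges = 0
    while len(bases) >= 2:
        first, second = bases[0], bases[1]
        virtual = merge_pair(first, second)  # steps 1 and 2
        bases[0:2] = [virtual]               # step 3: the pair becomes one entry
        merges += 1
    return bases[0], merges


def label_merge(a, b):
    """Placeholder merge that just records which inputs were combined."""
    return f"({a}+{b})"


print(reduce_merge_bases(["A", "B"], label_merge))
# ('(A+B)', 1)
print(reduce_merge_bases(["A", "B", "C", "D"], label_merge))
# ('(((A+B)+C)+D)', 3)
```

The second call shows the left-to-right grouping: each new virtual base is immediately merged with the next entry in the list.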

Note that this algorithm is linear in the number of merge bases, when we could use a logarithmic strategy: if there are, say, 16 merge bases, we could merge each pair to get 8 merge bases, then merge each pair to get 4, then merge each pair to get 2, then merge the pair. Instead, we merge 2 to get 1, leaving 15 to merge; merge 2 to get 1, leaving 14; and so on.

This is probably totally reasonable. Criss-cross merges get you a two-merge-base setup, but they're rare. How you would ever manage to set up a 16-merge-base situation, I am not sure. Edit: And, as Raymond Chen notes, we wouldn't save any of the primary work anyway.
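A quick way to check the point about not saving any work is to count top-level merges under both groupings. This sketch counts only the merges in the outer list; it deliberately ignores the possibility that a pair of bases might itself have several merge bases, so it is a simplification. Both function names are made up for the example.

```python
def linear_merges(n):
    """Merges needed when folding the list left to right."""
    count = 0
    while n >= 2:
        n -= 1       # two entries are replaced by one
        count += 1
    return count


def pairwise_merges(n):
    """Merges needed when merging in rounds of pairs (16 -> 8 -> 4 -> ...)."""
    count = 0
    while n >= 2:
        pairs = n // 2
        count += pairs
        n -= pairs   # each pair collapses to one; an odd entry carries over
    return count


print(linear_merges(16), pairwise_merges(16))  # 15 15
```

Either way, n merge bases take n - 1 merges; only the grouping (and therefore the depth of the chain) differs.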

¹This explains why the conflicted object shows up in the repository. Note that the merge.driver used for low level merges here is the defined driver when doing the outermost merge, and the recursive driver when doing any of the inner merges; that's how the recursive code always knows whether this is the outermost merge. If it is the outermost merge, there is no need to create a blob in the Git repository database.

The logarithmic strategy doesn't reduce the amount of merging. It just changes the grouping. (((A + B) + C) + D) and ((A+B) + (C+D)) both have the same number of + operations. In your example, you replaced 15 merges with 8+4+2+1=15 merges. — Raymond Chen, Nov 03 '20 at 20:04
@RaymondChen Hm, that's a good point. However, these aren't simple `+` operations: each merge base pair might itself have more than one merge base. Can we prove that the logarithmic strategy still does the same number of merges in these cases? What do we know about the merge bases of the intermediate merges? — torek, Nov 03 '20 at 20:08
Can you explain what the outermost merge is? Is it a merge of common ancestors, as opposed to the commits that the user is trying to merge? — Ilya Loskutov, Nov 04 '20 at 11:25
I'm still a bit confused. When Git merges two common ancestors (following the spelled out algorithm), it creates both a blob object and a tree object but not a *commit* object, doesn't it (step 2)? — Ilya Loskutov, Nov 04 '20 at 13:00
@Mergasov: yes, it makes a tree object. I don't remember if it makes a commit object too (a commit requires a tree so it definitely makes the tree). The outermost merges are the ones that are merging two merge bases. They only need to make a tree if they're going to go on and use that tree for a subsequent merge (i.e., the list will still have at least 2 items in it). The inner merges always need to make a tree so that the outer merges have a tree to use as a merge base. — torek, Nov 04 '20 at 16:37
