A commit may have many files changed, and the merge base for each file shouldn't be the same, so why don't two commits have multi-common ancestors when merging? In other words, why is the concept of the so-called "common ancestor" or "merge base" defined at the commit level rather than at the file/revision level?

tristone
  • Git does not track files. A revision is the commit. Git does not technically have the concept of file history. Only commit history. Tools using git may try to derive history of a single file by parsing the history of commits but at a fundamental level git only tracks commits (change **sets** - not individual changes) – slebetman Jul 29 '22 at 06:30
  • @slebetman https://stackoverflow.com/questions/11792538/in-git-what-is-the-difference-between-a-commits-and-a-revisions Thank you! But a revision is not equal to a commit. – tristone Jul 29 '22 at 06:34
  • You say not but the answer you link to says that the revision is the commit with a few minor exceptions. In the example with the path specifier (`:`) the porcelain layer (high level git commands -- the `git` command you type on the console) merely parses the commit history to find the specific file at a specific commit. At the low level (the plumbing layer) git does not track that individual file but instead the diff of the commit. – slebetman Jul 29 '22 at 06:40
  • .. Of course you can parse the diff for a specific file but the file itself does not have an id - the id is the id of the commit (eg: 123456789:src/index.js the 123456789 id refers to the commit) – slebetman Jul 29 '22 at 06:40
  • Two commits can have many common ancestors, a merge base is the best choice for these common ancestors. From the file's level, two files' "common ancestor"s are the subset of the two commits' "common ancestor"s, so the two files' "merge base" must also belong to the two commits' "common ancestor"s. – tristone Jul 29 '22 at 06:49
  • @slebetman So what does "apply change" mean? If commit c changes file f from version B to C, currently you are on the commit with file version A, if you apply commit c to this commit, what will happen? Git just knows the diff between B and C, how does it know how to apply this change on the base on A? – tristone Jul 29 '22 at 06:53

1 Answer

A commit may have many files changed,

Well, yes; but also no.

and the merge base for each file shouldn't be the same,

That's a matter of judgment (or judgement; it's a matter of judg[e]ment to choose the spelling of "judg[e]ment").

so why don't two commits have multi-common ancestors when merging?

To a large extent the answer is just "because Git defined it that way".

At this point, let me expand on the no-and-yes part above: no commit contains changes. Instead, each commit contains two things:

  • a full snapshot of every file, in its state as of that commit; and
  • some metadata, by which Git obtains things like the author and log message and—important for forming one of the things humans call "branches"—the history.

It's the metadata, and most specifically the stored hash ID(s) of parent commit(s), that implies "changes". If commit b789abc... has commit a123456... as its parent, we will have Git extract both commits (two snapshots) and then compare them to find changes. Neither commit actually contains any changes at all: the difference between these two commits is located dynamically.

It's useful to note here that the snapshot in any given commit is stored (via the metadata in fact) in a special form in which each file's content is separated from the file's name and is compressed and de-duplicated, so that Git can tell if some file in some commit is an exact duplicate of some other file in any commit (the same commit, or any other commit) very quickly and easily, without actually reading the file at all. So "compare two commits" devolves quickly into "compare differing files in the two commits": skipping over the identical files is nearly free. (There's a lot of gory detail hiding under the word nearly here, though.)
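
The de-duplication idea can be sketched in a few lines of Python. This is only an illustrative model, not Git's implementation: `blob_id` mirrors how Git names a blob (SHA-1 over a `blob <size>\0` header plus the content, assuming the default SHA-1 object format), while `changed_paths` and the two snapshot dictionaries are hypothetical names made up for the example.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Hash content the way Git names a blob (SHA-1 object format assumed)."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def changed_paths(old: dict[str, str], new: dict[str, str]) -> list[str]:
    """Return the paths whose blob IDs differ between two snapshots.

    Identical files are skipped by comparing IDs only; their content is
    never read.
    """
    return sorted(
        path
        for path in old.keys() | new.keys()
        if old.get(path) != new.get(path)
    )


# Hypothetical snapshots: path -> blob ID, roughly what a commit's tree records.
parent = {"README": blob_id(b"hello\n"), "main.c": blob_id(b"int main;\n")}
child = {"README": blob_id(b"hello\n"), "main.c": blob_id(b"int main(void);\n")}

print(changed_paths(parent, child))  # only main.c differs
```

Because the ID depends only on content, the unchanged README has the same ID in both snapshots and is skipped without being opened.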

In other words, why is the concept of the so-called "common ancestor" or "merge base" defined at the commit level rather than at the file/revision level?

In fact, when we use git format-patch and git am, it is (sometimes¹) at the file-specific level, via the index lines (the before and after blob hash IDs) shown in git diff output. The result is normally the same, however. Let's use an example to see why:

          I--J   <-- br1
         /
...--G--H
         \
          K--L   <-- br2

Here we have two branch names, br1 and br2, each of which selects a branch-tip commit (J and L respectively), each of which has a full snapshot of every file. Meanwhile each tip commit has a parent commit (I and K respectively), each of which has a full snapshot of every file, and those two commits have a common parent H, which also has a full snapshot of every file.

Let's say that we have five files with the same names in our five commits:

  • unchanged is identical in H, I, J, K, and L;
  • f1a is identical in H and I and K and L but is different in J;
  • f1b is identical in H and K and L, but different in I then unchanged from I in J;
  • f2a is identical in H and I and J but different in K and then changed yet again in L; and
  • f2b is identical in H and I and J, but different in K and different yet again in L.

What would you choose for merge bases here? Git chooses the copy in H in all cases. Git then diffs that copy (in H) against the copies in both J and L, in two separate diffs, the equivalent of running:

git diff --find-renames <hash-of-H> <hash-of-J>   # changes in br1
git diff --find-renames <hash-of-H> <hash-of-L>   # changes in br2

This --find-renames step is important as well, since it can discover that one branch, or even both branches, renamed some files.
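
As a rough sketch of the simplest case, here is how an exact rename could be spotted from two snapshots alone. This is an assumption-laden toy: the function name `exact_renames` and the `id-N` placeholder IDs are hypothetical, and it only pairs files whose content is byte-for-byte identical, whereas Git's real rename detector also pairs files that are merely similar.

```python
def exact_renames(base: dict[str, str], tip: dict[str, str]) -> dict[str, str]:
    """Map old path -> new path for files whose content moved unchanged.

    Snapshots are path -> content ID. A path that vanished from `base`
    is paired with a newly added path in `tip` carrying the same ID.
    """
    deleted = {p: i for p, i in base.items() if p not in tip}
    added = {p: i for p, i in tip.items() if p not in base}
    renames: dict[str, str] = {}
    for old_path, old_id in sorted(deleted.items()):
        for new_path, new_id in sorted(added.items()):
            # Pair each added path at most once.
            if new_id == old_id and new_path not in renames.values():
                renames[old_path] = new_path
                break
    return renames


# Hypothetical snapshots of the merge base and one branch tip.
base = {"README": "id-1", "src/util.py": "id-2"}
tip = {"README": "id-1", "lib/util.py": "id-2", "NEWS": "id-3"}

print(exact_renames(base, tip))  # src/util.py is paired with lib/util.py
```

Note that the pairing needs the whole tree from both commits, which is one reason the merge base is a commit rather than a per-file notion.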

Note that Git skips right over commits I and K: only commits H and J and L are interesting here. Git would sometimes get different results if it did commit-by-commit comparisons. Which way is "right"? We're back to a matter of judgement (I like it better with the extra e myself).

In any case, by doing it this way, Git compares the original copy of each file against each branch-tip copy. But since the copies are literally de-duplicated, any commit that contains any identical content for that file would serve. So if a file changed only in J, like f1a, any of the pre-J commit copies will produce the same diff. If a file changed in I and then J, like f1b, any of the pre-I versions will produce the same diff, and that's the one we want used because we want to carry the change in the merge, and the change is "as seen from H". So for both f1a and f1b, using the copy in H is fine.

Similarly, when we're working on the br2 branch changes, we really want to see what's happened since the files were actually shared. They were definitely shared in H. They might have been shared even before then, but so what? They might not be shared in K and L, as is the case with f2b, but here we want the copy that's in H, which is what we get.

We'll get the "right" results if we go even further back than H, and sometimes get the right results if we don't go as far back as H, but H always works to get "changes since the files were shared" because the files were definitely shared at H. The only thing we miss, maybe, is multiple changes along the way between H and one or both tip commits. In some cases, we might care about that, but Git's merge is simply defined as "not caring about that".
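
To make the five-file example concrete, here is a minimal whole-file model of the three-way merge, assuming each file is represented by an opaque version label. The function `merge_snapshots` and the `v-X` labels are hypothetical; real Git also merges line by line inside a file when both sides changed it, which this sketch simply reports as a conflict.

```python
def merge_snapshots(base, ours, theirs):
    """Three-way merge of whole files: returns (merged snapshot, conflicts)."""
    merged, conflicts = {}, []
    for path in sorted(base.keys() | ours.keys() | theirs.keys()):
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:        # both sides agree (including "nobody changed it")
            result = o
        elif o == b:      # only their side changed the file
            result = t
        elif t == b:      # only our side changed the file
            result = o
        else:             # both changed it differently
            conflicts.append(path)
            continue
        if result is not None:  # None means the file was deleted
            merged[path] = result
    return merged, conflicts


names = ("unchanged", "f1a", "f1b", "f2a", "f2b")
H = {name: "v-H" for name in names}   # merge base snapshot
J = dict(H, f1a="v-J", f1b="v-I")     # tip of br1: f1b last changed in I
L = dict(H, f2a="v-L", f2b="v-L")     # tip of br2: both changed again in L

merged, conflicts = merge_snapshots(H, J, L)
print(merged, conflicts)
```

Only H, J, and L are consulted; the snapshots in I and K never enter the computation, which matches the "changes since the files were shared" view.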


¹ Again, there is a lot of gory detail here. When git apply is using -3 or --3way, in older versions of Git it tries a plain patch first, then falls back on the index lines, and in newer versions of Git it uses the index lines first. Again, the result is pretty typically the same, but treating a three-way merge as a simple textual patch, the way the patch command does, rather than applying the full three-way merge logic, doesn't always yield the desired result (though now we get into judg[e]ment calls about "desired result", and I'll just stop here...).


Sometimes there is more than one merge base anyway

It's worth mentioning here that, although Git finds the merge base via commit ancestry (not file ancestry), Git does store a directed acyclic graph or DAG of commits. When Git searches for a common ancestor, then, it cannot use the simple Lowest Common Ancestor algorithm that works for trees. It must instead use an extended variant for DAGs. But the extended variant may find more than one LCA. The classic example occurs with what people call a criss-cross merge, like this:

...--o--●---M--R   <-- br1
         \ /
          X
         / \
...--o--●---N--S   <-- br2

Here, commit M is a merge commit joining the two solid bullet commits on br1, and commit N is a merge commit joining the same two solid bullet commits on br2. If these two merges are made the same way, they contain the same set of files: the same snapshot. So that part is fine, under that condition. However, when Git goes to locate the merge base of the two now-tip-most commits R and S,² the LCA algorithm will find both of the solid-bullet commits as merge bases.
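
The "more than one LCA" outcome can be reproduced with a brute-force sketch over a parent map. This is not how Git computes merge bases internally; it just applies the definition (a common ancestor that is not an ancestor of any other common ancestor). The labels `P1`, `P2`, `o1`, `o2`, and `root` are hypothetical names for the two solid-bullet commits and the history before them.

```python
def ancestors(parents, commit):
    """Return the commit plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(parents.get(c, ()))
    return seen


def merge_bases(parents, a, b):
    """Common ancestors of a and b that no other common ancestor descends from."""
    common = ancestors(parents, a) & ancestors(parents, b)
    return sorted(
        c for c in common
        if not any(c != d and c in ancestors(parents, d) for d in common)
    )


# The simple fork from earlier: a single merge base.
simple = {"G": [], "H": ["G"], "I": ["H"], "J": ["I"], "K": ["H"], "L": ["K"]}
print(merge_bases(simple, "J", "L"))

# The criss-cross: M and N each merge the same two commits P1 and P2.
criss_cross = {
    "root": [], "o1": ["root"], "o2": ["root"],
    "P1": ["o1"], "P2": ["o2"],
    "M": ["P1", "P2"], "N": ["P2", "P1"],
    "R": ["M"], "S": ["N"],
}
print(merge_bases(criss_cross, "R", "S"))
```

In the second graph neither P1 nor P2 is an ancestor of the other, so both survive the filter.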

What Git does to handle this by default (with the "recursive" or "ort" merge strategies) is to merge the merge bases. Ideally this produces the same snapshot that we find in either M or N and then all is fine, but if it doesn't, things get ... interesting. In particular, suppose there were merge conflicts that had to be resolved by hand, and the resolutions are different in M and N. Then the new merge base contains the merge conflicts (committed complete with merge conflict markers (!) in the "virtual merge" that Git makes as its temporary merge base commit), the two different resolutions in M and N cause further conflicts, and the result looks really weird.

To see what commit(s) Git picks as merge bases, we have git merge-base, but unfortunately it prints only one merge base by default: we must run it with --all to show the complete list of candidates. As multiple merge bases are not common, this is not usually a big problem, but it can be a problem and one should be aware of it.


² I skipped O as it looks too much like o, leaving P and Q as the next two letters, but Q looks too much like O, so I moved on to R and S for their shape-distinctiveness.

torek
  • Very nice illustration! So can I say that the common ancestor at the file level is always equal to that at the commit level? Then it is unnecessary to differentiate them because they are always the same. – tristone Jul 30 '22 at 12:22
  • Essentially, yes. "File ancestry" (however you want to define it) might "smear across" commit ancestry, but the commit ancestry pinpoints it. We also need to use the rename detector over the entire file tree, and the merge base commit is the place to start that. – torek Jul 30 '22 at 12:33