7

I'm trying to better understand the magic behind git-rebase. I was very pleasantly surprised today by the following behavior, which I didn't expect.

TLDR: I rebased a shared branch, causing all commit sha1s to change. Despite this, a derived branch was able to accurately identify that its original commits were "aliased" into new commits with different sha1s. The rebase didn't create any mess at all.

Details

Take a master branch: M1

Branch it off into branch-X, with some additional commits added: M1-A1-B1-C1. Note down the git-log output.

Branch off branch-X into branch-Y, with one additional commit added: M1-A1-B1-C1-D1. Note down the git-log output.

Add a new commit to the tip of the master branch: M1-M2

Rebase branch-X onto the updated master: M1-M2-A2-B2-C2. Note that A2-B2-C2 all have the same message, contents and author-date as A1-B1-C1. However, they have completely different sha1 values, as well as different commit dates. According to this writeup, the SHA1 is different because the commit's parent has changed.

Rebase branch-Y onto the updated branch-X. Result: M1-M2-A2-B2-C2-D2.

Notably only the D1 commit is applied (and becomes D2). The A1-B1-C1 commits in branch-Y are completely ignored by git-rebase. You can see this in the output logs.
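
The steps above can be reproduced in a throwaway repository. This is only a sketch of the scenario: the `mk` helper, the `demo` identity and the one-file-per-commit layout are made up for illustration, and the branch names follow the question.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master   # use "master" whatever init.defaultBranch says
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: each commit adds one file named after its message.
mk() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

mk M1
git checkout -q -b branch-X && mk A && mk B && mk C   # M1-A1-B1-C1
git checkout -q -b branch-Y && mk D                   # M1-A1-B1-C1-D1
old_a=$(git rev-parse branch-X~2)                     # sha1 of A1
git checkout -q master && mk M2                       # M1-M2

git checkout -q branch-X && git rebase -q master      # M1-M2-A2-B2-C2
new_a=$(git rev-parse branch-X~2)                     # sha1 of A2
git checkout -q branch-Y && git rebase -q branch-X    # only D is replayed

subjects=$(git log --format=%s branch-Y | paste -sd' ' -)
echo "A1: $old_a"
echo "A2: $new_a"
echo "branch-Y: $subjects"   # D C B A M2 M1
```

The two sha1s printed for A differ, yet branch-Y ends up with a single copy of A, B and C.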

This is wonderful, but how does git-rebase know to ignore A1-B1-C1? How does git-rebase know that A2-B2-C2 are the same as A1-B1-C1, and hence, can be safely ignored? I had always assumed that git keeps track of commits using the sha1 identifier, but despite the above commits having different sha1s, git still somehow knows that they are linked together. How does it do that? Given the above behavior, when is it truly dangerous to rebase a shared branch?

RvPr

3 Answers

13

Internally, git rebase lists the commits that should be rebased, and then computes a patch-id for each of them. Unlike the commit ID, the patch-id hashes only the content of the patch, not the content of the tree and commit objects. So A1 and A2, while having different commit IDs, have the same patch-id. git rebase then skips commits whose patch-id is already present upstream.
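
To see the difference between the two identifiers, here is a small sketch that rebases one commit and compares both hashes before and after. The repository layout, file names and `demo` identity are assumptions for illustration; `git patch-id` reads a patch on stdin and prints the patch-id followed by the commit ID.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo base > base.txt && git add base.txt && git commit -q -m M1
git checkout -q -b topic
echo feature > feature.txt && git add feature.txt && git commit -q -m A
a1=$(git rev-parse HEAD)                 # commit ID before the rebase

git checkout -q master
echo more > more.txt && git add more.txt && git commit -q -m M2
git checkout -q topic && git rebase -q master
a2=$(git rev-parse HEAD)                 # commit ID after the rebase

# Keep only the first field (the patch-id) of each output line.
pid1=$(git show "$a1" | git patch-id --stable | cut -d' ' -f1)
pid2=$(git show "$a2" | git patch-id --stable | cut -d' ' -f1)

echo "commit IDs: $a1 / $a2"             # different
echo "patch-ids:  $pid1 / $pid2"         # identical
```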

For more information, search for "patch-id" here: https://git-scm.com/book/en/v2/Git-Branching-Rebasing


Relevant section from above (diagrams missing):

If someone on your team force pushes changes that overwrite work that you’ve based work on, your challenge is to figure out what is yours and what they’ve rewritten.

It turns out that in addition to the commit SHA-1 checksum, Git also calculates a checksum that is based just on the patch introduced with the commit. This is called a “patch-id”.

If you pull down work that was rewritten and rebase it on top of the new commits from your partner, Git can often successfully figure out what is uniquely yours and apply them back on top of the new branch.

For instance, in the previous scenario, if instead of doing a merge when we’re at the point shown in the figure “Someone pushes rebased commits, abandoning commits you’ve based your work on”, we run git rebase teamone/master, Git will:

  • Determine what work is unique to our branch (C2, C3, C4, C6, C7)
  • Determine which are not merge commits (C2, C3, C4)
  • Determine which have not been rewritten into the target branch (just C2 and C3, since C4 is the same patch as C4')
  • Apply those commits to the top of teamone/master

This only works if C4 and C4' that your partner made are almost exactly the same patch. Otherwise the rebase won’t be able to tell that it’s a duplicate and will add another C4-like patch (which will probably fail to apply cleanly, since the changes would already be at least somewhat there).

RvPr
Matthieu Moy
5

There are in fact several different methods that git rebase uses to eliminate redundant copies.

Patch-ID

The first, and safest, one is via the same method that git cherry uses to identify cherry-picked commits. If you read the linked documentation, though, the only clue as to how this works is at the end, where the manual page links to the git patch-id documentation.

Reading this second manual page will give you a good idea of how "commit equivalence" gets established: Git simply computes a git patch-id on the output from, e.g., git show of any ordinary (non-merge) commit. Really, it runs git diff-tree rather than the user-oriented git show, but the effect is about the same.
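
As a rough illustration of the git cherry check, the sketch below rebuilds the question's history up to the point where branch-X has been rebased but branch-Y has not, then asks git cherry which branch-Y commits already have an equivalent in branch-X. The `mk` helper and `demo` identity are hypothetical; in git cherry output, a leading `-` marks a commit with an equivalent upstream and `+` marks one without.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: each commit adds one file named after its message.
mk() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

mk M1
git checkout -q -b branch-X && mk A && mk B && mk C
git checkout -q -b branch-Y && mk D
git checkout -q master && mk M2
git checkout -q branch-X && git rebase -q master   # branch-Y still holds A1-B1-C1-D1

# Compare branch-Y (head) against branch-X (upstream) by patch-id.
report=$(git cherry -v branch-X branch-Y)
echo "$report"
```

Three lines should start with `-` (A, B and C) and one with `+` (D), which matches what the later rebase of branch-Y replays.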

But there's still something missing, and it's very poorly documented in both the git rebase and git cherry manual pages. It's documented somewhat better in git rev-list, which is a rather daunting manual page. There are two keys: the notion of symmetric difference, using the three-dot syntax described in the gitrevisions documentation, and the --left-right and --cherry-mark options to git rev-list.

Once you understand how we take a DAGlet such as:

...--o--o--L1--L2--L3   <-- left
         \
          R1--R2--R3   <-- right

and use left...right to select the three L and R commits, the --left-right option itself makes lots of sense: it marks which commits in the text output are from the left side of the three dots, and which are right-side commits.

The second step here is discovering that git rev-list can compute the patch ID for each commit on each "side". Git can then compare all the left-side patch IDs with all the right-side patch-IDs. The --cherry-mark option, and its related options, use these to mark equivalent or inequivalent commits, or to omit equivalent commits.
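
Here is a small sketch of those marks in action. It uses git log rather than git rev-list only because git log accepts the same commit-selection options and can print the mark with `%m`; the branch names follow the diagram, while the commit names, the `mk` helper and the cherry-picked "shared" commit are assumptions added for illustration.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: each commit adds one file named after its message.
mk() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

mk base
git checkout -q -b left && mk L1 && mk shared
git checkout -q -b right master && mk R1
git cherry-pick left > /dev/null     # copy "shared" onto right: new sha1, same patch

# %m prints the mark: "<" left-only, ">" right-only, "=" patch-equivalent.
marks=$(git log --left-right --cherry-mark --format='%m %s' left...right | LC_ALL=C sort)
echo "$marks"
# < L1
# = shared
# = shared
# > R1
```

Both copies of "shared" get the `=` mark, one from each side of the three dots.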

The final piece to this particular puzzle is that git rebase does not, as the documentation claims, use <upstream>..HEAD. Instead, it uses the equivalent of git rev-list --cherry-pick --right-only --no-merges <upstream>...HEAD to get the set of commits to copy. (To these options we must also add --topo-order and --reverse.)
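
To compare the two selections on the question's history, the sketch below stops just before branch-Y is rebased and lists what each revision expression picks, with branch-X as the upstream. Again git log stands in for git rev-list so that subjects are printed, and the `mk` helper and `demo` identity are made up for the example.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: each commit adds one file named after its message.
mk() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

mk M1
git checkout -q -b branch-X && mk A && mk B && mk C
git checkout -q -b branch-Y && mk D
git checkout -q master && mk M2
git checkout -q branch-X && git rebase -q master   # branch-Y not rebased yet

# What the documentation describes: everything reachable from branch-Y but not branch-X.
plain=$(git log --reverse --format=%s branch-X..branch-Y | paste -sd' ' -)

# What the answer says rebase effectively uses: drop patch-equivalent commits.
picked=$(git log --reverse --topo-order --format=%s \
    --cherry-pick --right-only --no-merges branch-X...branch-Y | paste -sd' ' -)

echo "two-dot:     $plain"    # A B C D
echo "cherry-pick: $picked"   # D
```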

Fork-point

The second method git rebase uses to elide commits is the --fork-point mechanism now built into git merge-base. This mechanism is particularly tricky to describe, and furthermore, relies on reflog entries to know about commits that were on a branch in the past, but are no longer. It also gives an undesirable result sometimes, and is not useful in this particular kind of rebase.
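
For readers who want to see what the fork-point computation returns, here is a minimal sketch in a different scenario from the question: an upstream branch whose tip is rewritten after a topic branch was forked from it. The branch and commit names, the `mk` helper and the `demo` identity are all hypothetical.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Hypothetical helper: each commit adds one file named after its message.
mk() { echo "$1" > "$1.txt" && git add "$1.txt" && git commit -q -m "$1"; }

mk M1 && m1=$(git rev-parse HEAD)
mk M2 && m2=$(git rev-parse HEAD)
git checkout -q -b topic && mk T1      # topic forks from M2

git checkout -q master
git reset -q --hard HEAD~1             # master drops M2...
mk M2-rewritten                        # ...and gets a replacement commit

plain=$(git merge-base master topic)               # ordinary merge base
fork=$(git merge-base --fork-point master topic)   # consults master's reflog

echo "merge base: $plain"   # M1
echo "fork point: $fork"    # M2, which master's reflog still remembers
```

With the fork point at M2, a rebase that uses it would replay only T1 and leave the discarded M2 behind.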

I mainly mention it here because someone looking for reasons that git rebase left out some commit(s) might have come across a case where the fork-point mechanism has misfired. See, e.g.:

torek
1

The branch-Y commits are empty upon the second rebase

There is really no magic hidden inside. Rebase searches for the common history and ignores it (only commit M1 in this case). It then detaches the remaining history from the rebased branch (Y) and tries to pick it onto the new base (branch-X).

The picking method derives a patch from the picked commit and its parent. As it is empty for A1, B1 and C1, it simply skips these commits. Only D1 is then picked, and therefore a D2 is created (with a new SHA, because the parent link in the commit header changes, as correctly stated in the question).

petrpulc
  • 1
    BTW, the rebasing of shared branch is unpleasant because of the force push you need to do afterwards. History diverges for remote and local of other users and their new work might get lost if they are not careful enough. So, never force push on shared branch :) – petrpulc Aug 23 '17 at 19:29
  • 1
    essentially this is saying: git saw that the changes from `A1` were already on the updated branch-X (in the form of `A2`), so it just threw `A1` away as redundant – Eevee Aug 23 '17 at 19:31
  • @petrpulc I'm afraid I don't understand what you mean by *"branch-Y commits are empty"* and *"picking method derives a patch ... is empty for A1, B1 and C1"*. The commits A1, B1 and C1 are themselves not empty, so can you clarify why you say they are "empty" and skipped? – RvPr Aug 23 '17 at 19:37
  • @Eevee You're suggesting that because the changes introduced by A1, already match the latest state of branch-X, the commit is discarded as a result? I suppose that might explain the behavior I saw, though I'd have to double check. If that hypothesis were true, then we should see many conflicts and superfluous commits if A1 and B1 were to both touch the same lines of code? – RvPr Aug 23 '17 at 19:40
  • @Eevee yes, the internal method used is really just a patch. And if patch does nothing, no commit needs to be added. Well, because everything is there already. – petrpulc Aug 23 '17 at 19:45
  • @RvPr Well, rebase gets a difference between M1 and A1 and tries to apply resulting patch to branch-X. The patch executable does nothing because the patch seems like already applied. The same is for diff between A1 and B1, resp. B1 and C1. Only patch between C1 and D1 results in change and therefore a commit. – petrpulc Aug 23 '17 at 19:49
  • @petrpulc Thanks for the clarifications. I wasn't familiar with how patches were generated/used in rebase. Once I understood that, the behavior in my question now makes sense. – RvPr Aug 23 '17 at 20:31
  • 1
    While the commits would *become* "empty" (and then be skipped unless you added `--keep-empty`), `git rebase` in fact peels them off its list before it even starts. The precise mechanism is different for non-interactive and interactive rebase (non-interactive defaults to using `git format-patch` instead of `git cherry-pick`) but both kinds of rebase use fancy `git rev-list` commands to omit patch-equivalent commits. (If you *do* add `--keep-empty`, the non-interactive rebase is forced to use cherry-pick because format-patch won't produce an empty patch!) – torek Aug 23 '17 at 20:37