Which commits does git rebase omit?

Question

The git documentation says the following:

The commits that were previously saved into the temporary area are then reapplied to the current branch, one by one, in order. Note that any commits in HEAD which introduce the same textual changes as a commit in HEAD..upstream are omitted (i.e., a patch already accepted upstream with a different commit message or timestamp will be skipped).

Which seemed a bit confounding to me. Does this simply mean that any commit in the branch being rebased that doesn't change anything in the branch being rebased onto is omitted from the set of commits to be copied?

If so:

What if new_base is specified? Does this change the set of commits from HEAD..upstream to HEAD..new_base?
Why use a range? Why not just say "any commits in HEAD which introduce the same textual changes as a commit in upstream are omitted"?

score 3 · Answer 1 · answered Apr 17 '22 at 21:33

Besides jthill's answer, which gets into some of the details of git rev-list's trickiness with using --no-merges and git patch-id, I'll add the following notes:

--no-merges is suppressed if you use --rebase-merges.
With --fork-point—which is sometimes the default—the rebase will omit commits that would be listed in upstream..HEAD based on data contained in the reflogs for the upstream.

The latter has gone through multiple changes over the years. At first, fork-point mode was implemented only in the git pull script. Then it was moved to git rebase proper, with the implementation tweaked, and now you can run git merge-base --fork-point to locate the fork-point commit.
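As a rough illustration of that last command, here is a minimal sketch in a throwaway repository. The branch names main and topic, the commit messages, and the use of a local branch to play the role of a rewritten upstream are all assumptions made for the example, not anything taken from a real project.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main          # hypothetical branch name
git config user.name "Example"
git config user.email "example@example.invalid"

git commit -q --allow-empty -m "A"
git commit -q --allow-empty -m "B"
old_tip=$(git rev-parse HEAD)                  # main pointed here once

git checkout -q -b topic                       # topic forks from B
git commit -q --allow-empty -m "T"

git checkout -q main
git reset -q --hard HEAD~1                     # the "upstream" drops B ...
git commit -q --allow-empty -m "B2"            # ... and replaces it

plain=$(git merge-base main topic)             # ordinary merge base: A
fork=$(git merge-base --fork-point main topic) # found via main's reflog: B

echo "merge base: $(git log -1 --format=%s "$plain")"
echo "fork point: $(git log -1 --format=%s "$fork")"
```

The fork point differs from the plain merge base only because main's reflog still remembers that main once pointed at B. If that reflog entry is gone, the --fork-point lookup should have nothing to work from.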

Using the "fork point" is meant to help when dealing with an upstream rebase. That is, suppose that you have your clone of the Git repository over at origin. You've sent commits to whoever controls that repository. They may or may not have taken some of your commits. They may or may not have taken some other commits from other users. At some point, though, they also ran git rebase --interactive themselves and dropped some commits that they once had, and that you had already rebased onto.

Let's draw a sample situation. They started with this:

...--o--o--*   <-- main

You cloned the repository and created three commits of your own, which we'll call E-F-G for no obvious reason:

...--o--o--*   <-- main, origin/main
            \
             E--F--G   <-- feature-X

They picked up four new commits, none matching yours yet, so that when you ran git fetch you got:

             A--B--C--D   <-- origin/feature-X
            /
...--o--o--*   <-- origin/main
            \
             E--F--G   <-- feature-X

You then rebased your feature-X atop their feature-X (your origin/feature-X) to get:

                        E'-F'-G'  <-- feature-X
                       /
             A--B--C--D   <-- origin/feature-X
            /
...--o--o--*   <-- origin/main
            \
             E--F--G   [abandoned]

They then decide that commit C is bad, so they rewrite their feature-X to drop C and replace their D with a new commit D'. When you run git fetch, you get:

                  C--D--E'-F'-G'  <-- feature-X
                 /
             A--B--D'  <-- origin/feature-X
            /
...--o--o--*   <-- origin/main
            \
             E--F--G   [abandoned]

They then decide they like your commit G or G' (whichever it is) so much that they incorporate this into their feature-X, so that if you git fetch again you get:

                  C--D--E'-F'-G'  <-- feature-X
                 /
             A--B--D'-G"  <-- origin/feature-X
            /
...--o--o--*   <-- origin/main
            \
             E--F--G   [abandoned]

where their G" is their copy of your G or G': it introduces the same changes as your commit G' does to your commit F', but the line numbers don't match up.

Ideally you would like git rebase to somehow automatically determine that their commits C and D, which now look like they're your commits, are their commits and were dropped in favor of their D', and that their commit G" is "as good as" your G'. So you would want git rebase to produce this:

                  C--D--E'-F'-G'  [abandoned]
                 /
             A--B--D'-G"  <-- origin/feature-X
            /          \
...--o--o--*            E"-F"  <-- feature-X
            \
             E--F--G   [abandoned]

That is, you want your git rebase to:

not copy commit C, even though you have one and they don't;
not copy commit D, and
not copy commit G'.

The patch-ID tricks that git rebase uses might cope with D and G' here but would not correctly omit C. The fork-point code will correctly omit C, provided your origin/feature-X branch's reflog has the right information in it. That will generally be true as long as all of this activity has occurred within the last 90 days or so.
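The difference can be tried out with a much smaller version of the scenario above. This is only a sketch under stated assumptions: a local branch named upstream stands in for origin/feature-X, D2 stands in for D', each commit just adds one file, and the branch names mine and mine-plain are made up for the comparison.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/upstream      # stand-in for origin/feature-X
git config user.name "Example"
git config user.email "example@example.invalid"

# Hypothetical helper: one commit that adds one small file.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A; commit B; commit C
git checkout -q -b mine                        # your work, built on their C
commit E
git branch mine-plain mine                     # second copy for comparison

git checkout -q upstream
git reset -q --hard HEAD~1                     # they drop C ...
commit D2                                      # ... and add a replacement

git checkout -q mine
git rebase -q --fork-point upstream            # uses upstream's reflog
git checkout -q mine-plain
git rebase -q --no-fork-point upstream         # ignores the reflog

with_fp=$(git rev-list --count upstream..mine)
without=$(git rev-list --count upstream..mine-plain)
echo "copied with --fork-point:    $with_fp"
echo "copied with --no-fork-point: $without"
```

With the fork point, only E is copied. Without it, C looks like one of your own commits and is carried along, which is the case that patch IDs alone cannot catch.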

For more on the --fork-point option, see Git rebase - commit select in fork-point mode and (of course) the git rebase documentation.

I was also curious about what fork point does when I read the docs (I didn't get it). Thanks a lot for making it clear. So do you think it's good practice to always pass --fork-point in case the upstream gets changed? It doesn't seem to have any adverse effects when that's not the case, right? – Houidi mohamed amin, Apr 18 '22 at 12:47
@Houidimohamedamin: `--fork-point` is implied when you run `git rebase` with no arguments and thus use the branch's configured upstream. It's relatively rare for it to make any difference anyway, and I'm not convinced that Git's default (of using it when you don't call for it, except when you specify an upstream) is correct, so I think it's just a good idea to be aware that it exists and to check the results of any rebase (checking is *always* required because Git will sometimes mis-match on "noise" lines like close braces!). — torek, Apr 18 '22 at 13:47

score 2 · Accepted Answer · answered Apr 17 '22 at 18:10

Does this simply mean that any commit in the branch being rebased that doesn't change anything in the branch being rebased onto is omitted from the set of commits to be copied?

No. To get down to brass tacks, you can see what commits a plain git rebase regards as candidates with

git rev-list --reverse --no-merges @{upstream}..

and the commits it's checking against, to avoid reapplying already-applied commits, with

git rev-list --reverse --no-merges ..@{upstream}

The checking uses git patch-id. To see what git's looking at,

git rev-list --reverse --first-parent --no-merges @{upstream}.. \
| git diff-tree --patch --stdin \
| git patch-id

and

git rev-list --reverse --no-merges ..@{upstream} \
| git diff-tree --patch --stdin \
| git patch-id

except that this sequence lives in the internals of the git format-patch --right-only @{u}... that the rebase really runs to get its information.
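To see the patch-ID check at work, here is a small sketch in a scratch repository. The file names, commit messages and branch names (main, topic) are assumptions for the example. It uses git cherry, which reports the same patch-ID comparison with a "-" for commits that already have an equivalent upstream and a "+" for the rest.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main          # hypothetical branch name
git config user.name "Example"
git config user.email "example@example.invalid"

# Hypothetical helper: add one file, commit with the given message.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$2"; }

commit base "base"
git checkout -q -b topic
commit x "X: my patch"
commit y "Y: more work"
x_commit=$(git rev-parse topic~1)

git checkout -q main
commit z "Z: unrelated upstream work"
git cherry-pick -n "$x_commit"                 # same change as X ...
git commit -q -m "X as accepted upstream"      # ... under a different message

# Same textual change, so the patch IDs match although the commits differ.
mine=$(git diff-tree --patch "$x_commit" | git patch-id | cut -d' ' -f1)
theirs=$(git diff-tree --patch main | git patch-id | cut -d' ' -f1)

cherry=$(git cherry main topic)                # "-" = already upstream
printf '%s\n' "$cherry"

git checkout -q topic
git rebase -q main
kept=$(git log --format=%s main..topic)        # what the rebase copied
echo "copied: $kept"
```

Only Y is copied: X is omitted because a commit with the same patch ID exists on the upstream side of the range.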

jthill

So just to make sure I got this right: git takes all the upstream..head commits and, for each one of those, checks the diff it introduces (as compared to its parent) against all the diffs of the head..newbase commits, to see if they introduce the same changes. – Houidi mohamed amin Apr 17 '22 at 21:03
@Houidimohamedamin: effectively, yes. All of the patch-ID stuff is built into `git rev-list`; it doesn't strictly work off diffs, but rather off diffs-minus-certain-things, but it's close enough. – torek Apr 17 '22 at 21:12
@Houidimohamedamin yes, try generating those patch-id lists, sorting a combined list of id's and excluding duplicated entries is fast. – jthill Apr 17 '22 at 22:26
@torek Sorry for the late question, I couldn't wrap my head around it yesterday and formulate it properly. So if one of the commits being copied has the same changes as an older commit that is reachable from both head and new base, but is not among the ones being copied, then the former commit will not be omitted, because it's not being checked against the latter, right? If so, wouldn't it make more sense to check against all of newbase rather than head..newbase? – Houidi mohamed amin Apr 18 '22 at 11:39
That's correct: if the commit with that patch ID would not be listed by `git rev-list --left-right upstream...HEAD` before you start the `git rebase`, `git rebase` won't generate its patch ID and therefore won't see it as "already existing". The reason for this is that generating patch-IDs takes a noticeable amount of time. Rebasing 1 or 10 or 100 commits with a range that results in scanning 2 or 20 or 200 commits is tolerable, but rebasing 1 commit with a range that results in scanning, say, 253000 commits—and every rebase would scan *every* commit on the upstream—is not. – torek Apr 18 '22 at 11:44
Remember that the set of commits *reachable from* some named commit includes every commit from there on backwards, to the (or all reachable) root commit(s). Run `git rev-list upstream | wc -l` to see how many that is in your repository; and try the same in a Git clone of the Linux kernel for instance. – torek Apr 18 '22 at 11:45
Got it. "Remember that the set of commits reachable from some named commit includes every commit from there on backwards, to the (or all reachable) root commit(s)." Yeah, I was aware of that, I just thought it would be more thorough to check the entire thing, for instance in a case similar to the one I mentioned, but I now understand it doesn't make sense performance-wise. Thanks a lot for clearing that up, and also for helping me better understand jthill's response! – Houidi mohamed amin Apr 18 '22 at 12:44
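The size difference discussed in these comments is easy to check. The sketch below is a toy setup with assumed names (main, topic) and an arbitrary 20 commits of shared history. It compares how many commits fall in the symmetric-difference range that gets scanned for patch IDs with how many are reachable from the upstream tip.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/main          # hypothetical branch name
git config user.name "Example"
git config user.email "example@example.invalid"

# Shared history: 20 commits reachable from both branches.
i=0
while [ "$i" -lt 20 ]; do
    git commit -q --allow-empty -m "history $i"
    i=$((i + 1))
done

git checkout -q -b topic
git commit -q --allow-empty -m "my work"
git checkout -q main
git commit -q --allow-empty -m "their work"

range=$(git rev-list --count main...topic)     # only the two diverged commits
all=$(git rev-list --count main)               # everything back to the root
echo "commits in main...topic: $range"
echo "commits reachable from main: $all"
```

In a real repository the second number grows with the whole project history, while the first stays proportional to how far the two branches have diverged.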

score 1 · Answer 3 · answered Apr 17 '22 at 17:22

Which seemed a bit confounding to me. Does this simply mean that any commit in the branch being rebased that doesn't change anything in the branch being rebased onto is omitted from the set of commits to be copied?

Yes, mainly because in a normal situation, Git would never record an empty commit unless explicitly told to. What you would get instead here is the message "nothing to commit, working tree clean" from git status.

What if new_base is specified? Does this change the set of commits from HEAD..upstream to HEAD..new_base?

I believe this is what they meant by "upstream" indeed.

Why use a range? Why not just say "any commits in HEAD which introduce the same textual changes as a commit in upstream are omitted"?

Because that's the way Git actually distinguishes two branches. The bottom commit of this range is where your two branches started to diverge.

"I believe this is what they meant by "upstream" indeed." In the docs these are two parameters, and if new_base is not specified with --onto then upstream is used as the base. Usually they include "(or new_base)" after every mention of upstream where new_base could be used instead. That is what confused me, but I've also tried it out and can confirm that when new_base is specified the set of commits becomes HEAD..new_base – Houidi mohamed amin, Apr 17 '22 at 17:31