I've tried web search and found Rebasing a Git merge commit, where it is written:

By default, a rebase will simply drop merge commits from the todo list, and put the rebased commits into a single, linear branch.

So now I understand why, on git rebase -i HEAD~9, I did not see my merge commit in the list of commits to edit/pick in the interactive rebase. However, I saw two additional older commits in the list (10 commits total, with the merge absent). Why? Is that OK? I did the rebase and now wonder whether I should redo all the recent commits. My original goal was to rebase, after the merge, in order to remove unneeded local commits.

Last commit was merge, before that:

git status
Your branch and 'origin/devel' have diverged,
and have 6 and 2 different commits each, respectively.
  (use "git pull" to merge the remote branch into yours)

So I guess those two more is because there were 2 commits from origin that were merged. Am I correct here?

Martian2020

1 Answer

No commit, once made, can ever be changed. Hence the job that git rebase performs is to copy some existing commits (that you like to some extent, but dislike something about those commits) to new and improved—well, you hope improved, anyway—commits. Git then sets up the branch name—branch names being how you tell Git to find commits—so that it finds the new and improved commits instead of the originals.

The main concerns when doing this copy-and-replace operation lie in determining which commits to copy. As you've seen, you're having Git copy both your commits and the merged-in commits, and as you saw in the documentation, the default action during this sort of copy-and-switch-over is to drop merge commits entirely (due to them being unnecessary, which in turn is due to the way the copies are performed).

The Q&A you linked is, however, quite old, going back some 11 years at this point. More recently (in Git 2.18), Git learned a new git rebase trick, which is to re-perform merges. The rebase code spells this --rebase-merges, or -r for short. You can use git rebase -r now to do some things you could not do with git rebase in the past.

So I guess those two more is because there were 2 commits from origin that were merged. Am I correct here?

More or less, yes. The main tricky part here is how we specify which commits Git should copy and which commits Git should not copy.

But, as the phrases main concerns and main tricky part imply, there are additional tricky parts to rebasing. To understand what you're doing, you must know how commits work in Git. It's a good idea to understand this in general, not just for rebase work. Git is, ultimately, all about commits. It is not about files, although commits contain files and humans care a lot about file contents. Git is not about branches either, although branch names are important when it comes to finding commits. But Git itself exists because of commits, and to work with commits. So you need to know what a commit is and does for you.

Every Git commit is numbered: each one gets a unique number, its hash ID, that looks (but isn't) random. This number is guaranteed to be unique [1], not just in this repository, but in every Git repository, past, present, and future. That means that if you write down the hash ID of some commit you make, and inspect any old Git repository any time and find that it has a commit with this hash ID, you know immediately that the repository you're looking at has your commit.

This hash ID scheme is why no commit can ever be changed. If it could, we could break Git very easily by making different commits use the same hash ID [2]. In any case, since we expect to improve our commits when we rebase them, we won't be able to keep the originals: we have to make at least some new copies that have new and different hash IDs.
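To see this for yourself, here is a minimal sketch you can run in a scratch directory. The demo identity, the file name, and the commit messages are made up for illustration; the point is only that "changing" a commit with --amend produces a new hash ID while the original commit stays in the repository.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"          # throwaway repository
git init -q
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo hello > file.txt
git add file.txt
git commit -q -m "first"
old=$(git rev-parse HEAD)

# "Changing" the commit really makes a new commit with a new hash ID.
git commit -q --amend -m "first, reworded"
new=$(git rev-parse HEAD)

echo "old: $old"
echo "new: $new"
git cat-file -t "$old"           # the original object is still there
git log -1 --format=%s "$old"    # and it still carries its original message
```

The two hash IDs printed will differ, and they will differ from anything you see on another machine, since the time stamps go into the hash as well.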

Meanwhile, each commit stores two things:

  • A commit has a full snapshot of every file that Git knew about at the time you, or whoever, made the commit. These files are stored in a special, read-only (hashed!), Git-only format, compressed and—important to keep the repository size under control, among other reasons—de-duplicated against all other files in every commit in the repository.

  • A commit stores metadata: information about the commit itself, such as who made it and when. Here you'll find your name and email address, which Git has copied out of your user.name and user.email settings. You will find a date-and-time stamp. You may store a commit log message telling everyone why you made the commit [3].

Now, in that metadata in each commit, Git adds something for itself. In each commit, Git stores a list of raw hash IDs of previous commits. Git calls these the parents of the commit. Most commits have exactly one parent: the previous commit, i.e., the one commit that comes right before this commit.

This parent hash ID, stored in each commit, is how Git is able to work backwards through time, to see what you and others have done. Let's imagine we have some commit with some hash ID H, that's the latest commit that you just made (on some branch). Commit H contains a snapshot and metadata, and that metadata contains the raw hash ID of some earlier commit. We say that H points to its parent. Let's call the parent G:

        <-G <-H

The arrow coming out of H, pointing backwards to G, is the parent of H. But G is a commit too, so it too has a snapshot and a parent—a backwards arrow—pointing to some earlier commit F:

... <-F <-G <-H

and F has an arrow pointing back to some still-earlier commit, and so on.
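The backwards-pointing chain can be inspected directly. This is a sketch in a scratch repository; the commit() helper, the file names, and the demo identity are assumptions made up for the example, while git cat-file -p and the %p / %P log placeholders are ordinary Git.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main    # name the first branch "main"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# helper (made up for this sketch): one new file per commit
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit F
commit G
G=$(git rev-parse HEAD)
commit H

# The raw commit object for H: a tree (the snapshot), a parent line,
# author and committer lines, then the log message.
git cat-file -p HEAD

# Git walks backwards from H to G to F by following the parent hash IDs.
git log --format='%h  parent: %p  %s'
```

The first commit, F, has no parent, so its parent column comes out empty.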


[1] Git makes this promise, but the pigeonhole principle tells us that Git will fail someday. The size of the hash IDs helps push that day far enough into the future that we will all be dead and not care.

[2] The breakage would be somewhat limited, but in general it would be a bad scene.

[3] You should always log why you made the commit. Don't just say what you did: Git can compare this commit's snapshot to any other commit's snapshot, which will show what you did. We can easily see that you changed line 42. Tell us why you made that change: why did "the red ball" become "the blue cube", or whatever it was that you changed. What was so bad about it being red, and a ball?


Branch names help us (and Git) find commits

Now that you know about commits—what you saw above is almost everything you need to know, although there's always more to learn—let's look at how we find the commits. Remember that the hash IDs, like H and G in these illustrations, are actually big ugly random-looking things. They're impossible for humans to remember. We could write them down, but hang on, we have a computer. Computers are good at this s—t: let's have the computer write down the hash ID. Let's store hash IDs in a table of names, like branch names and tag names.

And that's what they are: a branch name is just an entry in a table that stores a hash ID. If we have the name main in the table, and main stores the hash ID of H, we say that main points to H:

...--G--H   <-- main

We might add a new branch name dev or feature or whatever and switch to using that branch name, but that name also points to H at the moment:

...--G--H   <-- dev (HEAD), main

We (and Git) will use the special name HEAD, written in all uppercase like this, to remember which name we're using. Then we'll have Git make a new commit—a new snapshot and metadata—with some changes in it vs the snapshot that Git has stored forever in H. We'll call our new commit I, and I will point back to H like this:

          I
         /
...--G--H

The sneaky trick Git pulls is that when we have the name dev as our current branch name, and we make the new commit with git commit, Git updates the branch name to make it point to the new commit. The other names, like main, don't change, so they still point to whatever commit they pointed to before; but dev, which has HEAD attached to it, now points to I:

          I   <-- dev (HEAD)
         /
...--G--H   <-- main
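Here is a small sketch of that picture, again in a scratch repository with a made-up commit() helper and demo identity. It uses git checkout -b so that it also runs on older Git versions; git switch -c dev does the same thing in Git 2.23 or later.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main    # name the first branch "main"
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# helper (made up for this sketch): one new file per commit
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit G
commit H
H=$(git rev-parse HEAD)

git checkout -q -b dev           # new name dev, also pointing to H; HEAD attaches to it
git symbolic-ref --short HEAD    # prints "dev"

commit I                         # only the name HEAD is attached to moves

echo "main: $(git rev-parse --short main)"   # still H
echo "dev:  $(git rev-parse --short dev)"    # the new commit I
git log --oneline --decorate --graph --all
```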

That's really all there is to this part of Git, but we've left out a few big important ideas, which I'll just touch on lightly.

Git's index and your working tree

The snapshots in commits are read-only and Git-only, using those compressed and de-duplicated files. But your computer programs can't read those kinds of files, and some of your programs, like your editor, need to be able to write files. So the committed files are useless! This is true of many version control systems (VCSes), not just Git.

To make the committed files useful, Git has to extract them from a commit, like extracting files from an archive. (In fact, it's exactly like extracting from a compressed archive, except that archiving software doesn't use Git's weird internal format, which is only useful to Git itself.) That's what git switch does, or git checkout in its branch-switching mode [4].

When you work with Git, you generally work with these working copies of the files, which Git puts in something Git calls your working tree or work-tree. These files are literally not in Git: at most, they were copied out of some commit, and might eventually be copied back into a future commit. Note that if you damage or destroy these files, Git can extract any of its existing archives (commits), but can't get you any modifications you made.

In other VCSes, we get to stop here: there are the commits, and there is the working tree, and the VCS makes new commits from the working tree. But Git is weird: Git makes new commits from something else. Git calls this other thing the index, or the staging area, or (rarely now) the cache. We won't go into proper detail here but this index is why you have to run git add so often: you're telling Git copy my working tree file, which I've updated, back into your index to prepare it for committing. This compresses and de-duplicates the file (at git add time, which makes the later git commit go fast—with older VCSes one might run their commit verb, then take a coffee or lunch break because nothing was going to happen for many minutes).

Because the index holds its own copies (de-duplicated "copies") of files, you can have three different versions of some file going around at once: the HEAD, or current-commit, version is frozen for all time in the current commit, whatever its hash ID is. There's a second copy (or "copy" when it's de-duplicated) of that file in the index, initially the same as the committed copy, but you can change that copy with git add; and there's a third copy of the file in your working tree, which you can change any time with any editor or anything you like.

It's therefore important to make sure that, before you run git commit, Git's index (or staging area) holds the copies you want Git to freeze for all time into the new commit. That's why you will want to run git diff and git status and the like: to see if the index matches the HEAD commit, or the working tree, or both or neither, and if it doesn't match, git diff --cached or git diff can show you what's different about it. But that's all we'll say about that here.
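The three copies are easy to see in a scratch repository. This is a sketch with a made-up file name and contents: the committed copy says v1, the index copy says v2, and the working tree copy says v3.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

echo v1 > file.txt
git add file.txt
git commit -q -m "v1"        # HEAD copy: v1, frozen in the commit

echo v2 > file.txt
git add file.txt             # index copy: v2, what the next commit would hold

echo v3 > file.txt           # working tree copy: v3, not in Git at all yet

git show HEAD:file.txt       # prints v1
git show :file.txt           # prints v2
cat file.txt                 # prints v3

git diff --cached --stat     # compares HEAD with the index
git diff --stat              # compares the index with the working tree
git status --short           # prints "MM file.txt": staged and unstaged changes
```

Running git commit at this point would freeze v2, not v3, because the commit is built from the index.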


[4] In older (pre-2.23) Git versions you have only git checkout, so you must use the older command; it's slightly dangerous due to a minor design mistake. This is corrected in Git 2.23, but you might as well start using git switch too if you have Git 2.23 or later.


Merging and merge commits

Eventually we always find ourselves having merge commits, for one reason or another. Merges are good when they do what you want. They're bad when they don't. But because we get them and make them, we need to look at what they are about and how we make them.

We'll start by imagining our little repository has these commits in it:

          I--J   <-- br1 (HEAD)
         /
...--G--H
         \
          K--L   <-- br2

That is, we have at least two branches, named br1 and br2. Both branches started from some shared commits, which are the commits up through H, but after that someone made commits I and J on br1, and someone made commits K and L on br2.

We would now like to combine the work these two "someone"s did. To do that, we pick one branch (br1) to be the current branch and commit, as we have done here. Then we run:

git merge br2

The br2 here is really just a way to have Git find the commit we want merged. Branch names are normally used to find the commit (or commits), and this is no exception: Git finds commit L because br2 points to L.

The merge code now works its way backwards, as Git tends to do, from commit J (the current or HEAD commit on br1) and from commit L, at the same time. This working-back discovers that commit H is the best shared commit, which Git calls the merge base [5].

Git now does the work-finding step. To figure out what we changed on br1, Git simply compares the snapshot in H to the snapshot in J: the tip commit of br1, where we are now. This gets a diff that has a recipe that says to change particular lines of particular files:

git diff --find-renames <hash-of-H> <hash-of-J>   # changes on br1

To figure out what they (whoever they are) did on br2, Git runs a second diff, starting from the same merge base H, but to L this time:

git diff --find-renames <hash-of-H> <hash-of-L>   # changes on br2

The merge process is now a simple matter of combining the changes. Whatever lines we changed, in whatever files, Git will make those changes, but Git will also make their changes, to whatever lines of whatever files they changed. Git will apply the combined changes to the snapshot from commit H—the shared starting point, i.e., the merge base.

Conflicts, if they occur, generally happen because we and they changed the same lines, or lines that "touch", in the same files, but we and they did different things to those lines. Git doesn't know whether to take ours, theirs, or both. Git itself simply stops with a merge conflict and makes you, the programmer, figure out the correct result. (Then you must finish the merge.)

If there are no such conflicts, however, Git will go on to make the merge commit on its own. As with any commit, the merge commit will use the current commit as a parent. What's special about a merge commit is that it will get a second parent: the commit we told Git to merge. In this case, that's commit L:

          I--J
         /    \
...--G--H      M   <-- br1 (HEAD)
         \    /
          K--L   <-- br2

Like every commit, merge commit M has a snapshot, with all of the files in it (compressed and de-duplicated as usual). Like any commit, merge commit M has metadata. The only thing special about M is that its metadata lists two parents instead of the usual one [6]. If there were merge conflicts, the snapshot in M is the one you, the programmer, told Git to make. Otherwise it's the one Git figured out on its own (unless you used --no-commit to tweak the result, as in an evil merge).

There is one especially-interesting side effect here, though. Commits K-L used to be only "on" branch br2. Now they're on both branches. Commits I-J are still only on br1. The set of branches that contain any given commit is found by starting with all branch names, finding the commits they point to, and working backwards. If this process hits the commit you want to find out about, that commit is contained within that branch. If more than one branch name contains the commit, more than one branch contains the commit. (This, too, is peculiar to Git; most version control systems don't take this view.)
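The whole merge walk-through can be replayed in a scratch repository. This is a sketch: the commit() helper, the file names, and the demo identity are made up, and each commit touches its own file so that no conflict occurs.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# helper (made up for this sketch): one new file per commit
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit G; commit H
H=$(git rev-parse HEAD)
git checkout -q -b br1;      commit I; commit J; J=$(git rev-parse HEAD)
git checkout -q -b br2 main; commit K; K=$(git rev-parse HEAD)
commit L;                    L=$(git rev-parse HEAD)
git checkout -q br1

base=$(git merge-base br1 br2)                        # the merge base: commit H
git diff --find-renames --name-only "$base" br1       # what "we" changed: I.txt, J.txt
git diff --find-renames --name-only "$base" br2       # what "they" changed: K.txt, L.txt

git merge -q -m M br2                                 # makes merge commit M on br1
git log -1 --format='%s has parents: %p'              # two parents, J first and L second

git branch --format='%(refname:short)' --contains "$K"   # br1 and br2
git branch --format='%(refname:short)' --contains "$J"   # br1 only
```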


[5] In the illustration, we can work back one step at a time on both branches and meet at H, but in real cases we might have to go back just one hop on one branch, and many hops on another, for instance. The actual algorithm for finding merge bases is the Lowest Common Ancestor algorithm. We'll skip over a number of important details here, including what Git does about the multiple-merge-bases case.

[6] A merge commit in Git can list more than two parents. We won't cover how we have Git do this, but it doesn't require anything that we haven't already covered: it's still a commit with a snapshot and metadata.


Remotes and remote-tracking names

Your branch and 'origin/devel' have diverged ...

The name origin/devel is what I call a remote-tracking name. Git calls these remote-tracking branch names, but I find the word branch here adds nothing and just makes for confusion. origin/devel is a name, and it does "track something remote", but it's not a branch name: no more than a tag name is a branch name for instance.

What's going on here is that your Git software is talking to another Git repository. Your repository has branch names, like devel or whatever. But these are your branch names: you run git switch devel, and then do some work and run git commit, and now the hash ID stored in the branch name is the hash ID of some new commit I that you made.

Your Git software must therefore be careful not to stomp on your branch names, just because their Git repository, over on GitHub or wherever it might be, has a branch name devel. So your Git will create or update the name origin/devel instead, every time you use the name origin—a remote, as Git calls it—to have your Git software call up their Git software and get any commits they have that you don't. Your Git sees that they have a branch named devel, so your Git gets their latest devel commit hash ID. If your Git repository doesn't have that commit, your Git gets that commit from them and stuffs it into your repository, along with any parent commits needed, and so on, and now you have their devel branch. But you can't have it called devel, so your Git software uses origin/devel instead:

          I--J--K   <-- devel (HEAD)
         /
...--G--H
         \
          L   <-- origin/devel

This setup is the same as when we had two branches in your repository. In fact, we do have two branches, if by branch we mean interesting subset of commits rather than thing found via branch name. (See also What exactly do we mean by "branch"?)

We can merge these:

          I--J--K
         /       \
...--G--H         M   <-- devel (HEAD)
         \ ______/
          L   <-- origin/devel

and when we do, commit L is now on devel.

The HEAD~n syntax

You ran:

git rebase -i HEAD~9

When we have simple, linear graphs:

...--G--H--I--J--K--L   <-- main (HEAD)

the tilde ~ syntax, HEAD~5 for instance, simply counts back five hops: from L to K, J, I, H, and finally G. So HEAD~5 or main~5 means commit G. There's only one path backwards and we take that one path and count hops and that's it.

When we have complicated graphs, with branch-and-merge constructions in them:

          G--H--I
         /       \
...--E--F         M--N   <-- devel (HEAD)
         \       /
          J--K--L

this is no longer the case. We now have two paths backwards. Starting from HEAD or devel, which means commit N, we can move back one hop to M. But now we can move back one more hop to either I or L. Which hop should we take?

Git's answer to this is to designate one of the parent links as the first parent. All others are "less interesting" (though if you want, you can pick them out with the hat or caret ^ suffix). So HEAD~5 here means follow the first parent out of M. Counting five hops back from N along this path, we pass M, I, H, and G and land on F: HEAD~5 means commit F.

Going one more hop, HEAD~6, means commit E. Note that having arrived at F, there's only one path backwards to E. Interestingly, a branch apart in the "forwards" direction (the direction Git doesn't use) is a merge together in the "backwards" direction that Git does use, and a merge commit acts as a branch-apart, not a merge-together, when we work backwards. Since Git does work backwards, Git's view is very different from the average human's here.
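You can check which way the tilde goes at a merge with a scratch repository shaped like the drawing. This is a sketch with a made-up commit() helper and a made-up branch name side for the lower row; the merge is run while on devel, so the upper row is the first parent.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/devel
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# helper (made up for this sketch): one new file per commit
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit E; commit F
git checkout -q -b side;  commit J; commit K; commit L   # lower row
git checkout -q devel;    commit G; commit H; commit I   # upper row
git merge -q -m M side                                   # M: first parent I, second parent L
commit N

git log -1 --format=%s HEAD~1      # prints M
git log -1 --format=%s HEAD~2      # prints I (tilde follows the first parent)
git log -1 --format=%s HEAD~1^2    # prints L (caret 2 picks the second parent)

# The whole first-parent path, newest first: N M I H G F E
git log --first-parent --format=%s
```

Counting lines in the last listing, with N as hop zero, gives the commit that any HEAD~n names in this graph.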

When you want to understand these things in detail, always look at it from Git's point of view:

and have 6 and 2 different commits each, respectively

This means that your current branch (presumably devel) has six commits on it that origin/devel lacks, and origin/devel has two commits on it that your branch lacks. If we draw that out, we get a picture like this:

       E--F--G--H--I--J   <-- devel (HEAD)
      /
...--D
      \
       K--------------L   <-- origin/devel

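If we count hops from J backwards, it takes six to get to D (I, H, G, F, E, D). So to name commit D here you'd want HEAD~6; once a merge commit sits on top of J, as it does in your repository, D is one hop further away, at HEAD~7. Alternatively, you can run git log or git log --decorate --graph --oneline and cut and paste the raw hash ID of the commit, which is often easier than counting hops like this.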
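The "6 and 2" situation can be reproduced with two scratch repositories. This is a sketch: the directory names upstream and work, the commit() helper, and the demo identity are made up, and a local directory stands in for whatever server origin really is.

```shell
set -e
top=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# helper (made up for this sketch): one new file per commit
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q "$top/upstream" && cd "$top/upstream"
git symbolic-ref HEAD refs/heads/devel
commit D                                       # the shared starting point

git clone -q "$top/upstream" "$top/work"       # "your" repository, with origin/devel at D

commit K; commit L                             # "their" two new commits

cd "$top/work"
for c in E F G H I J; do commit "$c"; done     # "your" six new commits

git fetch -q origin                            # updates origin/devel, never your devel
counts=$(git rev-list --left-right --count devel...origin/devel)
echo "$counts"                                 # 6 on the left (yours), 2 on the right (theirs)
git status                                     # reports that the two have diverged
```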

We're finally ready for rebase

As noted above, when we use git rebase, we're saying there's good stuff in our commits, but we don't like something about them. What exactly we don't like determines the kind of rebase we'll run. Rebase is a high-powered tool, and something of a Swiss Army chainsaw of a command, but at its heart it copies commits.

Before we get into the details, we should note that there is a Git command that copies one commit, that is much simpler to use: git cherry-pick. To use git cherry-pick, we check out / switch to some branch:

git switch copy-goes-here

resulting in:

...--o--o--P--C--o--o   <-- somebranch
      \
       o--o--H   <-- copy-goes-here (HEAD)

where the os represent commits whose hash IDs we don't really care about. Meanwhile commit C is the one we want to "copy". It makes some change—that is, its snapshot, when compared to that of its parent P, says change this file and/or that other file, and the change it makes is the change we want to make to commit H. Commit C also has a commit log message and an author and so on in its metadata, and git cherry-pick will copy most of these as well. So we now run:

git cherry-pick <hash-of-C>

Git finds the diff from P to C and applies that diff to our current commit H. Technically, Git uses the same internal machinery that it uses for a merge, as it runs two git diff commands, one from P to H to see "what we changed" (to keep these changes) and one from P to C to see "what they changed" (to add their changes). It then combines the changes, just like any merge. The special thing it does then, though, is to make a new commit C' that's like C but:

  • has H as its parent; and
  • has us as the committer (separate from the author in the metadata)

and it is not a merge commit. It does not "remember" the hash ID of commit C (although you can use -x to add this to the log message).

The new copy C' gets a different hash ID and we now have:

...--o--o--P--C--o--o   <-- somebranch
      \
       o--o--H--C'  <-- copy-goes-here (HEAD)

Note that Git updated the branch name as usual, so copy-goes-here now points to C', which points back to H. The original commits P and C are undisturbed (necessarily so: no commit can ever be changed).
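Here is the same cherry-pick as a sketch in a scratch repository. The commit() helper, the file names, and the demo identity are made up; the branch names match the drawing.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# helper (made up for this sketch): one new file per commit
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit base
git checkout -q -b somebranch;           commit P; commit C; C=$(git rev-parse HEAD)
git checkout -q -b copy-goes-here main;  commit H;           H=$(git rev-parse HEAD)

git cherry-pick "$C"                 # copy the change P -> C on top of H

copy=$(git rev-parse HEAD)
git log -1 --format='%h %s, parent %p'   # same subject as C, but the parent is H
ls                                   # C.txt arrived; P.txt did not, since only C's change was copied
echo "original C: $C"
echo "copy C':    $copy"
```

Adding -x to the cherry-pick would record the original hash ID in the new log message.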

A typical rebase is used to, well, re-do the "base" of some branch:

          A--B--C   <-- topic (HEAD)
         /
...--o--*--D--E   <-- origin/topic

What we like about commits A-B-C is, well, everything except that the parent of commit A is commit *. We want to put A-B-C after commit E:

          A--B--C   [abandoned]
         /
...--o--*--D--E   <-- origin/topic
               \
                A'-B'-C'  <-- topic (HEAD)

The original A-B-C chain of commits can't be changed, or even destroyed yet, but we can simply stop using them. Git forces the name topic to point to C' instead of C, in the end. Since humans don't remember hash IDs, they'll use the name topic to find the latest commit and get C' instead of C. They're so dumb, they'll think C' is C. And that's quite often exactly what we want. (It's probably something you would like, in this case, but you can do that later.)

Anyway, for rebase to do its job, then, Git needs to know:

  • Which commits should Git copy?
  • Where should Git put the copies?

The git rebase command cleverly gets both pieces of information from a single argument. We run:

git switch topic

so that we're on the right branch to begin with. Then we run:

git rebase origin/topic

The name origin/topic selects commit E. That's a commit Git should not copy, but Git uses its usual scan-backwards trick too: Git won't copy D either, nor *, nor any of the earlier commits.

The commits that Git will copy, if not stopped by a "don't copy", are those that end where the current commit is, i.e., commits C, and B, and A, and *, and everything earlier. But the origin/topic tells Git: don't copy * or anything earlier. So that stops Git from copying too many commits.

Meanwhile, the where to put the copies is: after commit E, as found by the name origin/topic. So the copies go after E.

A non-interactive rebase now simply checks out commit E directly, then copies A, B, and C one at a time, as if with, or literally with, git cherry-pick. Then it yanks the name topic—the branch we were on—over to point to the last-copied commit.

An interactive rebase does the same thing, but opens the editor on the list of pick instructions for each cherry-pick.
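The A-B-C example can be tried out as a sketch in a scratch repository. There is no real remote here: the name origin/topic is created by hand with git update-ref as a stand-in, and the commit() helper, file names, and demo identity are likewise made up.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# helper (made up for this sketch): one new file per commit
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit star                                   # the commit drawn as *
git checkout -q -b topic; commit A; commit B; commit C
git checkout -q main;     commit D; commit E
# stand-in for a fetched remote-tracking name pointing to E
git update-ref refs/remotes/origin/topic "$(git rev-parse main)"

git checkout -q topic
old=$(git rev-parse topic)
git log --format=%s origin/topic..topic       # commits to copy: C, B, A

git rebase -q origin/topic                    # copy A, B, C so that they come after E

git log --format=%s origin/topic..topic       # still C, B, A by subject, with new hash IDs
echo "old tip: $old"
echo "new tip: $(git rev-parse topic)"
git cat-file -t "$old"                        # the abandoned original still exists
```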

Rebasing with merges

In your case, you made a mistake of sorts (not really a big one, but it would have been easier to do this without it): you had Git merge your devel with origin/devel. That is, from:

       E--F--G--H--I--J   <-- devel (HEAD)
      /
...--D
      \
       K--------------L   <-- origin/devel

you had Git run git merge origin/devel or equivalent. (You probably ran git pull: Git newbies should generally avoid git pull, in my opinion, but that's just an opinion.)

The result was this:

             E--F--G--H--I--J
            /                \
...--B--C--D                  M   <-- devel (HEAD)
            \                /
             K--------------L   <-- origin/devel

You then ran git rebase -i HEAD~9. Counting 9 hops backwards, along the first parent of merge commit M, reaches commit B. So the commit you tell git rebase -i not to copy is commit B, or anything earlier.

Rebase now lists out all the commits that it should copy. As you found, the default is to drop merge commits entirely. The remaining commit list is: C, D, then one of E or K and then the rest of the commits along that row [7], then the commits along the other row. Commit M, since it is a merge, is dropped from the pick list.

If you leave the list unmodified and let the rebase operation run, it will copy this set of commits, one at a time. The copies will go after commit B, so we'll have something like this:

...--B--C'-D'-E'-...-L'  <-- devel (HEAD)
      \
       C--D--...--M   [abandoned]
           \     /
            K---L   <-- origin/devel

However, git rebase tries to be clever, if it can, and re-use C and D directly; it can, so it will, unless you tell it not to; but the effect is very similar to what I drew here. If you alter the pick lines, you can of course get many other effects, and we don't know whether K'-L' come right after D or right before the end.

If you do prefer the merge, however, you can use git rebase -i -r. When you do this you get a more complicated instruction sheet, with label commands in it. This helps the rebase code rebuild the merges. Instead of just running a series of cherry-pick commands, Git can remember hash IDs as it works, and then run git merge commands. So it can produce this:

             EFI-H'-GJ
            /         \
...--B--C--D           M'  <-- devel (HEAD)
            \         /
             K-------L   <-- origin/devel

if you want, by leaving the original D and K and L alone, combining E and F and I into one new commit, copying H to H', combining G and J into one commit, and then doing a git merge to merge GJ and L together to make M'. (As usual the originals will still all be in the repository, undisturbed, just a bit hard to find: ORIG_HEAD and reflogs will hold their hash IDs for some time though.)

Using git rebase -i -r <hash-of-D> (or HEAD~7, but I usually find git log and cut-and-paste is easier here) would be a way to do this all at once. It all depends on what you want as your final outcome, and how comfortable you are trying to do it in a single step, vs multiple separate steps.
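To compare the two behaviors, here is a sketch that rebuilds a graph shaped like yours in a scratch repository. The branch name theirs stands in for origin/devel, the commit() helper and demo identity are made up, and GIT_SEQUENCE_EDITOR=true simply accepts the instruction sheet unmodified. The -r part assumes Git 2.18 or later.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/devel
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# helper (made up for this sketch): one new file per commit
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit B; commit C; commit D
git checkout -q -b theirs; commit K; commit L        # stand-in for origin/devel
git checkout -q devel
for c in E F G H I J; do commit "$c"; done
git merge -q -m M theirs
M=$(git rev-parse HEAD)
B=$(git rev-parse HEAD~9)                            # nine first-parent hops back

git log -1 --format=%s "$B"                          # prints B
todo=$(git rev-list --no-merges --count "$B..$M")    # 10: C, D, E..J, K, L (M is dropped)
echo "commits a default rebase would list: $todo"

# Default behavior: the merge is dropped and the result is linear.
GIT_SEQUENCE_EDITOR=true git rebase -q -i "$B"
plain_total=$(git rev-list --count "$B"..HEAD)
plain_merges=$(git rev-list --merges --count "$B"..HEAD)
echo "plain rebase: $plain_total commits, $plain_merges merges"

# Put devel back, then let rebase re-perform the merge instead.
git reset -q --hard "$M"
GIT_SEQUENCE_EDITOR=true git rebase -q -i -r "$B"
kept_merges=$(git rev-list --merges --count "$B"..HEAD)
echo "rebase -r:    $kept_merges merge kept"
```

The count of 10 matches the ten pick lines you saw: eight commits of your own line of history plus the two that came in through the merge.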


[7] Rebase uses --topo-order internally to ensure this; without --topo-order there are more possible orders that intermix the rows.

torek