- Is there any difference between the number of conflicts when doing merge to a branch as opposed to rebase a branch? why is that?
The verb "is" is, I think, overreach here. If we change that to "can there be", the answer is definitely yes. The reason is straightforward: rebase and merge are fundamentally different operations.
- When doing a merge the merging changes are stored in the merge commit itself (the commit with the two parents). But when doing a rebase, where is the merge being stored?
This question presupposes something that's not the case. The error is minor in some respects, but it stops being minor once we try to explain what's really going on.
Specifically, to understand all of this, we need to know:
- what commits are, exactly (or at least in pretty good detail);
- how branch names work;
- how merge works, reasonably-exactly; and
- how rebase works, reasonably-exactly.
Any small errors in each of these get magnified when we combine them, so we need to be pretty detailed. It will help to break rebase down a bit, as rebase is essentially a series of repeated cherry-pick operations, with a bit of surrounding stuff. So we'll add "how cherry-pick works" to the above.
Commits are numbered
Let's start with this: Each commit is numbered. The number on a commit is not a simple counting number, though: we don't have commit #1, followed by #2, then #3, and so on. Instead, each commit gets a unique but random-looking hash ID. This is a very big number (currently 160 bits long) represented in hexadecimal. Git forms each number by doing a cryptographic checksum over the contents of each commit.
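As a rough illustration of how such a number comes about, here is a minimal Python sketch. It assumes the SHA-1 object format, in which Git hashes a short header (object type and size), a NUL byte, and then the object's contents. The commit text itself, including the author line, is made up for the example.

```python
import hashlib

def git_object_id(kind: str, body: bytes) -> str:
    """Hash an object the SHA-1 way: '<type> <size>', a NUL byte, then the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

# Hypothetical commit text; the identity lines and message are invented.
commit_body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"author A U Thor <author@example.com> 1700000000 +0000\n"
    b"committer A U Thor <author@example.com> 1700000000 +0000\n"
    b"\n"
    b"initial commit\n"
)

original = git_object_id("commit", commit_body)
# Change a single letter of the log message and hash again.
altered = git_object_id("commit", commit_body.replace(b"initial", b"Initial"))

print(original)
print(altered)
print(len(original) * 4, "bits")  # 40 hex digits, 4 bits each
```

The two printed IDs differ even though only one character changed, which is why no part of a commit can be altered without producing a different commit number.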
This is the key to making Git work as a Distributed Version Control System (DVCS): a centralized VCS like Subversion can give every revision a simple counting number, because there is in fact a central authority that hands out these numbers. If you can't reach the central authority at the moment, you cannot make a new commit either. So in SVN, you can only commit when the central server is available. In Git, you can commit locally, any time: there is no designated central server (though of course you can pick any Git server and call it "the central server" if you like).
This matters most when we connect two Gits to each other. They will use the same number for any commit that is bit-for-bit identical, and a different number for any commit that isn't. That's how they can figure out whether they have the same commits; that's how the sending Git can send to the receiving Git, any commits that the sender and receiver agree that the receiver needs and the sender wants the receiver to have, while still minimizing data transfer. (There's more to it than just this, but the numbering scheme is at the heart of it.)
Now that we know that commits are numbered—and, based on the numbering system, that no part of any commit can change either, once it's made, since this just results in a new and different commit with a different number—we can look at what's actually in each commit.
Commits store snapshots and metadata
Each commit has two parts:
A commit has a full snapshot of every file that Git knew about, at the time you, or whoever, made that commit. The files in the snapshot are stored in a special, read-only, Git-only, compressed and de-duplicated format. The de-duplication means that there's no penalty if there are thousands of commits that all have the same copy of some file: those commits all share that file. Since most new commits one makes mostly have the same versions of the same files as some or most earlier commits, the repository doesn't really grow much at all, even though every commit has every file.
Apart from the files, each commit stores some metadata, or information about the commit itself. This includes things like the author of the commit and some date-and-time-stamps. It includes a log message, where you get to explain to yourself and/or others why you made this particular commit. And—key to Git's operation, but not something you manage yourself—each commit stores the commit number, or hash ID, of some previous commit or commits.
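The de-duplication idea can be sketched in a few lines of Python. This is a toy model, not Git's actual storage code: the names object_store, store_blob, and make_snapshot are invented here, and it hashes the raw contents directly rather than using Git's real object format.

```python
import hashlib

object_store = {}  # hash ID -> file contents, each distinct content stored once

def store_blob(data: bytes) -> str:
    """Store file contents under their hash; identical contents share one entry."""
    blob_id = hashlib.sha1(data).hexdigest()
    object_store[blob_id] = data
    return blob_id

def make_snapshot(files: dict) -> dict:
    """A snapshot maps every file name to the hash ID of that file's contents."""
    return {name: store_blob(data) for name, data in files.items()}

# Two full snapshots; only main.py differs between them.
snap_g = make_snapshot({"README": b"hello\n", "main.py": b"print(1)\n"})
snap_h = make_snapshot({"README": b"hello\n", "main.py": b"print(2)\n"})

stored_blobs = len(object_store)                   # 3, not 4
shared_readme = snap_g["README"] == snap_h["README"]
print(stored_blobs, shared_readme)
```

Both snapshots list every file, yet the unchanged README is stored only once, so the second snapshot costs just one new blob.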
Most commits store just one previous commit. The goal with this previous commit hash ID is to list the parent or parents of the new commit. This is how Git can figure out what changed, even though each commit has a snapshot. By looking up the previous commit, Git can obtain the previous commit's snapshot. Git can then compare the two snapshots. The de-duplication makes this even easier than it would be otherwise. Any time the two snapshots have the same file, Git can just say nothing at all about this. Git only has to compare files when they are actually different in the two snapshots. Git uses a difference engine to figure out what changes will take the older (or left-hand-side) file and convert it to the newer (right-hand-side) file, and shows you those differences.
You can use this same difference engine to compare any two commits or files: just give it a left and right side file to compare, or a left and right side commit. Git will play the Spot the Difference game and tell you what changed. This will matter for us later. For now, though, just comparing parent and child, for any simple one-parent-one-child commit pair, will tell us what changed in that commit.
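Here is a small Python sketch of that Spot the Difference game, using the standard difflib module in place of Git's own difference engine (which is an assumption made for illustration; Git does not use difflib). The function name diff_snapshots and the file contents are invented.

```python
import difflib

def diff_snapshots(left: dict, right: dict) -> list:
    """Compare two snapshots (file name -> text): left is older, right is newer."""
    output = []
    for name in sorted(set(left) | set(right)):
        old, new = left.get(name, ""), right.get(name, "")
        if old == new:
            continue  # same file on both sides: say nothing at all
        output.extend(difflib.unified_diff(
            old.splitlines(), new.splitlines(),
            fromfile="a/" + name, tofile="b/" + name, lineterm=""))
    return output

parent = {"README": "hello\n", "main.py": "x = 1\nprint(x)\n"}
child = {"README": "hello\n", "main.py": "x = 2\nprint(x)\n"}

changes = diff_snapshots(parent, child)
print("\n".join(changes))
```

The unchanged README never shows up in the output; only main.py is compared line by line.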
For simple commits with one child pointing backwards to one parent, we can draw this relationship. If we use single uppercase letters to stand in for hash IDs—because real hash IDs are too big and ugly for humans to work with—we get a picture that looks like this:
... <-F <-G <-H
Here, H stands in for the last commit in the chain. It points backwards to earlier commit G. Both commits have snapshots and parent hash IDs. So commit G points backwards to its parent F. Commit F has a snapshot and metadata, and therefore points backwards to yet another commit.
If we have Git start at the end, and just go backwards one commit at a time, we can get Git to go all the way back to the very first commit. That first commit won't have a backwards-pointing arrow coming out of it, because it can't, and that will let Git (and us) stop and rest. That's what git log does, for instance (at least for the simplest case of git log).
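The backwards walk is simple enough to sketch in Python. The table below is a made-up four-commit history in which E is assumed to be the very first commit (the drawing above elides everything before F), and walk_back is a name invented for this example.

```python
# Hypothetical history; single letters stand in for hash IDs.
# Each commit maps to the list of its parents; the first commit has none.
parents = {"E": [], "F": ["E"], "G": ["F"], "H": ["G"]}

def walk_back(tip: str) -> list:
    """Follow parent links from tip until reaching a commit with no parent."""
    visited = []
    current = tip
    while current is not None:
        visited.append(current)
        links = parents[current]
        current = links[0] if links else None  # no parent: stop and rest
    return visited

history = walk_back("H")
print(history)  # ['H', 'G', 'F', 'E']
```

Note that the walk needs a starting point: nothing in the table says that H is the last commit, which is the problem the next section addresses.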
We do, however, need a way to find the last commit. This is where branch names come in.
A branch name points to a commit
A Git branch name holds the hash ID of one commit. By definition, whatever hash ID is stored in that branch name, is the end of the chain for that branch. The chain might keep going, but since Git works backwards, that's the end of that branch.
This means that if we have a repository with only one branch (let's call it main, as GitHub do now), there's some last commit and its hash ID is in the name main. Let's draw that:
...--F--G--H <-- main
I've gotten lazy and stopped drawing the arrows from commits as arrows. This is also because we're about to have an arrow-drawing problem (at least on StackOverflow where the fonts are potentially limited). Note that this is the same picture we had a moment ago; we've just figured out how we remember the hash ID of commit H: by sticking it into a branch name.
Let's add a new branch. A branch name has to hold the hash ID of some commit. Which commit should we use? Let's use H: it's the commit we're using now, and it's the latest, so it makes a lot of sense here. Let's draw the result:
...--F--G--H <-- dev, main
Both branch names pick H as their "last" commit. So all commits up through and including H are on both branches. We need one more thing: a way to remember which name we're using. Let's add the special name HEAD, and write it in after one branch name, in parentheses, to remember which name we're using:
...--F--G--H <-- dev, main (HEAD)
This means we're on branch main, as git status would say. Let's run git checkout dev or git switch dev and update our drawing:
...--F--G--H <-- dev (HEAD), main
We can see that HEAD is now attached to the name dev, but we're still using commit H.
Let's make a new commit now. We'll use the usual procedures (without describing them here). When we run git commit, Git will make a new snapshot and add new metadata. We might have to enter a commit message first, to go into the metadata, but one way or another we'll get there. Git will write all of this out to make a new commit, which will get a new, unique, big ugly hash ID. We'll just call this commit I instead though. Commit I will point back to H, because we were using H up until this moment. Let's draw in the commit:
             I
            /
...--F--G--H
But what about our branch names? Well, we didn't do anything to main. We added a new commit, and this new commit should be the last commit on branch dev. To make that happen, Git simply writes I's hash ID into the name dev, which Git knows is the right name, because that's the name HEAD is attached to:
             I   <-- dev (HEAD)
            /
...--F--G--H   <-- main
and we have exactly what we want: the last commit on main is still H but the last commit on dev is now I. Commits up through H are still on both branches; commit I is only on dev.
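The bookkeeping so far fits in a tiny Python model. Everything here is a simplification invented for illustration: branch names are a dictionary from name to commit letter, head holds the name HEAD is attached to, and commit is a stand-in for what git commit does to the graph.

```python
parents = {"H": ["G"]}                 # commit -> list of parent commits
branches = {"main": "H", "dev": "H"}   # branch name -> last commit on that branch
head = "dev"                           # HEAD is attached to this branch name

def commit(new_id: str) -> None:
    """Add a commit whose parent is the current commit, then move the current branch."""
    parents[new_id] = [branches[head]]
    branches[head] = new_id            # only the name HEAD is attached to moves

commit("I")
print(branches)  # {'main': 'H', 'dev': 'I'}
```

The name main is untouched, so H stays the last commit on main, while dev now ends at I.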
We can add more branch names, pointing to any of these commits. Or, we can now run git checkout main or git switch main. If we do that, we get:
             I   <-- dev
            /
...--F--G--H   <-- main (HEAD)
Our current commit is now commit H, because our current name is main, and main points to H. Git takes all the commit-I files out of our working tree and puts into our working tree all the commit-H files instead.
(Side note: the working tree files are not in Git themselves. Git just copies the Git-ified, committed files from the commits to our working tree here. That's part of the action of a checkout or switch: we pick some commit, usually through some branch name, and have Git erase the files from the commit we were working with, and put in the chosen commit's files instead. There's a lot of fancy mechanism hidden inside this, but we'll ignore all of that here.)
We're now ready to go on to git merge. It's important to note that git merge does not always do any actual merging. The description below will start with a setup that requires a real merge, and therefore, running git merge will do a true merge. A true merge can have merge conflicts. The other things that git merge does (the so-called fast-forward merge, which isn't really a merge at all, and the cases where it just says no and doesn't do anything) can't actually have merge conflicts.
How a true merge works
Let's say that at this point, in our Git repository, we have these two branches arranged like this:
          I--J   <-- branch1 (HEAD)
         /
...--G--H
         \
          K--L   <-- branch2
(There might be a branch name pointing to H, or some other commit, but we won't bother drawing it in as it doesn't matter for our merging process.) We're "on" branch1, as you can see from the drawing, so we have commit J checked out right now. We run:
git merge branch2
Git will now locate commit J, which is trivial: that's the one we're sitting on. Git will also locate commit L, using the name branch2. That's easy because the name branch2 has the raw hash ID of commit L in it. But now git merge does the first of its main tricks.
Remember, the goal of a merge is to combine changes. Commits J and L don't have changes though. They have snapshots. The only way to get changes from some snapshot is to find some other commit and compare.
Directly comparing J and L might do something, but it doesn't do much good in terms of actually combining two different sets of work. So that's not what git merge does. Instead, it uses the commit graph (the things we've been drawing with the uppercase letters standing in for commits) to find the best shared commit that's on both branches.
This best shared commit is actually the result of an algorithm called the Lowest Common Ancestors of a Directed Acyclic Graph, but for a simple case like this one, it's pretty obvious. Start at both branch tip commits J and L, and use your eyeball to work backwards (leftwards). Where do the two branches come together? That's right, it's at commit H. Commit G is shared too, but H comes closer to the ends than G, so it's obviously (?) better. So it's the one that Git picks here.
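The eyeball method can be written down as a brute-force Python sketch. This is not how Git implements it, and it is far too slow for real repositories; it only shows the definition. The parents table mirrors the drawing above, with G treated as the first commit purely to keep the example finite.

```python
# The drawing above as a table; G is assumed to be the root for this example.
parents = {
    "G": [], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
}

def ancestors(tip: str) -> set:
    """Every commit reachable from tip by walking backwards, including tip itself."""
    seen, stack = set(), [tip]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def merge_bases(a: str, b: str) -> set:
    """Shared commits that are not a proper ancestor of some other shared commit."""
    shared = ancestors(a) & ancestors(b)
    return {c for c in shared
            if not any(c in ancestors(other) - {other} for other in shared)}

print(merge_bases("J", "L"))  # {'H'}
```

Commits G and H are both shared, but G is an ancestor of H, so it is dropped and H is the only one left. The result is a set because more complicated graphs can have more than one best shared commit.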
Git calls this shared starting point the merge base. Git can now do a diff, from commit H to commit J, to figure out what we changed. This diff will show some change(s) to some file(s). Separately, Git can now do a diff from commit H to commit L, to figure out what they changed. This diff will show some change(s) to some file(s): maybe entirely different files, or maybe, where we both changed the same files, we changed different lines of those files.
The job of git merge is now to combine the changes. By taking our changes and adding theirs (or taking theirs and adding ours, which gives the same results) and then applying the combined changes to whatever is in commit H, Git can build up a new, ready-to-go snapshot.
This process fails, with merge conflicts, when "our" and "their" changes collide. If we and they both touched the same line(s) of the same files, Git doesn't know whose change to use. We'll be forced to fix up the mess and then continue the merge.
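Here is a deliberately coarse Python sketch of combining changes against a merge base. The big assumption, worth stating loudly, is that it works on whole files: Git itself merges line by line within each file, so two edits to different lines of one file would not conflict in Git but do here. The function name and the file contents are invented.

```python
def three_way_merge(base: dict, ours: dict, theirs: dict):
    """Combine two sets of changes against a shared base, one whole file at a time."""
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t:        # both sides agree (including "nobody touched it")
            result = o
        elif o == b:      # only they changed it: take theirs
            result = t
        elif t == b:      # only we changed it: keep ours
            result = o
        else:             # both changed it, differently: collision
            conflicts.append(name)
            result = o    # placeholder only; a human has to resolve this
        if result is not None:   # None means the file is absent or deleted
            merged[name] = result
    return merged, conflicts

snapshot_h = {"a.txt": "1", "b.txt": "1", "c.txt": "1"}       # merge base
snapshot_j = {"a.txt": "2", "b.txt": "1", "c.txt": "ours"}    # our branch tip
snapshot_l = {"a.txt": "1", "b.txt": "2", "c.txt": "theirs"}  # their branch tip

merged, conflicts = three_way_merge(snapshot_h, snapshot_j, snapshot_l)
print(merged)
print(conflicts)  # ['c.txt']
```

Our change to a.txt and their change to b.txt combine cleanly; c.txt, which both sides changed, is reported as a conflict. Swapping the ours and theirs arguments gives the same list of conflicts, matching the point that the order of combining does not matter.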
There's a great deal to know about how this fixing-up goes and how we can automate more of it, but for this particular answer, we can stop here: we either have conflicts, and have to fix them up manually and run git merge --continue (see footnote 1), or we have no conflicts and Git will finish off the merge itself. The merge commit gets a new snapshot (not changes, but rather a full snapshot) and then links back to both commits: its first parent is our current commit as usual, and then it has, as a second parent, the commit we said to merge. So the resulting graph looks like this:
          I--J
         /    \
...--G--H      M   <-- branch1 (HEAD)
         \    /
          K--L   <-- branch2
Merge commit M has a snapshot, and if we run git diff hash-of-J hash-of-M, we'll see the changes we brought in because of "their" work in their branch: the changes from H to L that got added to our changes from H to J. If we run git diff hash-of-L hash-of-M, we'll see the changes brought in because of "our" work in our branch: the changes from H to J that got added to their changes from H to L. Of course, if the merge stops for any reason before making commit M, we can make arbitrary changes to the final snapshot for M, making what some call an "evil merge" (see Evil merges in git?).
(This merge commit is also a bit of a stumbling block for git log later, because:
- There's no way to generate a single ordinary diff: which parent should it use?
- There are two parents to visit, as we traverse backwards: how will it visit both? Will it visit both?
These questions and their answers are rather complex, but are not for this StackOverflow answer.)
Next, before we move on to rebase, let's look closely at git cherry-pick.
Footnote 1: Instead of git merge --continue, you can run git commit. This winds up doing exactly the same thing. The merge program leaves breadcrumbs, and git commit finds them, realizes it's finishing the merge, and does what git merge --continue would do rather than making a simple single-parent commit. In the bad old days, when Git's user interface was much worse, there was no git merge --continue, so those of us with very old habits tend to use git commit here.
How git cherry-pick works
At various times, when working with any version control system, we will find some reason that we'd like to "copy" a commit, as it were. Suppose, for instance, that we have the following situation:
          H--P--C--J   <-- feature1
         /
...--G--I   <-- main
         \
          K--L--N   <-- feature2 (HEAD)
Someone is working on feature1, and has been for a bit; we're working on feature2 right now. I've named two commits on branch feature1 P and C for a reason that isn't obvious yet, but will become obvious. (I skipped M just because it sounds too much like N, and I like to use M for Merge.) As we go to make a new commit O, we realize that there's a bug, or a missing feature, that we need, that the guys doing feature1 already fixed or wrote. What they did was to make some changes between parent commit P and child commit C, and we'd like those exact same changes now, here, on feature2.
(Cherry-picking here is often the wrong way to do this, but let's illustrate it anyway, since we need to show how cherry-pick works, and doing it "right" is more complicated.)
To make a copy of commit C, we just run git cherry-pick hash-of-C, where we find the hash of commit C by running git log feature1. If all goes well, we end up with a new commit, C' (so named to indicate that it's a copy of C, sort of), that goes on the end of our current branch:
          H--P--C--J   <-- feature1
         /
...--G--I   <-- main
         \
          K--L--N--C'   <-- feature2 (HEAD)
But how does Git achieve this cherry-pick commit?
The simple (but not quite right) explanation is to say that Git compares the snapshots in P and C to see what someone changed there. Then Git makes the same changes to the snapshot in N to make C', though of course the parent (singular) of C' is commit N, not commit P.
But this doesn't show how cherry-pick can have merge conflicts. The real explanation is more complicated. The way cherry-pick really works is to borrow that merge code from earlier. Instead of finding an actual merge base commit, though, cherry-pick just forces Git to use commit P as the "faked" merge base. It sets commit C to be "their" commit. That way, "their" changes will be P-vs-C. That's exactly the changes we'd like to add to our commit N.
To make those changes go in smoothly, the cherry-pick code goes on to use the merge code. It says that our changes are P vs N, because our current commit is commit N when we start the whole thing. So Git diffs P vs N to see what "we" changed in "our branch". The fact that P isn't even on our branch (it's only on feature1) is not important. Git wants to be sure that it can fit the P-vs-C changes in, so it looks at the P-vs-N difference to see where to put the P-vs-C changes in. It combines our P-vs-N changes with their P-vs-C changes, and applies the combined changes to the snapshot from commit P. So the whole thing is a merge!
When the combining goes well, Git takes the combined changes, applies them to what's in P, and gets commit C', which it makes on its own as a normal, single-parent commit with parent N. That gets us the result we wanted.
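To make the "cherry-pick is a merge with a forced base" idea concrete, here is a Python sketch under the same whole-file simplification used for merging (Git really merges line by line). The file names, contents, and the cherry_pick helper are all hypothetical.

```python
def three_way_merge(base, ours, theirs):
    """Whole-file three-way merge; returns (merged snapshot, conflicted file names)."""
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t or t == b:   # same on both sides, or only we changed it
            result = o
        elif o == b:           # only they changed it
            result = t
        else:                  # both changed it differently
            conflicts.append(name)
            result = o
        if result is not None:
            merged[name] = result
    return merged, conflicts

snapshots = {
    "P": {"lib.py": "buggy", "feature1.py": "draft"},
    "C": {"lib.py": "fixed", "feature1.py": "draft"},  # P-vs-C is the fix we want
    "N": {"lib.py": "buggy", "feature2.py": "ours"},   # our current commit
}
parents = {"C": ["P"], "N": ["L"]}

def cherry_pick(commit, head):
    """Copy commit onto head by merging with the commit's own parent as forced base."""
    base = snapshots[parents[commit][0]]   # P, even though P is not on our branch
    merged, conflicts = three_way_merge(base, snapshots[head], snapshots[commit])
    if conflicts:
        raise RuntimeError("conflicts in: " + ", ".join(conflicts))
    copy = commit + "'"
    snapshots[copy] = merged
    parents[copy] = [head]                 # ordinary single-parent commit
    return copy

new_commit = cherry_pick("C", "N")
print(new_commit, snapshots[new_commit], parents[new_commit])
```

The copy picks up the fix to lib.py, keeps our feature2.py, and does not drag in feature1.py, because from base P's point of view the absence of that file is one of "our" changes.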
When the combining does not go well, Git leaves us with the exact same mess we'd get for any merge. The "merge base" is what is in commit P this time, though. The "ours" commit is our commit N, and the "theirs" commit is their commit C. We're now responsible for fixing up the mess. When we are done, we run:
git cherry-pick --continue
to finish off the cherry-pick (see footnote 2). Git then makes commit C' and we get what we wanted.
Side note: git revert and git cherry-pick share most of their code. A revert is achieved by doing the merge with parent and child swapped. That is, git revert C has Git find P and C and HEAD, but this time, does the merge with C as the base, P as "their" commit, and HEAD as our commit. If you work through a few examples, you'll see that this achieves the right result. The other tricky bit here is that an en-masse cherry-pick has to work "left to right", older commit to newer, while an en-masse revert has to work "right to left", newer commit to older. But now it's time to move on to rebase.
Footnote 2: As in footnote 1 for merge, we can use git commit here too, and in the bad old days there was probably a time when one had to, although I think by the time I used Git (or at least the cherry-picking feature) the thing that Git calls the sequencer was in place and git cherry-pick --continue worked.
How rebase works
The rebase command is very complicated, with a whole lot of options, and we won't cover all of it by any means here. What we'll look at is in part a recap of what Mark Adelsberger got into his answer while I was typing all of this.
Let's go back to our simple merge setup:
          I--J   <-- branch1 (HEAD)
         /
...--G--H
         \
          K--L   <-- branch2
If, instead of git merge branch2, we run git rebase branch2, Git will:
- List out commits (hash IDs) that are reachable from HEAD / branch1, but not reachable from branch2. These are the commits that are only on branch1. In our case that's commits J and I.
- Make sure the list is in "topological" order, i.e., I first, then J. That is, we want to work left-to-right, so that we always add later copies atop earlier copies.
- Knock out of the list any commits that for some reason should not be copied. This is complicated, but let's just say that no commits get knocked out: that's a pretty common case.
- Use Git's detached HEAD mode to begin cherry-picking. This amounts to running git switch --detach branch2.
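The first two steps, listing the commits and ordering them, can be sketched in Python. As before, the parents table mirrors the drawing with G assumed to be the root, and commits_to_copy is a name made up for this example; the knock-out step is skipped entirely.

```python
parents = {
    "G": [], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
}

def ancestors(tip: str) -> set:
    """Every commit reachable from tip, including tip itself."""
    seen, stack = set(), [tip]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def commits_to_copy(head: str, upstream: str) -> list:
    """Commits reachable from head but not from upstream, parents before children."""
    only_ours = ancestors(head) - ancestors(upstream)
    ordered = []

    def visit(commit: str) -> None:
        if commit in only_ours and commit not in ordered:
            for parent in parents[commit]:
                visit(parent)        # make sure older commits are listed first
            ordered.append(commit)

    visit(head)
    return ordered

todo = commits_to_copy("J", "L")
print(todo)  # ['I', 'J']
```

This is the same set of commits that the range notation branch2..branch1 selects, here arranged oldest first so that each copy can be stacked on the previous one.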
We haven't mentioned detached HEAD mode yet. When in detached HEAD mode, the special name HEAD doesn't hold a branch name. Instead, it holds a commit hash ID directly. We can draw this state like this:
          I--J   <-- branch1
         /
...--G--H
         \
          K--L   <-- HEAD, branch2
Commit L is now the current commit but there is no current branch name. This is what Git means by the term "detached HEAD". In this mode, when we make new commits, HEAD will point directly to those new commits.
Next, Git will run the equivalent of git cherry-pick for each commit it still has in its list, after the knocking-out step. Here, that's the actual hash IDs of commits I and J, in that order. So we run one git cherry-pick hash-of-I first. If all works well, we get:
          I--J   <-- branch1
         /
...--G--H
         \
          K--L   <-- branch2
              \
               I'  <-- HEAD
During the copying process, the "base" here is commit H (parent of I), "their" commit is our commit I, and "our" commit is their commit L. Note how the ours and theirs notions appear swapped around at this point. If there's a merge conflict (which can happen because this is a merge) the ours commit will be theirs and the theirs commit will be ours!
If all goes well, or you have fixed any issues and used git rebase --continue to continue the rebase, we now have I' and we begin copying commit J. The end goal of this copying is:
          I--J   <-- branch1
         /
...--G--H
         \
          K--L   <-- branch2
              \
               I'-J'  <-- HEAD
If something goes wrong, you'll get a merge conflict. This time the base commit will be I (which is one of ours) and the theirs commit will be J (still one of ours). The really confusing part is that the ours commit will be commit I': the one we just made, just now!
If there were more commits to copy, this process would repeat. Each copy is a potential place to experience merge conflicts. How many actual conflicts occur depends heavily on the various commits' contents, and whether you do something, during a conflict resolution of some earlier commit, that will set up a conflict when cherry-picking a later commit. (I've had situations where every single commit being copied has the same conflict, over and over again. Using git rerere is very helpful here, although a bit scary sometimes.)
Once all the copying is done, git rebase works by yanking the branch name off the commit that used to be the branch tip, and pasting it to the commit HEAD now names:
          I--J   ???
         /
...--G--H
         \
          K--L   <-- branch2
              \
               I'-J'  <-- HEAD, branch1
The old commits are now hard to find. They are still in your repository, but if you don't have another name that lets you find them, they seem to be gone! Last, just before returning control to you, git rebase re-attaches HEAD:
          I--J   ???
         /
...--G--H
         \
          K--L   <-- branch2
              \
               I'-J'  <-- branch1 (HEAD)
so that git status says on branch branch1 again. Running git log, you see commits that have the same log message as your original commits. It seems as though Git has somehow transplanted those commits. It hasn't: it has made copies. The originals are still there. The copies are the rebased commits, and make up the rebased branch, in the way humans think of branches (though Git doesn't: Git uses hash IDs, and these are clearly different).
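Putting the pieces together, the whole rebase can be sketched as repeated cherry-picks followed by a branch-name move. This is a toy model under loud assumptions: whole-file merging instead of Git's line-level merging, invented file contents, the list of commits to copy passed in by hand, and no support for stopping and continuing after a conflict.

```python
def three_way_merge(base, ours, theirs):
    """Whole-file three-way merge; returns (merged snapshot, conflicted file names)."""
    merged, conflicts = {}, []
    for name in sorted(set(base) | set(ours) | set(theirs)):
        b, o, t = base.get(name), ours.get(name), theirs.get(name)
        if o == t or t == b:
            result = o
        elif o == b:
            result = t
        else:
            conflicts.append(name)
            result = o
        if result is not None:
            merged[name] = result
    return merged, conflicts

parents = {"H": ["G"], "I": ["H"], "J": ["I"], "K": ["H"], "L": ["K"]}
snapshots = {
    "H": {"a": "1", "b": "1", "c": "1"},
    "I": {"a": "2", "b": "1", "c": "1"},   # I changed file a
    "J": {"a": "2", "b": "2", "c": "1"},   # J changed file b
    "K": {"a": "1", "b": "1", "c": "2"},
    "L": {"a": "1", "b": "1", "c": "3"},   # branch2 only touched file c
}
branches = {"branch1": "J", "branch2": "L"}

def rebase(branch, upstream, todo):
    """Copy each commit in todo onto upstream, then move branch to the last copy."""
    head = branches[upstream]                  # detached HEAD at the upstream tip
    for commit in todo:                        # oldest first
        base = snapshots[parents[commit][0]]   # the original commit's parent
        # "ours" is wherever HEAD is now; "theirs" is the commit being copied
        merged, conflicts = three_way_merge(base, snapshots[head], snapshots[commit])
        if conflicts:
            raise RuntimeError(f"conflict while copying {commit}: {conflicts}")
        copy = commit + "'"
        snapshots[copy], parents[copy] = merged, [head]
        head = copy
    branches[branch] = head                    # yank the name over to the copies
    return head

rebase("branch1", "branch2", ["I", "J"])
print(branches)
print(snapshots["J'"])
```

Each pass through the loop is a separate merge with its own base, which is why a rebase of many commits has many chances to conflict where a single merge has one. The originals I and J remain in the tables afterwards; only the name branch1 has moved.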
Conclusion
The bottom line, as it were, is that git merge merges. This means: make one new commit, by combining work, and tie that one new commit back to both existing chains of commits. But git rebase copies commits. This means: make many new commits, by copying those old commits; the new commits live elsewhere in the commit graph, and have new snapshots, but re-use the old commits' author names, author date stamps, and commit messages; and once the copying is done, yank the branch name off the old commits and paste it onto the new ones, abandoning the old commits in favor of the new and improved ones.
This "abandoning" is what people mean when they say that rebase rewrites history. History, in a Git repository, is the commits in the repository. They're numbered, by hash IDs, and two Git repositories have the same history if they have the same commits. So when you copy old commits to new-and-improved ones, abandoning the old ones, you need to convince the other Git repositories to also abandon those old commits in favor of the new ones.
That—convincing other users with their Git repositories—can be easy or hard. It's easy if they all understand this in the first place and have agreed to do this in advance. Merging, on the other hand, does not throw away old history in favor of new-and-improved history: it just adds new history that refers back to old history. Git can easily add new history: that's how Git is built, after all.