Note: this answer got too long so I've split it into two parts.
You have encountered a basic stumbling block for those who try to use Git without a tutorial, or with a bad tutorial (of which there are, unfortunately, many). Your mistake here is thinking that "merge", in Git, involves two versions of some file. This is not the case! Merging in Git is on a commit basis and involves three versions of each file.
Before we can get into merging, we have to start with commits. Without the proper base,1 merging won't make any sense. So let's jump into that first. Note that you may want to read this twice before you tackle your merge.
1There's a pun here that will fly right over your head if you're not already familiar with the ideas behind merging.
Commit = snapshot + metadata
Git is, at its heart, all about commits. Git is not really about files or branches. It's true that each commit stores files, and we (or Git at least) organize commits into branches, using branch names to help us find individual commits. But it's the commit itself—or the collection of commits—that's the heart of the repository.
Git stores these commits in an object database. We won't go into most of the details here, though we will skim a few necessary parts, but if you are curious, you can peek inside the hidden .git
folder, where you will find a sub-folder named objects
. Inside this are potentially many more folders, which store the objects in various forms ("loose" and "packed" but you don't need to care about this here). Each object, including each commit, is numbered, with a unique ID that is expressed in hexadecimal, such as d420dda0576340909c3faff364cfbd1485f70376
. (This particular one is a commit in the Git repository for Git.)
This hash ID is the "true name" of the internal object, and Git literally requires it to find the object in the objects database. These names are not friendly to humans, though, so Git provides a separate secondary database—one that's not nearly as well organized, nor as well implemented, really—that stores names: branch names, tag names, and all other kinds of names. Each of these names simply stores one hash ID, which is in fact all that's necessary.
So, when you use a branch name like main
or feature
or whatever, you're really providing Git with a raw hash ID, that's been hidden behind this name. It's worth running git rev-parse
a few times to get a feel for this mechanism:
git rev-parse main
might produce:
d420dda0576340909c3faff364cfbd1485f70376
for instance, if you have a clone of the Git repository for Git (though by now its main/master has moved on; I haven't updated my clone in more than a week now for home-life reasons).
When you clone a Git repository, it's the underlying objects that you are copying. The names in your names database are yours, not the original clone's, but the objects, which are all strictly read-only once they're created, get shared. You get a copy but you (and your Git software) are forbidden from changing any of these. Not even Git can change a Git commit. You just add new commits to the repository. That's how history exists: the existing commits continue to exist. They're just now in two repositories: the original, and your clone. Make more clones, and you make yet more copies of the objects.
With that said, let's look at the anatomy of one particular commit, in this case d420dda0576340909c3faff364cfbd1485f70376
:
$ git cat-file -p d420dda0576340909c3faff364cfbd1485f70376 | sed 's/@/ /'
tree 13b45e4ccc34572dce66dc79468b66c0b383a560
parent c68bd3ec22a1afc85b0b897834b2524aedbd0553
author Junio C Hamano <gitster pobox.com> 1665507772 -0700
committer Junio C Hamano <gitster pobox.com> 1665509772 -0700

The second batch

Signed-off-by: Junio C Hamano <gitster pobox.com>
That, right there, is the entire commit object, as seen in every Git clone of the Git repository for Git. You can see that commits are pretty small! But each one has a tree
object: that first line, tree
followed by an internal object ID, is required in every commit.
The tree object in the commit represents a permanent archive of every file. This is another object (and you can git cat-file -p
it, if you like, to see how it works in great detail), but what we see in the output above is the metadata for the commit:
- the commit has a parent, with a raw hash ID: this is another commit object;
- the commit has an author and a committer: these are text strings giving the name and email address of the person who made the commit, along with some date-and-time stamps; and
- the commit has a log message, which is what you see when you run
git log
.
The git log
command uses the stored hash ID here to find the previous commit. Most commits have exactly one parent
line, but a few have more than one parent, making them merge commits, and at least one commit in any non-empty repository has no parent
because it was the first commit.
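To make the "snapshot + metadata" layout concrete, here is a minimal Python sketch that splits commit text like the above into its parts. It assumes the usual layout (header lines, then a blank line, then the log message) and ignores multi-line headers such as signatures; the function name parse_commit is my own, not anything Git provides.

```python
def parse_commit(raw: str) -> dict:
    """Split `git cat-file -p <commit>` style text into headers and message."""
    # Headers and log message are separated by the first blank line.
    header_text, _, message = raw.partition("\n\n")
    commit = {"tree": None, "parents": [], "author": None,
              "committer": None, "message": message}
    for line in header_text.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)  # zero, one, or several parents
        elif key in ("tree", "author", "committer"):
            commit[key] = value
    return commit


raw = (
    "tree 13b45e4ccc34572dce66dc79468b66c0b383a560\n"
    "parent c68bd3ec22a1afc85b0b897834b2524aedbd0553\n"
    "author Junio C Hamano <gitster pobox.com> 1665507772 -0700\n"
    "committer Junio C Hamano <gitster pobox.com> 1665509772 -0700\n"
    "\n"
    "The second batch\n"
    "\n"
    "Signed-off-by: Junio C Hamano <gitster pobox.com>\n"
)
commit = parse_commit(raw)
print(commit["tree"])          # the snapshot
print(len(commit["parents"]))  # 1: an ordinary, non-merge commit
```

A commit with two parent lines would come back with two entries in the parents list, which is all that makes it a merge commit.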
We say that the commit points to its parent or parents, and if we use single uppercase letters to stand in for real hash IDs—which are big, ugly, and random-looking2—we get drawings that look like this:
... <-F <-G <-H <-- main
Here, the name main
, a branch name, points to (contains the hash ID of) commit H
. Commit H
itself contains, indirectly, a full snapshot of every file—that's the tree <hash>
line—and directly contains metadata, including a parent
line. So commit H
points to earlier commit G
.
Commit G
, being a commit, contains a snapshot and metadata, so G
points to earlier commit F
. Commit F
is a commit, so it points to a still-earlier commit, which points back to an even-earlier commit, and so on, backwards, down the line to the very first commit ever (presumably commit A
in our drawing).
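The backwards walk can be sketched in a few lines of Python. This is a toy model, not how Git stores anything: single letters stand in for hash IDs, the commits between A and F are left out, and the names commits, refs, and walk are made up for illustration. It also follows only the first parent, whereas the real git log visits every parent of a merge.

```python
# Toy commit graph: each (hypothetical) ID maps to its list of parent IDs.
commits = {
    "A": [],      # the very first commit has no parent
    "F": ["A"],   # shortened: the commits between A and F are omitted
    "G": ["F"],
    "H": ["G"],
}
# The separate "names" database: each name holds exactly one ID.
refs = {"main": "H"}


def walk(refs, commits, name):
    """Yield commit IDs from the branch tip backwards, first parent only."""
    current = refs[name]
    while current is not None:
        yield current
        parents = commits[current]
        current = parents[0] if parents else None


print(list(walk(refs, commits, "main")))  # ['H', 'G', 'F', 'A']
```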
2They're actually not random at all, and concretely, the hash ID d420dda0576340909c3faff364cfbd1485f70376
is simply the SHA-1 checksum of the content of the above commit, except that this content is prefixed by commit 284
and an ASCII NUL byte, with 284
being the decimalized size of the rest of the object. The fact that the previous hash IDs on the parent
lines and the date-and-time-stamps are themselves unique means that the new hash ID is unique.3
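If you want to check this recipe yourself, a small Python sketch will do, assuming a SHA-1 repository. The helper name git_object_id is mine. I use an empty blob here because its ID is the same in every SHA-1 repository, so it is easy to verify.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash `body` the way Git names an object of type `obj_type`."""
    # The header is "<type> <decimal size>" followed by one NUL byte.
    header = obj_type.encode() + b" " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# An empty file ("blob") always gets this ID.
print(git_object_id("blob", b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Feeding the exact 284 bytes of the commit (with the real email addresses, not the sed-mangled ones) and the type "commit" to the same function should reproduce the commit's hash ID.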
3Anyone familiar with the pigeonhole principle should immediately object here. That objection is correct and means that Git will eventually fail. You can calculate the probability of failure with a fancy formula, and it turns out to be vanishingly small until the objects database holds more than about 1.7×10^15 objects, at which point it starts to creep up towards the probability of undetected disk-drive errors. We live with those; we can live with SHA-1 collisions. Even so, Git is slowly moving towards SHA-256.
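The "fancy formula" is not spelled out above, so as an assumption I'll use the standard birthday-bound approximation, p ≈ n² / (2 · 2^160) for n objects and a 160-bit hash. A quick Python sketch of the arithmetic:

```python
def collision_probability(n_objects: float, hash_bits: int = 160) -> float:
    """Birthday-bound approximation, reasonable only while the result is tiny."""
    return n_objects ** 2 / (2.0 * 2.0 ** hash_bits)


p = collision_probability(1.7e15)
print(f"{p:.1e}")  # 9.9e-19
```

So at roughly 1.7 quadrillion objects the estimated chance of any accidental collision is on the order of one in 10^18.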
A special feature of a branch name
We're going to skip a lot here and just look at one special feature of branch names. We already know that the name points to a commit. We can also draw this:
...--G--H <-- main, feature
where we have a single commit, H
, that is pointed-to by more than one branch name. When this is the case, all the commits up to and including H
are on both branches. Checking out either branch, with git switch main
or git switch feature
, gets us the files from commit H
. But, as a special feature, checking out that name "attaches" the special name HEAD
to that name:
...--G--H <-- main (HEAD), feature
Here, we're on commit H
and branch main
. The files we have available to us are those from commit H
. If we now run:
git switch feature
the picture changes slightly:
...--G--H <-- main, feature (HEAD)
We're still on commit H
, but we're "on" it through the name feature
. We have the same files, but something wacky is about to happen.
Let's make a new commit now, in the usual way Git has for making commits (we modify some files and git add
and git commit
). We get a new commit, which gets a new, unique ID; we'll just call this I
and draw it in:
          I
         /
...--G--H
What happens to the branch names? The answer is: nothing happens to any of the ones that we are not "on", but the one that we are "on", that name gets forcibly updated so that it points to the new commit we just made:
          I <-- feature (HEAD)
         /
...--G--H <-- main
If we make another new commit, we get:
          I--J <-- feature (HEAD)
         /
...--G--H <-- main
Commits I-J
are clearly "on" the feature
branch. Commits up through H
are clearly "on" the main
branch. Surprisingly—or not, depending on your point of view and whether you've used other version control systems—Git declares that commits up through H
are on both branches at this time.
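The rule "only the name HEAD is attached to moves" is small enough to model directly. This Python sketch is a toy with made-up names (Repo, switch, commit) and letters for hash IDs; it does not model files, the index, or anything else Git does while committing.

```python
class Repo:
    """Toy model: commits, branch names, and which name HEAD is attached to."""

    def __init__(self):
        self.commits = {"G": [], "H": ["G"]}      # ID -> parent IDs
        self.refs = {"main": "H", "feature": "H"}  # name -> one ID
        self.head = "main"                         # HEAD is attached to a name

    def switch(self, name):
        if name not in self.refs:
            raise KeyError(name)
        self.head = name

    def commit(self, new_id):
        parent = self.refs[self.head]
        self.commits[new_id] = [parent]   # the new commit points back to the old tip
        self.refs[self.head] = new_id     # only the attached name moves


repo = Repo()
repo.switch("feature")
repo.commit("I")
repo.commit("J")
print(repo.refs)  # {'main': 'H', 'feature': 'J'}
```

The name main never moved, which is exactly why commits up through H are still reachable from both names.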
Forming more branches, and a side note on the word branch
Let's now switch back to main
, with git switch main
:
          I--J <-- feature
         /
...--G--H <-- main (HEAD)
Git will rip away our commit-J
files and put back the commit-H
files. The "true" files are safely saved away in the tree
objects for each commit. The files we see and work with are in fact not in the repository at all, they're just copied out of the repository.
We can now create and switch to another name, perhaps dev
or re-feature
. Or if we like, we can commit on main
. It doesn't really matter to Git, as the branch name is merely a sort of label, pointing to the commit. Later, if we decide we want to have main
stay with commit H
, we can make a new name that points to the new commits we're about to make, and then force the name main
back to commit H
.
This is a key weirdness in Git: branch names aren't really very important at all, except in that they help us by finding the last commit. Whatever commit the name points to is the last commit in the branch. If we move the name, we've changed which commit is the last commit in the branch. This is also why we say that some commit is "on" a branch if we can get there by starting at the last commit (found by the name) and working backwards. In effect, commits are "contained in" their branches more than they are "on" any one (single) branch.
Rather than mucking about with moving main
forward and backwards, let's do:
git switch -c feat2
now, to get:
          I--J <-- feature
         /
...--G--H <-- main, feat2 (HEAD)
and make two more commits:
          I--J <-- feature
         /
...--G--H <-- main
         \
          K--L <-- feat2 (HEAD)
Now, just for the heck of it, let's delete the name main
. (It's kind of in the way of what we're about to do.) This gives us:
          I--J <-- feature
         /
...--G--H
         \
          K--L <-- feat2 (HEAD)
Commit H
, which was on three branches, is now only on two branches. The branch main
has ceased to exist. Or has it? If we create a new name, old-main
, pointing to H
:
          I--J <-- feature
         /
...--G--H <-- old-main
         \
          K--L <-- feat2 (HEAD)
commit H
is now on three branches again.
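Counting the branches a commit is "on" is just a reachability question, which a short Python sketch can show. As before, the letters, the dictionaries, and the function names are stand-ins I made up; the real command that answers this question is git branch --contains.

```python
commits = {
    "G": [], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
}
refs = {"feature": "J", "feat2": "L", "old-main": "H"}


def ancestors(commits, start):
    """Return `start` plus every commit reachable by following parents."""
    seen, stack = set(), [start]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(commits[current])
    return seen


def branches_containing(commits, refs, commit):
    """List the names whose tip commit can reach `commit`."""
    return sorted(name for name, tip in refs.items()
                  if commit in ancestors(commits, tip))


print(branches_containing(commits, refs, "H"))  # ['feat2', 'feature', 'old-main']
print(branches_containing(commits, refs, "I"))  # ['feature']
```

Deleting or adding an entry in refs changes the answer without touching a single commit.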
The number of branches that some commit is "on" is not important. The word branch in Git is rather badly overloaded (see What exactly do we mean by "branch"?) and when you see it without context, you should be careful: it may not mean anything, and the person who used it might not be aware of what they're saying. Or it may mean any of the various things that it can mean, such as "branch name", "remote-tracking name", "tip commit", and "set of commits ending at a particular commit".
Onward: git merge
With all that out of the way, let's change our two feature
names to br1
and br2
so that they're easier to type in, and switch to br1
and run git merge br2
now:
          I--J <-- br1 (HEAD)
         /
...--G--H
         \
          K--L <-- br2
We seem to have asked Git to merge "branch br2
" into "branch br1
". And that's sort of true. But in fact, what we're really asking Git to merge are commits, namely J
and L
. As always, each of these two commits represents a snapshot-plus-metadata.
Because we are "on" br1
, the files we have available to us, before we run git merge br2
, are those from commit J
. The files we're asking Git to merge exist in br2
, i.e., in the tip commit of br2
, i.e., in commit L
. But in order to perform the merge, Git cannot simply compare the files in J
to the files in L
. You might wonder why not, and we'll provide a simple example, but before we get there, let's consider the goal of a merge in the first place.
The goal of a standard git merge
operation is to combine work. If we are going to combine work, we first have to define work. What is the work in a commit?
The work in an individual commit
Let's take a really concrete example: commit d420dda0576340909c3faff364cfbd1485f70376
. If you click on this link, or clone the Git repository for Git and run git show d420dda0576340909c3faff364cfbd1485f70376
, you get a bunch of text shown that includes this:
diff --git a/Documentation/RelNotes/2.39.0.txt b/Documentation/RelNotes/2.39.0.txt
index a26c82444b..a6ee7c8996 100644
--- a/Documentation/RelNotes/2.39.0.txt
+++ b/Documentation/RelNotes/2.39.0.txt
@@ -66,9 +66,25 @@ Fixes since v2.38
led to a segfault (which is bad), which has been corrected.
(merge 92481d1b26 js/merge-ort-in-read-only-repo later to maint).
+ * Force C locale while running tests around httpd to make sure we can
+ find expected error messages in the log.
+ (merge 7a2d8ea47e rs/test-httpd-in-C-locale later to maint).
[snip]
This is obviously some git diff
output. It shows us a change to one file, Documentation/RelNotes/2.39.0.txt
. But if a commit is a snapshot (plus metadata)—and it is—how can there be a change to a file? The snapshot is just an archive, like a zip file or tarball or whatever. The answer is that this particular commit has a (single) parent commit, whose hash ID we see above. If we have Git extract both of these commits and compare them, we'll find that this file, Documentation/RelNotes/2.39.0.txt
, is different in the two commits—and the git diff
output from comparing these two commits is just what we see with git show
, or on the GitHub page.
Work done over multiple commits
So the difference between some commit and its parent represents the work done in that particular commit, and that's the definition we will start with. Let's go back to our simple stylized graph:
          I--J <-- br1 (HEAD)
         /
...--G--H
         \
          K--L <-- br2
and look at the work done in, say, commit I
. We'll find this "work" by comparing the snapshot in H
to the snapshot in I
. Maybe we added one new file, new.txt
, and did nothing else. We can look at the work done in commit J
too: maybe we edited old.txt
and README.md
. The changes to those two files are the work done in J
.
Now, what about commits K
and L
? Maybe in commit K
we modified foo.py
, and in commit L
we made another change to foo.py
and also modified README.md
.
If we compare commit J
with commit L
, then, we'll see the following:
- delete
new.txt
(it's in J
, where we added it because of I
, but it is not in L
);
- undo the change we made to
old.txt
(we did that in J
);
- modify
foo.py
(we made two changes to that in K
and L
), and modify README.md
to add whatever we did in L
, and to undo whatever we did in J
.
This is clearly no good! We don't want to remove new.txt
at all. Maybe we can compare in the other direction, so that we add new.txt
. But we already have a new.txt
, and this comparison will tell us to undo whatever changes we made in foo.py
. That, too, is no good.
No: What we need is to identify the work done on br1
first, separately. We can do that pretty easily by comparing the snapshot that's in H
to the one that's in J
. That will show us:
- the new file added (new.txt); and
- the changes we made to old.txt and README.md.
That's the "work done in br1
", or more precisely, the work done in commits I-J
.
Once we have that, we can try to identify the work done on br2
. We can do that pretty easily too, by comparing the snapshot that's in H
to the one that's in L
. That will show us:
- the two changes made to
foo.py
; and
- the change we made to
README.md
.
That's exactly what we want! But hang on a moment:
- Why did we go back to commit
H
? Why not back to commit G
?
- How did we pick commit
H
in the first place?
- How do we combine these changes?
The answer to the first two questions is git merge-base
.
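The three comparisons discussed above (H to J, H to L, and the unhelpful J to L) can be played out with a toy model in which a snapshot is just a mapping from path to content. The file contents below are invented placeholders, and diff here only classifies whole files; real git diff also shows line-level changes.

```python
def diff(old, new):
    """Classify each path as added, deleted, or modified between two snapshots."""
    changes = {}
    for path in sorted(set(old) | set(new)):
        if path not in old:
            changes[path] = "added"
        elif path not in new:
            changes[path] = "deleted"
        elif old[path] != new[path]:
            changes[path] = "modified"
    return changes


# Hypothetical snapshots for the merge base H and the two tip commits.
H = {"old.txt": "old v1", "README.md": "readme v1", "foo.py": "foo v1"}
J = {"old.txt": "old v2", "README.md": "readme (J edit)", "foo.py": "foo v1",
     "new.txt": "new"}
L = {"old.txt": "old v1", "README.md": "readme (L edit)", "foo.py": "foo v3"}

print(diff(H, J))  # work done on br1
print(diff(H, L))  # work done on br2
print(diff(J, L))  # the misleading direct comparison
```

The direct comparison reports new.txt as deleted and old.txt as modified, even though nobody on br2 touched either file.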
The merge base
If we just look at the picture we drew:
          I--J <-- br1 (HEAD)
         /
...--G--H
         \
          K--L <-- br2
it's stunningly obvious why we picked commit H
. Commit H
is on both branches. So is commit G
, of course, and so are all the commits before G
, but commit H
is the last of these shared commits. Going further back in time, to an even-earlier shared commit, nets more overall changes, where "both sides" will make the same changes, and there's no profit in doing that. So commit H
, the last shared commit, is also the best shared commit.
There are cases where the best shared commit is not obvious at all, and there are ways to draw the graph that make it less obvious that H
is the best shared commit. Git has, built into it, an implementation of the Lowest Common Ancestor algorithm (as extended to DAGs like Git's commit graph), so that git merge-base
can find the shared commit, and git merge
uses that by default.4 We thus usually don't have to think about this: we just run git merge br2
and Git finds the best shared commit, in this case H
, and does its thing.
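For intuition only, here is a brute-force Python sketch of "find the best shared commit": collect the commits reachable from both tips, then drop any shared commit that is an ancestor of another shared commit. This is my simplification, not the algorithm Git actually implements, and the names are made up. It returns a list because, as footnote 5 notes, there can be more than one best candidate.

```python
commits = {
    "G": [], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
}


def ancestors(commits, start):
    """Return `start` plus every commit reachable by following parents."""
    seen, stack = set(), [start]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(commits[current])
    return seen


def merge_bases(commits, a, b):
    """Shared commits that are not ancestors of any other shared commit."""
    shared = ancestors(commits, a) & ancestors(commits, b)
    return sorted(
        c for c in shared
        if not any(c != other and c in ancestors(commits, other)
                   for other in shared)
    )


print(merge_bases(commits, "J", "L"))  # ['H']
```

G is shared too, but it is reachable from H, so it is discarded as the "further back in time" choice.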
Not having to think and worry about it, though, does not mean we can ignore it. We must realize and remember that when we run git merge
, Git is going to find a merge base commit.5 This merge base supplies the third version of each file.
4I've mentioned this "by default" a few times, and that is because git merge
allows you to specify a merge strategy. There's a merge strategy, -s ours
, that means ignore everything they did. This strategy doesn't bother finding a merge base at all. There are some other fancier strategies that do complicated stuff, but we won't cover those here.
Git's "merge strategy" -s
argument should not be confused with Git's strategy-option, or -X
, arguments to git merge
. I like to call these eXtended options to keep them apart in my head. The -X
options are passed to the strategy, which then does whatever it does with them. Since we almost always use the default strategy in the first place, most people seem to think of the -X
extended options as options to git merge
, but they're actually specific to the strategy. (This is all very confusing, and perhaps was a bad idea, rather like Douglas Adams' famous quote: "In the beginning the Universe was created. This has made a lot of people very angry and been widely regarded as a bad move.")
5There are particularly nasty cases where there is more than one "best" merge base commit, and for these cases, Git defaults to merging the merge bases to come up with a "virtual merge base". With any luck, you will never experience the wonders, er, horrors, er, terror ... this particular case yourself. Seriously, it's not usually terrible, but occasionally, it is really awful and ugly. You can check to see if this will happen or has happened using git merge-base --all
: if that spits out more than one hash ID, you've hit the multiple-merge-bases case, and it's time to find a helpful StackOverflow article.
How Git handles three versions of each file
Now that we know that git merge br2
really uses three versions of each file:
          I--J <-- br1 (HEAD)
         /
...--G--H
         \
          K--L <-- br2
with the merge base version of each file coming from commit H
, we can see how Git uses these.
Let's start with the easy cases, with our hypothetical example. Here, we modified foo.py
in br2
—once in each new commit—but we didn't touch foo.py
at all in our br1
commits. So the diff from H
to J
shows nothing for this file, while the diff from H
to L
shows the two changes.
To combine nothing with something, Git takes the something. That was easy! Git then applies the "something" to the copy from commit H
, which is also the copy from commit J
, which is also the copy we have sitting in our working tree. So the result is that the changes we made in K
and L
show up in our working tree.
Let's take another easy case: the copies of the file NOTES.md in all three commits match. To combine nothing at all on the left (in our changes on I-J
), and nothing at all on the right (in their changes on K-L
), Git does nothing: it takes any of the three copies of NOTES.md
, from any of the commits, such as the one we already have in our working tree, and just leaves it alone.
Let's take a third easy case: the file new.txt
does not exist in the merge base commit H
, and does not exist in commit L
, but is there in commit J
and in our working tree. Git combines the "create" with the "do nothing" to create the file, i.e., leave it in our working tree (and in Git's index, about which we'll say more in a moment).
Had we created a new file in "their" commits (K-L
), or deleted a file on either side, or whatever, Git would take that change—create new file, or delete file—and copy that across. Any time one side does something and the other side doesn't do something, we take the "something".
This leaves us with the hard case, or at least, the potentially hard case: we did something, and they did something, to the same file.
Combining changes to one file, and how Git expands its index
There's an important thing we have not mentioned at all here yet, and it's a big topic that I won't really cover properly, which is this: Git does not make commits from what's in your working tree. Git makes commits from what is in Git's index. This thing, this "index", is crucial in Git because Git uses it to make new commits. It's so important, and perhaps so poorly named (what the heck does index mean anyway?), that it actually has three names:
- when called "the index", as I do here, we can refer to everything it does;
- when called "the cache", as Git mostly only does in flags now (
git rm --cached
), it refers to how Git uses it to speed stuff up; and
- when called the staging area, which is perhaps the best name, Git is describing how you use it: to "stage" the next commit.
What's in the index or staging area is, to put it briefly and gloss over some details, a sort of a copy of each file that is going to be committed if you run git commit
right now. That is, when you first switch to some commit, Git not only extracts that commit's files to your work area, so that you can see them and work on them. Git also extracts the same files to Git's index aka staging area, so that they're all staged to go into the next commit in exactly the same form they have in this commit.
The existence of this index / staging-area is why you have to run git add
every time you change a file. The git add
command tells Git:
- open and read the working-tree copy of the file;
- compress the data down to the internal form for a loose object;
- check to see if we already have the file data (i.e., is this a duplicate?);
- make either the original object (if duplicate), or this now-prepared object, ready for committing.
So we will re-use the original if this is a duplicate. Otherwise, the next commit will use this data, which has never been committed before. Either way, at the end of git add
, the working tree version of the file is now in Git's index, ready to be committed.
The de-duplication that happens in this step is a big part of how Git keeps the commits from bloating up the repository, even though every commit stores every file every time. Most commits are mostly duplicates, and these duplicate copies take no space because they're de-duplicated. It's during git add
, not git commit
, when Git actually does the duplication checking and de-duplicating. So if you don't force Git to re-add every file every time,6 git add
and git commit
go very fast.
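The de-duplication idea can be sketched with a toy content-addressed store in Python. Everything here is illustrative: the ObjectStore class, its writes counter, and the index dictionary are made up, and this sketch hashes first and compresses only when the object is new, which is my own ordering choice rather than a claim about Git's internals.

```python
import hashlib
import zlib


class ObjectStore:
    """Toy object database keyed by the SHA-1 of header plus content."""

    def __init__(self):
        self.objects = {}  # hash ID -> compressed bytes
        self.writes = 0    # how many objects were actually stored

    def add_blob(self, data: bytes) -> str:
        payload = b"blob " + str(len(data)).encode() + b"\0" + data
        oid = hashlib.sha1(payload).hexdigest()
        if oid not in self.objects:  # duplicate content costs nothing extra
            self.objects[oid] = zlib.compress(payload)
            self.writes += 1
        return oid


store = ObjectStore()
index = {}  # path -> hash ID staged for the next commit
index["README.md"] = store.add_blob(b"hello\n")
index["copy.md"] = store.add_blob(b"hello\n")   # same content, same object
print(store.writes)  # 1
```

Two paths are staged, but only one object exists, because the name of an object depends only on its content.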
Now, this is the normal condition of the index, when we're not in the middle of a merge. But when you run git merge
, Git expands the index, creating three extra "slots" for each file:
- slot zero, if it's used, is the normal ready-to-commit copy: if this index entry is occupied, the other three slots are erased and the file is not conflicted;
- slots 1, 2, and 3 hold the merge base, "ours", and "theirs" copy of the file: if any of these slots are occupied, slot zero is erased and the file is conflicted.
Hence the way git merge
works—at least from a high level viewpoint7—is this:
- the "ours" copy of each file moves from slot zero to slot 2;
- the merge base copy of each file goes into slot 1; and
- the "theirs" copy of each file goes into slot 3.
At this point, we have all three slots filled for any file that appears in all three commits. Now we just take care of each possible case:
- All three slots hold the same copy of the file: nobody touched it at all, just use any copy, collapse it all down to slot zero.
- Slots 1 and 2 match and slot 3 is different, or slots 1 and 3 match and slot 2 is different: we or they touched the file, and they or we didn't, so take the modified file, whichever slot that's in, and move that to slot zero and erase the other two.
- Slot 1 is empty, slot 2 is occupied, slot 3 is empty; or slot 1 is empty, slot 2 is empty, and slot 3 is occupied: we or they added the file. Put the non-empty entry into slot 0 and erase all the others.
- Slot 1 is not empty, and matches what's in 2 or 3, but the other of 2 or 3 is empty: one of us removed the file and the other of us didn't, so remove the file, by removing all the entries (no slots left at all).
Some of the remaining cases are messy and I leave it as an exercise to work them out (consider, e.g., "slot 1 empty, slots 2 and 3 both have files in them", which may be an add/add conflict). The usual hard case is the one where slots 1, 2, and 3 are all occupied with different copies of the file: that's your standard "merge conflict".
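The slot cases above boil down to a few comparisons, shown here as a Python sketch. It is a simplification under my own naming: None stands for an empty slot, "clean" means the entry collapses to slot zero (or disappears entirely when the result is None), and "hard" covers everything left over, without attempting the line-by-line combining Git tries next.

```python
def resolve(base, ours, theirs):
    """Decide one file's fate from slots 1 (base), 2 (ours), 3 (theirs)."""
    if ours == theirs:
        return ("clean", ours)    # untouched by both, or both made the same change
    if base == ours:
        return ("clean", theirs)  # only they changed, added, or deleted it
    if base == theirs:
        return ("clean", ours)    # only we changed, added, or deleted it
    return ("hard", None)         # both sides did something different


print(resolve("v1", "v1", "v1"))    # ('clean', 'v1')
print(resolve("v1", "v1", "v2"))    # ('clean', 'v2')
print(resolve(None, "new", None))   # ('clean', 'new')
print(resolve("v1", "v1", None))    # ('clean', None)
print(resolve("v1", "v2", "v3"))    # ('hard', None)
```

Note that "slot 1 empty, slots 2 and 3 both occupied" lands in the hard bucket only when the two added copies differ; identical additions fall into the first branch.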
6Note that running git add .
makes Git check for changes via various magic OS-dependent file-system tricks, which is normally much faster than re-compressing every file. This is where that "cache" aspect of the index comes in. You have to defeat this cache trick to really see the timing difference, and these details are beyond the scope of this answer, though I'll note that the --renormalize
option is the mostly-portable way to mostly do most of this.
7The git merge
code takes care not to bother expanding the index for unconflicted files, as most files are mostly unconflicted and this makes everything go a lot faster. But that complicates the code a lot; the simplified view where we do the expand-then-combine-then-shrink is a whole lot easier to think about, and gives the same result, just a bit slower.
This also skips over the whole "renamed file" identification process, which is kind of tricky.
on to part 2