This answer is in two parts, because one is all background and one is the answer to the question you asked. You should probably read this part first, even though it's just background.
Background (Long)
I think that the Q-and-A pair in question here (the linked question plus its accepted answer) is just not very good. The tricky part here is that Git is both simpler and more complex than people think, and people get wrong ideas into their heads, which take a lot of work to get rid of and replace with the correct model.
The wrong model that people have in mind is that branches are somehow the thing in Git. But they're not: they're not the thing, whatever "the thing" may mean. The problems are that "branches"—whatever we may mean when we use that word loosely—are ambiguous, and when we mean "some subset of commits in a Git repository", they're simply a consequence. That is, these branches are like the fact that you have to breathe hard after sprinting. You don't sprint so that you'll have to breathe hard: you sprint to win a race, or to get exercise, or something along those lines. The breathing-hard part happens, but it wasn't the goal.
Similarly, branches (whatever we might mean by that) happen in Git because of the thing we—and Git—really do care about. "The thing", in this case, is the commit. Git is all about commits. Commits are the raison d'être for Git. As such, it's crucial to understand the following:
- A Git repository is a collection of commits. In fact, a repository is, at its heart, two databases. Both are simple key-value stores. One holds Git's internal objects, including the commits (which are the objects we humans will generally care about here), and the other holds names: branch names, tag names, and other names.
- Commits are numbered. Every internal Git object gets a number; commits in particular get a globally unique number, which we call a hash ID, or sometimes a Git OID (Object ID). In the past, Git called these SHA-1 hash IDs (because the current OIDs are in fact SHA-1 hashes), but Git is moving to SHA-256, a member of the SHA-2 family, because SHA-1 has been effectively broken.
- Each commit in turn stores two collections. We'll get back to this in a moment.
The fact that each commit has a totally unique number means that any two Git repositories, on contact with each other, can tell whether they contain the same commits just by looking at the numbers. Your Git software, working with your Git repository, can reach out to other Git software working with another Git repository: you might call this other Git origin, for instance. Your Git thus calls up the Git at origin and has them list out (some of) their commit hash IDs. If your Git has the same IDs, you and they have the same commits and you're in sync. If not, one of you has some commits that the other doesn't, and/or vice versa. Git is generally quite greedy for commits, so at this point one Git (the one with extra commits) will give the other Git the commits that the other database lacks. The receiving Git will add those commits to its collection, rather Borg-like. ("We will add your biological and technological distinctiveness to our own.")
The numbering system has a few consequences. One is that because the numbers are cryptographic digests, they're quite random-looking, and inhospitable to humans. Nobody can remember all the hash IDs. Fortunately we don't have to do that: the computer can do that, and the computer is good at that. The other is that because the commit's hash ID is a cryptographic checksum of the commit's content, no part of any commit can ever be changed.
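To make the "checksum of the content" idea concrete, here is a small sketch in Python. It relies on the fact that a SHA-1 Git object ID is the SHA-1 of a short header (the object type, the size in bytes, and a NUL byte) followed by the object's bytes. The function name git_object_id is my own invention for illustration, not anything Git provides, and I use blob (file) objects here only because they are the simplest to show.

```python
import hashlib

def git_object_id(kind: str, body: bytes) -> str:
    """Return the SHA-1 object ID for an object of the given kind and body."""
    # Header is "<kind> <size-in-bytes>" followed by a single NUL byte.
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

original = git_object_id("blob", b"hello world\n")
altered = git_object_id("blob", b"hello world!\n")

# One extra character in the content produces an unrelated-looking ID,
# which is why a stored object cannot be edited in place.
print(original)
print(altered)
```

Because the ID is computed from the bytes, "changing" an object really means creating a different object with a different ID; the old one stays exactly as it was.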
The simple part of Git
Every Git commit has two parts:
Each commit stores a full snapshot of every file. The files inside a commit are stored in a special, read-only, compressed and—important for various reasons—de-duplicated format. Because all parts of every commit are read-only, it's safe for any commit to share any of its file content with any other commit (or even other parts of the same commit). So no matter what you do with commits—e.g., add a million identical ones—you won't bloat up the repository with duplicated files, even though every commit stores every file in a logical sense.
This snapshot aspect of a commit means that it's easy to get every version of the stuff you've ever committed: just find the right hash ID of that commit and there are all your files, exactly as they were at the time you committed them. So everything is saved for all time, or at least, for as long as you can find the commits' hash IDs.
Separately from the snapshot, each commit stores metadata: information such as who made the commit—name and email address—and when, or why they made the commit (their log message: the meaningfulness of this depends on the human, so not every commit message has a good "why" in it).
Now we get to the sneaky tricks in Git. These are not complicated—not yet anyway—but they're the first key to understanding branching. In the metadata for any commit, Git stores a list of previous or parent commit hash IDs. This list is usually exactly one entry long, giving each commit a single parent hash ID. This kind of commit is an ordinary commit, the kind you make every day, and when we lay them out next to each other in the order you make them, with the latest at the right:
... <-F <-G <-H
we get a simple backwards-looking chain. Here H stands in for the real hash ID of the latest commit you just made. It has a snapshot and metadata, and in its metadata, commit H stores the raw hash ID of earlier commit G. Because Git has a simple key-value store, in which it can look up G's hash ID and obtain commit G, Git can actually work with both commit H and commit G "at the same time", as it were. We just have to give Git the hash ID of commit H.
Commit G, though, is an ordinary commit: it has a snapshot and metadata, and in its metadata, commit G stores the raw hash ID of earlier commit F. So Git can look up the actual commit itself, using just the hash ID of G to find G's metadata to find F's hash ID. So now Git has the G-and-F pair.
In other words, starting from H, Git was able to move back one to G, and from there, Git was able to move back one step again to F. Commit F is of course also an ordinary commit, with one parent, so Git can now move back one more step. Git can repeat this forever, or at least, until it gets back to the very first commit. This first commit can't point backwards, so it just doesn't:
A <-B ... <-G <-H
and if we have Git start at H and work backwards one hop at a time, Git eventually reaches commit A and stops there.
This is the history in the repository. The commits contain the snapshots; every commit stores every file (with de-duplication); and by moving backwards, one commit at a time, Git finds every commit in this simple linear chain. There's one big hitch though: we have to give Git the hash ID of commit H. How do we find that?
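The backwards walk can be sketched with an ordinary dictionary standing in for Git's object database. This is a toy model under my own assumptions: single letters stand in for real hash IDs, the chain is shortened to three commits, and the names commits and walk_back are made up for the example.

```python
# Toy object database: hash ID -> commit metadata.
commits = {
    "A": {"parents": [], "message": "very first commit"},
    "B": {"parents": ["A"], "message": "second commit"},
    "C": {"parents": ["B"], "message": "latest commit"},
}

def walk_back(db, start):
    """Yield hash IDs from `start` back to the first commit, one hop at a time."""
    current = start
    while current is not None:
        yield current
        parents = db[current]["parents"]
        # An ordinary commit has one parent; the very first commit has none.
        current = parents[0] if parents else None

print(list(walk_back(commits, "C")))
```

Note that the function needs a starting hash ID handed to it, which is exactly the hitch described above.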
Branch and other names
This is where branch names enter the picture. In Git, a branch name (or any other name, for that matter) just contains one hash ID. Assuming that's a commit hash ID,[1] that gives us, or Git, a last commit to start from. From there, Git can work backwards. Since commits point backwards to their parents, and that's the history in the repository, this is how Git finds history.
Note that if we have more than one branch name, we can have more than one "last commit". To illustrate that, suppose we have a chain of commits that ends at commit H:
...--G--H <-- main
We now create two more branch names, such as br1 and br2, both of which also point to H at the moment:
...--G--H <-- br1, br2, main
All the commits are on all three branches at this point. But as we make new commits, Git will move one (and only one) branch name "forward" while we do that. If we start with br1 and make a new commit I, it will point back to existing commit H and drag the name br1 forward:
          I   <-- br1
         /
...--G--H   <-- br2, main
When we make a second new commit we get:
          I--J   <-- br1
         /
...--G--H   <-- br2, main
The name br1 points to J; J points backwards to I; I points backwards to H; and so on. So by starting at br1, Git will find all the commits. Starting at br2 or main, Git will find only the commits that end at H. We have two branches, or is it three branches? That depends on what we mean by the word branch, doesn't it?
Anyway, suppose we now switch to using the name br2 and make two more commits. Now we'll have:
          I--J   <-- br1
         /
...--G--H   <-- main
         \
          K--L   <-- br2
We now seem to have three branches. We definitely have three branch names. Which branch(es) contain commits up through H? Git's answer is: all of them. In fact, we can safely delete the name main at this point, if we don't care to find H directly:
          I--J   <-- br1
         /
...--G--H
         \
          K--L   <-- br2
Now we only have two branches. We still have exactly the same commits though. The branches don't matter! It's only the commits that matter.
That doesn't mean that the branch names are useless, of course. We, and Git, use them to find last commits. If we want to find any particular given "last" commit quickly (e.g., if we want to count H as a "last" commit, even though it's also an intermediate commit), we'll need a name for it.
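Here is a rough sketch of the two-database idea: one table of commits, one table of names. It is a toy model, not how Git stores anything on disk; the letters stand in for hash IDs, commits before G are elided, and make_commit is a hypothetical helper of mine.

```python
# Commits: hash ID -> list of parent hash IDs (commits before G are elided).
commits = {"G": ["F"], "H": ["G"]}
# Names: branch name -> hash ID of that branch's last commit.
branches = {"main": "H", "br1": "H", "br2": "H"}

def make_commit(new_id, branch):
    """Add a commit whose parent is the branch's current tip, then move the name."""
    commits[new_id] = [branches[branch]]
    branches[branch] = new_id   # only this one name moves

make_commit("I", "br1")
make_commit("J", "br1")
make_commit("K", "br2")
make_commit("L", "br2")

count_before = len(commits)
del branches["main"]            # deleting a name deletes no commits
count_after = len(commits)

print(branches)
print(count_before, count_after)
```

Deleting main changes only the names table. Commit H is still there, still the parent of both I and K, and still reachable from either remaining name.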
[1] Non-branch-names can sometimes contain non-commit hash IDs. This is mainly a feature for making tags more useful. Branch names must always hold commit hash IDs.
The complicated parts of Git
I said above that commits are read-only snapshots with metadata. This is true: the files inside a commit literally cannot be changed, and furthermore, they are in a format that other (non-Git) programs cannot even read. You literally can't do any work with these files! But we need to get work done, so how do we do that?
Git's answer is: you don't work on, or with, the committed files. Instead, Git extracts a commit into a work area, where you actually do your work. What this means is that, literally, the files you work with in Git are not in Git. They get copied out of Git and you work on and with the copies.
When you go to make a new commit, you still don't use these files directly. Instead, Git has stored what amount to copies of these files,[2] ready to go into a new commit. This extra copy of each file occupies something for which Git has three names: the index, the staging area, or (rarely these days) the cache. All three names refer to this same thing, which I like to describe as your proposed next commit.
This explains what git checkout or git switch is doing. When we use either of these commands and give it a branch name, we're really picking the commit we'd like to extract. For instance, if we have:
          I--J   <-- br1
         /
...--G--H   <-- develop, main
         \
          K--L   <-- br2
and we run:
git switch main
we are telling Git that we want to start working on / with the files that are in commit H. Git should now:
- erase, from our work area and proposed next commit, the current files that are there from a previous checkout;
- install, into our work area and proposed next commit, the files from commit H.
To remember which branch name we're using, we'll update our drawing like this:
          I--J   <-- br1
         /
...--G--H   <-- develop, main (HEAD)
         \
          K--L   <-- br2
The special name HEAD, in all uppercase, is attached to just one branch name. That's the branch name of the branch we are "on". So if we were on br1:
          I--J   <-- br1 (HEAD)
         /
...--G--H   <-- develop, main
         \
          K--L   <-- br2
and are now on main, Git has swapped out all the commit-J files for all the commit-H files.
There are some special cases here. Sometimes Git doesn't have to switch out the files. Suppose that we were on develop, which means "commit H", when we ran git switch main to switch to commit H. We're telling Git to switch from H to H. That's not really much of a switch, is it? In this case Git doesn't have to change out any files, and so it just doesn't bother.
This case becomes important if we don't have a develop yet. Suppose we're on main:
          I--J   <-- br1
         /
...--G--H   <-- main (HEAD)
         \
          K--L   <-- br2
and we start changing a bunch of files. Then we realize: Hey, wait, I meant to do this work on a new branch. We can create a new branch and switch to it right now, and as long as the new branch also means "commit H" right now, that switch is totally free, because Git won't need to swap out any files. We can leave our partially-completed work just partially-completed, creating a new branch name test2 for instance:
          I--J   <-- br1
         /
...--G--H   <-- main, test2 (HEAD)
         \
          K--L   <-- br2
If and when we eventually make a new commit (let's call it N) we'll get:
          I--J   <-- br1
         /
...--G--H   <-- main
         \__
          \ `--N   <-- test2 (HEAD)
           \
            K--L   <-- br2
New commit N will point back to old commit H as its parent, because we made commit N from commit H.
This index or staging area (the extra copies of each file that make up the proposed next commit) explains why you have to run git add. When you do run git add, Git will:
- read the working tree copy;
- compress it into Git's internal format; and
- check for any existing (duplicate) copy.
If there's some existing copy, Git can discard the compressed version it just made, and use the duplicate. If not, Git will arrange for the new compressed version to go into the repository if and when we finally do commit it.[3]
Although this is a bit complicated, the parts to memorize aren't that bad:
- You don't work on committed files. You work on copies of them. Git extracts the copies from some existing commit.
- The files you do work on are not in Git. They're copies.
- Until you run git add on them, Git doesn't even care if files have been updated. You should run git status often enough to see which files you haven't yet git add-ed.
- The git add step means make the index / staging copy match the working tree copy. That is, it updates your proposed next commit.
- When you run git commit, Git makes the new commit's snapshot from the proposed next commit. This is why you have to git add: to update the proposed next commit, so that git commit will commit that.
This also leads to a proper understanding of git status, but we'll come back to that in a moment.
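The points above can be sketched as a toy model with three separate places a file can live: the working tree, the index, and the commits. This is only an illustration under my own assumptions. The names add, commit, objects, index and work_tree are mine, a "commit" here is just a frozen copy of the index, and real Git also compresses data and records metadata.

```python
import hashlib

def blob_id(data: bytes) -> str:
    """Content hash standing in for a Git blob ID."""
    return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

objects = {}     # blob ID -> content: the de-duplicated store
index = {}       # path -> blob ID: the proposed next commit
work_tree = {}   # path -> bytes: the ordinary files you edit

def add(path):
    """Make the index copy match the working tree copy, reusing identical blobs."""
    data = work_tree[path]
    oid = blob_id(data)
    objects.setdefault(oid, data)   # an existing duplicate is simply reused
    index[path] = oid

def commit():
    """Build the new snapshot from the index, not from the working tree."""
    return dict(index)

work_tree["README"] = b"hello\n"
add("README")
first = commit()

work_tree["README"] = b"hello, world\n"   # edited, but not yet added
second = commit()                          # so the snapshot has the old content

add("README")
third = commit()                           # now the edit is in the snapshot
```

The second snapshot matches the first even though the file on disk changed, because only add updates the proposed next commit.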
[2] The "copies" in Git's index or staging area are already de-duplicated, and remain that way at all times, so unless you've altered a file and run git add, these copies take no space. Technically, what's in the index is really just the file's name, mode, hash ID, and cache data, plus a slot number used during merging; we won't cover this at all.
[3] Technically, Git adds a new blob object immediately. If we end up not committing it after all, Git will eventually clean it up, providing we don't wind up doing another git add and git commit that does eventually commit it. So if you have a very big file (say, a few dozen terabytes or petabytes) you probably don't want to git add it unless and until it's really necessary. For small files, though, it usually doesn't matter.
Snapshots vs diffs
I keep coming back to the concept of snapshots, because commits are snapshots (plus metadata). But if we look at a commit with, say, git show or git log -p, we don't see a snapshot. Instead, we see a diff:
$ git show | head -25 | sed 's/@/ /'
commit f01e51a7cfd75131b7266131b1f7540ce0a8e5c1
Author: Junio C Hamano <gitster pobox.com>
Date: Mon Mar 21 14:18:51 2022 -0700
The thirteenth batch
Signed-off-by: Junio C Hamano <gitster pobox.com>
diff --git a/Documentation/RelNotes/2.36.0.txt b/Documentation/RelNotes/2.36.0.txt
index d67727baa1..f1449eb926 100644
--- a/Documentation/RelNotes/2.36.0.txt
+++ b/Documentation/RelNotes/2.36.0.txt
@ -74,6 +74,10 @@ UI, Workflows & Features
refs involved, takes long time renaming them. The command has been
taught to show progress bar while making the user wait.
+ * Bundle file format gets extended to allow a partial bundle,
+ filtered by similar criteria you would give when making a
+ partial/lazy clone.
+
Performance, Internal Implementation, Development Support etc.
@ -132,6 +136,12 @@ Performance, Internal Implementation, Development Support etc.
The things with the @s in them are diff hunks, and before this we get a diff header:
diff --git a/Documentation/RelNotes/2.36.0.txt b/Documentation/RelNotes/2.36.0.txt
index d67727baa1..f1449eb926 100644
--- a/Documentation/RelNotes/2.36.0.txt
+++ b/Documentation/RelNotes/2.36.0.txt
What Git has done is to take commit f01e51a7cfd75131b7266131b1f7540ce0a8e5c1, use its metadata to find its parent bc3838b310b32081d48393ba0dcf26e4735c6d19, and extract the file Documentation/RelNotes/2.36.0.txt from both commits. On the "left" (as a/), Git puts the earlier version of the file; on the "right" (as b/), Git puts the later version of the file. Then Git plays a game of Spot the Difference. The first difference Git saw was that Junio added four lines around line 77. The diff shows the added lines, plus a bit of context, then moves on to the next change that Git found, which is to add more lines around line 135 (in the old version) or 139 (in the new one).
In other words, Git uses the metadata in the commit to find the (single) parent. This gives us two snapshots, which Git can compare. But in fact, Git can make a diff from any two snapshots, not just ones that are right next to each other:
...--E--F--G--H <-- somebranch (HEAD)
Here git show will compare G and H, as those are the two adjacent commits, but we can run:
git diff <hash-of-E> HEAD
and have Git compare the snapshots in E and H directly, and show that as a diff. This all works because every commit holds a full snapshot, and Git can easily compare any two snapshots. In fact, due to the internal de-duplication, Git can compare two snapshots very quickly as long as most of the files are duplicates: it only has to look at those files that aren't duplicates. So overall, this is quite easy for Git.
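The "only look at files that aren't duplicates" shortcut can be sketched by treating a snapshot as a mapping from path to content hash. This is my own simplified model: the short strings like a1 stand in for real blob hash IDs, changed_paths is a made-up name, and the sketch only reports which files differ rather than producing line-by-line hunks.

```python
def changed_paths(old, new):
    """Compare two snapshots (path -> blob ID) and classify each differing path."""
    result = {}
    for path in sorted(old.keys() | new.keys()):
        if path not in old:
            result[path] = "added"
        elif path not in new:
            result[path] = "deleted"
        elif old[path] != new[path]:
            result[path] = "modified"
        # Equal blob IDs mean identical content, so the file is never opened.
    return result

snapshot_e = {"README": "a1", "main.py": "b2", "util.py": "c3"}
snapshot_h = {"README": "a1", "main.py": "d4", "test.py": "e5"}

print(changed_paths(snapshot_e, snapshot_h))
```

README has the same ID in both snapshots, so it is skipped without any content comparison. Only the files that differ would need the more expensive line-by-line diff.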
Merging
This all leads us to git merge, which is where Git gets much of its real power. Let's go back to this setup again:
          I--J   <-- br1 (HEAD)
         /
...--G--H
         \
          K--L   <-- br2
This tells us that we're "on" branch br1 (that is, we did a git checkout br1 or a git switch br1) and that we're using the files from commit J. Let's also say that we haven't touched any of these files (so that the index and working tree copies all match the commit-J copies). We now run:
git merge br2
Our goal here is to combine changes. That is, we want to take any work we, or someone else, did on our br1 branch, and any work we or anyone else did on the br2 branch too, and combine the work.
We just saw that Git doesn't store changes. But we also saw that Git can easily compare any two commits. How will we get Git to combine changes? We have to do some diff-ing.
We could compare the snapshot in J to the one in L, but that doesn't really get us what we want. The trick here is to use the metadata a little differently. Commit J has parent I, and commit I has parent H, which has parent G, and so on, backwards. Meanwhile commit L has parent K, which has parent H, and that goes back to G, and so on. Some of these parents are shared. In fact, as soon as we get back to H, every parent from there backwards is shared. That means commit H, which is on both branches, is the best shared parent. Git calls this "best" shared parent the merge base.[4]
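Finding the "best" shared parent can be sketched in a few lines. This is a naive illustration, not Git's actual algorithm: it collects every ancestor of each tip, intersects the two sets, and keeps the shared commits that are not themselves ancestors of some other shared commit. The names ancestors and merge_bases are mine, and for brevity G is treated as the first commit, although in the drawings there is more history behind it.

```python
def ancestors(parents, start):
    """Return every commit reachable from `start`, including `start` itself."""
    seen, stack = set(), [start]
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen

def merge_bases(parents, ours, theirs):
    """Shared ancestors that are not ancestors of any other shared ancestor."""
    shared = ancestors(parents, ours) & ancestors(parents, theirs)
    return {
        c for c in shared
        if not any(c != other and c in ancestors(parents, other) for other in shared)
    }

# hash ID -> list of parent hash IDs (G is treated as the root here).
parents = {
    "G": [], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
}

print(merge_bases(parents, "J", "L"))
```

Both G and H are shared, but G is an ancestor of H, so only H survives as the merge base. The function returns a set because some graphs have more than one such commit.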
By using this best common ancestor, or merge base, Git can run two git diffs:
git diff --find-renames <hash-of-H> <hash-of-J> # what we changed
git diff --find-renames <hash-of-H> <hash-of-L> # what they changed
These two diffs apply to the same snapshot (the one in commit H), and now Git can combine the diffs. As long as we touched files they didn't and they touched files we didn't, that's easy. When we and they touched the same files, Git's rules here are simple: if we didn't touch the same lines, and our changes don't butt up against each other, Git takes both changes. If we did touch the same lines, Git requires that we make the same change to those same lines, and then Git takes one of those changes. If our changes can't be combined, Git calls that a merge conflict.
This is what merge conflicts are about. Git has picked some merge base commit, and has diff-ed its snapshot against two other commits' snapshots. Git is now trying to combine changes. Git has encountered a case where its simple, line-based, text-oriented rules don't have a simple answer for how to combine these, so Git says "conflict".
Note: Git can also detect files that were all-new, or removed entirely, or renamed. This produces a different kind of merge conflict (some call it a tree conflict; I call it a high level conflict) that doesn't involve particular lines within a file, but rather some entire thing to do with that file. For instance, suppose we added some functions to subroutines.py and they deleted subroutines.py entirely. Git has no idea how to combine "add these lines" with "delete this file", so it will call that a modify/delete conflict.
In all these conflict cases, Git dumps the job of resolving the conflict onto the human, who presumably understands the file's contents. The human doesn't just apply simple text-substitution rules. The human knows whether changing red ball to blue ball on one side of the merge, and red ball to red cube on the other side, should result in blue cube, or maybe in green pyramid or whatever.
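The combining step can be sketched at whole-file granularity. This is a deliberate simplification of mine: real Git works line by line inside each file, while this toy treats each file as a single unit, so any file changed differently on both sides counts as a conflict. The function name merge_snapshots and the file contents are invented for the example.

```python
def merge_snapshots(base, ours, theirs):
    """Whole-file three-way merge of snapshots mapping path -> content."""
    merged, conflicts = {}, []
    for path in sorted(base.keys() | ours.keys() | theirs.keys()):
        # A missing path reads as None, which lets deletion count as a change.
        b, o, t = base.get(path), ours.get(path), theirs.get(path)
        if o == t:
            result = o              # both sides agree
        elif o == b:
            result = t              # only they changed it: take theirs
        elif t == b:
            result = o              # only we changed it: take ours
        else:
            conflicts.append(path)  # both changed it, in different ways
            continue
        if result is not None:
            merged[path] = result
    return merged, conflicts

base = {"a.txt": "1", "b.txt": "1", "subroutines.py": "def f(): pass\n"}
ours = {"a.txt": "2", "b.txt": "1",
        "subroutines.py": "def f(): pass\ndef g(): pass\n"}
theirs = {"a.txt": "1", "b.txt": "3"}   # they deleted subroutines.py

merged, conflicts = merge_snapshots(base, ours, theirs)
print(merged)
print(conflicts)
```

Our change to a.txt and their change to b.txt combine cleanly. The file we modified and they deleted lands in the conflict list, which is the toy equivalent of the modify/delete conflict described above.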
But if there isn't a conflict—if the merge goes smoothly—Git will take the combined changes, whatever those wind up being, and apply them to the base snapshot. That is, given:
          I--J   <-- br1 (HEAD)
         /
...--G--H
         \
          K--L   <-- br2
Git combines our H-vs-J changes with their H-vs-L changes and applies both changes to H. That keeps our work and adds theirs, or keeps their work and adds ours, however you'd like to look at it. Then Git makes a new commit from this result, and this new commit is special in exactly one way:
          I--J
         /    \
...--G--H      M   <-- br1 (HEAD)
         \    /
          K--L   <-- br2
Commit M is a merge commit. It has a snapshot, just like any commit. It has metadata, just like any commit. What's special about it is that instead of one parent, it has two. Commit M points back to existing commit J, in the way a new commit does. But commit M also points back to merged commit L.
This (the two parents) is what makes commit M a merge commit. Of course the snapshot, in this case, is the result of merging changes as well, but that's not what makes M a merge commit. It's the two parents that make M a merge commit.
Note that, as usual, Git has updated the current branch name to point to the new commit. So br1 now means "commit M", not "commit J". No commits have changed (no commits can ever change) but the branch name has moved as usual.
What's unusual is that because M points back to L as well as to J, we may no longer care about finding commit L with a branch name. It's now safe to delete the name br2:
          I--J
         /    \
...--G--H      M   <-- br1 (HEAD)
         \    /
          K--L
because we can still find all the commits by starting at M and working backwards. It's trickier now, because when we step back once from M, we have to visit both commits J and L. Then we have to visit both I and K, and then we visit H once, and then step back to G, and so on. Git knows how to do this, but it is tricky, and this is one of the harder things to understand in Git. Peculiarly, the split (where the two branches fork off from H initially) is actually easier, and the merge at M, where the two branches come together, is hard. That's because Git works backwards, and when we work backwards, it's the merges that split, and the splits that merge, as it were. But remember that commits hold snapshots and metadata and you'll be fine: the merge holds a snapshot. Preparing the snapshot might have been hard, and figuring out why it's that snapshot, despite what's in the parent commits J and L, might be hard, but it's still just a snapshot.
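The "visit both parents, but visit H only once" walk can be sketched as a breadth-first traversal with a set of commits already seen. This is an illustrative model only. The name reachable is mine, G is again treated as the first commit for brevity, and Git's own traversal orders commits differently (by default it shows them in reverse chronological order).

```python
from collections import deque

def reachable(parents, tip):
    """List every commit reachable from `tip`, visiting each exactly once."""
    order, seen, queue = [], {tip}, deque([tip])
    while queue:
        commit = queue.popleft()
        order.append(commit)
        for parent in parents[commit]:
            if parent not in seen:   # H is the parent of both I and K
                seen.add(parent)
                queue.append(parent)
    return order

# hash ID -> list of parent hash IDs; M is the merge, with two parents.
parents = {
    "G": [], "H": ["G"],
    "I": ["H"], "J": ["I"],
    "K": ["H"], "L": ["K"],
    "M": ["J", "L"],
}

print(reachable(parents, "M"))
```

Going backwards from M, the walk fans out into two lines at the merge and comes back together at H, which is the "merges split and splits merge" effect seen from Git's side.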
Do note, though, that the "show a commit as a patch" trick that git log -p uses for ordinary commits stops working here. When we have:
...--G--H <-- branch (HEAD)
Git will compare the G and H snapshots to show a diff, but when we have:
...--J
      \
       M   <-- branch (HEAD)
      /
...--L
which snapshot should Git compare to the snapshot in M? The answer git log uses by default is "this is too hard, so I won't bother showing anything at all". That's not a very good answer, but be aware of it. (There is no single right answer to the dilemma, but it might be nice if git log -p inserted something to indicate it didn't bother to do any work here.)
[4] Technically, the merge base of two commits is found using the Lowest Common Ancestor algorithm on the DAG formed by the commits. Sometimes there's more than one LCA, and this complicates merging, but we'll ignore this case entirely here.