Note: before or after reading the text below (I recommend after), you may also want to look at the related question "Checkout another branch when there are uncommitted changes on the current branch".
What Git really does is save snapshots. That's almost all there is to it:
$ git init # create empty repository: no commits exist yet
Then, repeatedly:
... do some work ...
$ git add <files> # copy the work into the index
$ git commit # turn everything that is in the index, into a snapshot
Each git commit packages up whatever is in the index (aka staging area, aka cache) right now and turns that into a snapshot, which is permanent (well, mostly permanent) and completely read-only.
We will come back to all of this in a bit.
Commits, hash IDs, and branch names
Except for the very first commit, you always make a new snapshot while sitting on an existing snapshot. The new snapshot gets a commit hash ID: some apparently-random string of hexadecimal digits, like b7bd9486b055c3f967a870311e704e3bb0654e4f. This is the true name of the commit: it's how Git looks up the commit and, through it, the snapshot. That lets you, some time in the future, find out what you saved now.
Each commit also records the hash ID of the commit that was the existing snapshot at the time. If we use single uppercase letters, which we mere humans can comprehend, instead of the big ugly hash IDs, we can call that very first snapshot A. The second snapshot is therefore B, and it saves the actual hash ID of A inside it. We say that B points to A:
A <-B
When we make our third snapshot C, we do that while sitting on B, so C points to B:
A <-B <-C
What we (and Git) need to know, then, is: which is the latest snapshot? That's what a branch name is really about: a branch name, like master, records the latest snapshot. If the latest is C, we have:
A--B--C <-- master
If we make a new commit D, the name master now needs to remember D. D will point back to C; master does not need to remember C any more, because D will:
A--B--C--D <-- master
The arrows within commits always point backwards, from child to parent, and since nothing, not even Git itself, can change anything inside any existing commit, we don't really need to draw them. But branch name arrows do change over time, so we should keep drawing them.
Now, suppose we make a new branch name like dev at this point. The name dev will record some commit ID. It could record any of the four, but the default is to make it using the current commit ID, which is the one master holds, giving us this:
A--B--C--D <-- dev, master
Now that we have two branch names, we need to know: which branch name are we using? This is where HEAD comes in: we attach the word HEAD to one of these names. That's our current branch, and the current commit is the one whose ID is stored in that branch name. So if we are on dev, the picture is really:
A--B--C--D <-- dev (HEAD), master
Now if we make a new commit E, E will point back to D, and Git will update the current name (dev) to point to E:
A--B--C--D <-- master
          \
           E <-- dev (HEAD)
If we now run git checkout master and make a new commit F, F will point back, not to E, but to D (that's the one master points to), and Git will update master to point to F:
A--B--C--D--F <-- master (HEAD)
          \
           E <-- dev
That's it: that's all that a branch name is and does! It just records the latest commit, which Git calls the tip commit. The good stuff is all in the commits: each commit is a complete snapshot of everything that was in the index.
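The pictures above can be reproduced in a throwaway repository. This is only a sketch: the temporary directory, the placeholder identity, and the use of empty commits named A, B and E are assumptions made for illustration, not part of the original walkthrough.

```shell
# Work in a scratch repository so nothing real is touched.
cd "$(mktemp -d)" && git init -q
git symbolic-ref HEAD refs/heads/master     # use the branch name from the text
git config user.name "Example"              # placeholder identity (assumption)
git config user.email "example@example.com"

git commit -q --allow-empty -m A            # first commit: no parent
git commit -q --allow-empty -m B            # B points back to A

git branch dev                              # new name, same commit as master
master_before=$(git rev-parse master)
dev_before=$(git rev-parse dev)

git checkout -q dev                         # attach HEAD to dev
git commit -q --allow-empty -m E            # only the current name (dev) moves

master_after=$(git rev-parse master)        # master still names the old tip
dev_parent=$(git rev-parse dev^)            # parent ID recorded inside E
current=$(git symbolic-ref --short HEAD)    # the name HEAD is attached to

git --no-pager log --oneline --graph --all  # draw the resulting graph
```

Right after git branch dev, both names resolve to the same hash ID. After the commit on dev, master is unchanged and the new commit's parent is the commit master still names.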
The index and the work-tree
All the files that are inside a commit are in a special, Git-only, compressed form (often highly compressed, at least for source text files). Git is pretty much the only program that can read them or do anything with them.1 So Git needs a way that you and your computer can read and write to ordinary-format files. Those files go into your work-tree, so-called because here, you can work with them.
Git has, however, an intermediate form for all the files. It takes those compressed, Git-only, read-only files and copies them—well, stuff about them, really—into something Git calls the index. Here, the files are still compressed in a Git-only form, but here, they can be overwritten. It also uses this index to keep track of—to index and cache, hence those names—information about the work-tree files. This is where Git gets most of its speed. There are similar VCSes that don't have an index, proving that it's unnecessary in a theoretical sense, but they are slower (sometimes hugely slower) than Git.
Having provided this index, Git forces you to use the index, even if you don't really want to. Instead of copying files straight from a commit to the work-tree, it copies files from the commit into the index first, and only then expands them out to normal form in the work-tree. This is why Git makes you run git add every time: what git add does is copy the file from the work-tree into the index (compressing it into Git format in the process).
This is why git commit is so fast compared to other VCSes: Git can just take whatever is in the index right now, package it into a commit, and be done. All the hard work of compressing files is already done! Git does not even have to look at the work-tree.
This also means that after git commit, the new commit you just made matches the index. So the index matches the current commit in both directions: after git checkout branch, the index matches the tip commit of branch, because Git copied the commit to the index while updating the work-tree. After git commit changes branch to have a new tip commit, the index matches the (new) tip commit of branch, because Git copied the index (froze it into a snapshot) to make the commit.
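A short experiment shows that git commit snapshots the index rather than the work-tree. This is a sketch under assumptions: the scratch directory, the placeholder identity, and the file name file.txt are all made up for the example.

```shell
# Scratch repository with a placeholder identity (both are assumptions).
cd "$(mktemp -d)" && git init -q
git config user.name "Example"
git config user.email "example@example.com"

echo "version 1" > file.txt
git add file.txt                      # copy version 1 into the index
echo "version 2" > file.txt           # change the work-tree only

git commit -q -m "snapshot"           # freezes what is in the index

committed=$(git show HEAD:file.txt)   # content stored in the commit
staged=$(git show :file.txt)          # content stored in the index
worktree=$(cat file.txt)              # content in the work-tree

echo "commit: $committed / index: $staged / work-tree: $worktree"
```

The commit and the index agree with each other, while the work-tree copy holds the later, un-added edit.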
1 Nothing can change them: this is a design feature. The actual contents of everything are stored under a cryptographic checksum hash ID. (This is where the hash IDs actually come from.) The hash ID is exquisitely sensitive to every single bit, so if you were to change something (accidentally, as with a disk error, or on purpose by overwriting it), Git would detect that the object's checksum no longer matches the checksum-key used to retrieve the object. That's why everything, once committed, is read-only.
Commits can be forgotten about, on purpose. Doing so is sometimes tricky, and they will very easily get restored: Git is mainly designed to add things, not to remove them, and is much more willing to add new things than it is to forget old ones. We won't cover this in any detail here.
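The checksum behavior described in footnote 1 can be seen with git hash-object, which computes the ID Git would store some content under. The sample strings and the scratch repository below are assumptions chosen for illustration.

```shell
# Scratch repository, so that writing an object touches nothing real.
cd "$(mktemp -d)" && git init -q

first=$(printf 'hello\n' | git hash-object --stdin)     # ID for this content
again=$(printf 'hello\n' | git hash-object --stdin)     # same bytes, same ID
changed=$(printf 'hellp\n' | git hash-object --stdin)   # one letter differs

# Store the content, then retrieve it using its hash ID as the key.
stored=$(printf 'hello\n' | git hash-object -w --stdin)
content=$(git cat-file -p "$stored")

printf '%s\n%s\n%s\n' "$first" "$again" "$changed"
```

Identical bytes always give the same ID, and changing a single character gives an unrelated one.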
"But commits look like diffs!"
If you run:
git show <commit>
or:
git log -p
you will see each commit shown as a patch. Git can do this because each commit stores the hash ID of its previous commit (its parent) inside the commit. Git simply extracts both snapshots and compares them. Whatever is different gets shown.
(There is a complication here at merge commits, but we'll just ignore that, too.)
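To see that the patch is computed rather than stored, one can look at a raw commit object and then compare the two snapshots by hand. This is a sketch: the repository, the placeholder identity, and the file notes.txt are assumptions for the example.

```shell
# Scratch repository with a placeholder identity (both are assumptions).
cd "$(mktemp -d)" && git init -q
git config user.name "Example"
git config user.email "example@example.com"

printf 'first line\n' > notes.txt
git add notes.txt && git commit -q -m "add notes"
printf 'first line\nsecond line\n' > notes.txt
git add notes.txt && git commit -q -m "extend notes"

# The raw commit object names a tree (the snapshot) and a parent, not a patch.
git cat-file -p HEAD

recorded_parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
actual_parent=$(git rev-parse HEAD^)

# Comparing the parent's snapshot with this one yields the patch.
patch=$(git diff HEAD^ HEAD)
printf '%s\n' "$patch"
```

The output of git cat-file -p contains tree, parent, author, and committer lines plus the message, and the diff appears only when the two snapshots are compared.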
Revert
What revert does can now be described very simply:2 Git turns the commit into a patch, then reverse-applies the patch to some other commit.
That is, if the commit-as-patch says "add a line to file A", Git removes that line from that file. If the commit-as-patch says "remove a line from file B", Git adds that line to that file.
Having reverse-applied the commit to the current commit (through the work-tree and using the index that matches the current commit), Git copies the updated files into the index as if by git add, then makes a new commit, automatically supplying the commit log message. You can override some of these with various flags, and there are complications (see footnote 2) when the patch doesn't apply properly. But that's mostly it.
2 This is actually too simple. Revert really invokes Git's three-way merge machinery (as does git cherry-pick). In simple, unconflicted cases, however, "apply a patch and commit" (cherry-pick) or "reverse-apply a patch and commit" (revert) suffice to describe the process.
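A simple, unconflicted revert can be tried out as follows. The file names A.txt and B.txt, the commit messages, and the placeholder identity are assumptions made for this sketch.

```shell
# Scratch repository with a placeholder identity (both are assumptions).
cd "$(mktemp -d)" && git init -q
git config user.name "Example"
git config user.email "example@example.com"

printf 'one\n' > A.txt
git add A.txt && git commit -q -m "create A"
printf 'one\ntwo\n' > A.txt
git commit -q -am "add a line to A"
target=$(git rev-parse HEAD)                # the commit to back out

printf 'hello\n' > B.txt
git add B.txt && git commit -q -m "create B"

# Reverse-apply target's change on top of the current commit, as a new commit.
git revert --no-edit "$target" > /dev/null

a_after=$(cat A.txt)                        # the added line is gone again
b_after=$(cat B.txt)                        # later, unrelated work is kept
count=$(git rev-list --count HEAD)          # history grew; nothing was removed
subject=$(git log -1 --format=%s)           # message supplied by Git

git --no-pager log --oneline
```

Only the change introduced by the target commit is undone. The later commit that created B.txt is untouched, and the history gains a fourth commit instead of losing one.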
Revert is a poor name for this process
Mercurial (which is otherwise a lot like Git, only slower and more user-friendly) calls this hg backout rather than hg revert, because it backs out the changes of a commit. The verb revert, often with the auxiliary word to as in revert to, means (at least to some people) to change the entire contents back. That is, instead of saying:
"commit a123456 changed one line of file README.txt and I want that one line changed back"
people sometimes mean:
"README.txt has been changed a lot since commit a123456, and I want the version that was in a123456 back, so that means I want _____"
and they fill in the blank with "to revert README.txt to a123456" and thus they reach for git revert.
That's not what git revert does. To do that, one needs to extract the file README.txt from commit a123456. Confusingly, the main Git command that does this is git checkout, using a different syntax from git checkout branch. (It should have been a separate command, and in Mercurial it is: it is hg revert!) If you want this in Git, you can write:
git checkout a123456 -- README.txt
which copies README.txt from commit a123456 into the index (as usual), then expands it into normal, not-Git-only format in your work-tree as file README.txt.
Note that in all modern versions of Git, you can also use:
git show a123456:README.txt
which displays the contents of that file, as of that commit, on your screen, and generally works with redirection, so that you can save it to a file inside or outside of your work-tree:
git show a123456:README.txt > restored-readme
for instance. This does not affect the index.
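The difference between the two approaches can be checked in a scratch repository. In this sketch the variable old stands in for a123456, and the file contents and placeholder identity are assumptions.

```shell
# Scratch repository with a placeholder identity (both are assumptions).
cd "$(mktemp -d)" && git init -q
git config user.name "Example"
git config user.email "example@example.com"

printf 'old text\n' > README.txt
git add README.txt && git commit -q -m "old README"
old=$(git rev-parse HEAD)                  # stands in for a123456
printf 'new text\n' > README.txt
git commit -q -am "new README"

# git show with redirection: writes a file, leaves the index alone.
git show "$old":README.txt > restored-readme
restored=$(cat restored-readme)
index_after_show=$(git show :README.txt)

# git checkout <commit> -- <path>: updates the index and the work-tree.
git checkout "$old" -- README.txt
index_after_checkout=$(git show :README.txt)
worktree_after_checkout=$(cat README.txt)

git status --short                         # README.txt now shows as staged
```

After the git show step the index still holds the new text. After the git checkout step both the index and the work-tree hold the old text, ready to be committed.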