First, note that the terms index and staging area mean the same thing. There is also a third term, cache, that now mostly appears in flags (git rm --cached, for instance). These all refer to the same underlying entity.
Next, while it's often convenient to think in terms of changes, this will eventually mislead you, unless you keep this firmly in mind: Git does not store changes, but rather snapshots. We only see changes when we compare two snapshots. We put them side by side, as if we're playing a game of Spot the Difference; or more precisely, we have Git place them side by side, compare them, and tell us what's different. So now we see what's changed between these two snapshots. But Git doesn't have those changes. It has the two snapshots, and is merely comparing them.
Now we get to the really tricky part. We know that:
- each commit has a unique hash ID, which is how Git finds that particular commit;
- each commit stores two things:
  - it has a complete snapshot of every file Git knew about as of the time you, or whoever, made the snapshot; and
  - it has some metadata, including the name and email address of whoever made the commit, some date-and-time-stamps, and so on; importantly for Git, it also has the raw hash ID of some earlier commit(s), so that Git can move back in time, from each commit to its parent;
- and all parts of any commit are frozen in time forever.
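To see those two parts for yourself, here is a small sketch. It assumes a throwaway repository with a made-up identity and a hypothetical file.txt; none of that comes from any real project. It prints a raw commit object and then pulls out the snapshot and parent hash IDs separately:

```shell
set -e
cd "$(mktemp -d)"                        # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/main    # call the first branch main, whatever the defaults
git config user.name "Example Author"    # made-up identity for the demo
git config user.email "author@example.com"

echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two > file.txt
git add file.txt
git commit -q -m "second"

# The raw commit: a tree line (the snapshot), a parent line, then the
# author/committer metadata and the log message.
git cat-file -p HEAD

snapshot=$(git rev-parse 'HEAD^{tree}')  # hash ID of the saved snapshot
parent=$(git rev-parse 'HEAD^')          # hash ID of the earlier commit

# The snapshot holds the whole file, not a change against the parent.
content=$(git cat-file -p "$snapshot:file.txt")
echo "file.txt in the snapshot: $content"
```

The tree line is the snapshot and the parent line is the backwards-pointing link; everything else in the output is the rest of the metadata.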
So commits store snapshots, which Git can extract for us to work on. But Git doesn't just extract the commit into a working area. Other version control systems do: they have the commits and the working tree, and that's all there is, and all you need to know about. The committed version is frozen for all time, and the usable version is usable, and changeable. That's two "active" versions and gives us a way to see what we've changed: just compare the active but frozen snapshot to the working one.
But for whatever reason, Git doesn't do that. Instead, Git has three active versions. One active version is frozen for all time, just like always. One active version is in your working tree, just like always. But stuffed in between these two versions, there's a third snapshot. It's changeable, but it's otherwise more like the frozen copy than it is like the useful copy.
This third copy of each file, sitting between the frozen commit and the usable copy, is Git's index, or at least, the part of Git's index you get to worry about.1 You need to know about Git's index, because it acts as your proposed next commit.
That is, when you run:
git commit
what Git will do is:
- gather the appropriate metadata, including the hash ID of the current commit;
- make a new (though not necessarily unique2) snapshot;
- use the snapshot and metadata to make a new, unique commit;3
- write the new commit's hash ID into the current branch name.
The last step here adds the new commit to the current branch. The snapshot, in step 2 above, is whatever is in Git's index at this time. So before you run git commit, you have to update Git's index. This is why Git makes you run git add, even for files that Git already knows about: you're not exactly adding the file. Instead, you're overwriting the index copy.
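Here is a rough sketch of the three copies in action, again in a throwaway repository with a hypothetical file.txt. It compares the hash ID of the committed copy, the index copy, and the work-tree copy before and after git add:

```shell
set -e
cd "$(mktemp -d)"                        # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # made-up identity for the demo
git config user.email "author@example.com"

echo v1 > file.txt
git add file.txt
git commit -q -m "first"

echo v2 > file.txt                          # change only the work-tree copy

committed=$(git rev-parse HEAD:file.txt)    # frozen copy in the current commit
staged_before=$(git rev-parse :file.txt)    # ":path" names the index copy

git add file.txt                            # overwrite the index copy

staged_after=$(git rev-parse :file.txt)
worktree=$(git hash-object file.txt)        # hash ID the work-tree file would get

echo "committed:     $committed"
echo "index before:  $staged_before"
echo "index after:   $staged_after"
echo "work-tree:     $worktree"
```

Before the git add, the index copy matches the committed copy; afterward, it matches the work-tree copy. The committed copy never changes.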
1The rest of it is Git's cache, which normally doesn't get all up in your face. You can use Git without knowing about the cache aspect. It's difficult—maybe impossible—to use Git well without knowing about the index.
2If you make a commit, then revert it, the second commit re-uses the snapshot that you had before you made the first commit, for instance. It's not at all abnormal to wind up re-using old snapshots.
3Unlike source snapshots, each commit is always unique. One way to see why this is the case is that each commit gets a date-and-time. You'd have to make multiple commits in a single second to risk any of them getting the same timestamp. Even then, those commits would presumably have different snapshots and/or different parent commit hash IDs, which would keep them different. The only way to get the same hash ID is to commit the same source, by the same person, after the same previous commit, at the same time.4
4Or, you could get a hash ID collision, but that never actually happens. See also How does the newly found SHA-1 collision affect Git?
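As an illustration of footnotes 2 and 3, this sketch (throwaway repository, hypothetical file.txt) makes a commit, reverts it, and compares snapshot hash IDs against commit hash IDs:

```shell
set -e
cd "$(mktemp -d)"                        # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # made-up identity for the demo
git config user.email "author@example.com"

echo base > file.txt
git add file.txt
git commit -q -m "base"
echo change > file.txt
git add file.txt
git commit -q -m "change"
git revert --no-edit HEAD >/dev/null     # new commit that undoes "change"

tree_before=$(git rev-parse 'HEAD~2^{tree}')   # snapshot of "base"
tree_after=$(git rev-parse 'HEAD^{tree}')      # snapshot of the revert commit
commit_before=$(git rev-parse HEAD~2)
commit_after=$(git rev-parse HEAD)

echo "snapshots: $tree_before $tree_after"
echo "commits:   $commit_before $commit_after"
```

The two snapshot hash IDs match, because the files are the same again, while the two commit hash IDs differ, because the metadata (parent, message, and usually timestamp) is different.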
A picture
Let's draw a picture of some commits. Instead of hash IDs, let's use uppercase letters. We'll have a simple chain of commits along the main-line branch, with no other branches yet:
... <-F <-G <-H
Here, H stands in for the hash ID of the last commit in the chain. Commit H has both a snapshot (saved from Git's index whenever you, or whoever, made commit H) and metadata (name of the person who made H, etc). In the metadata, commit H stores earlier commit G's raw hash ID. So we say that H points to G.
Commit G, of course, also has both a snapshot and metadata. That metadata makes earlier commit G point back to still-earlier commit F. Commit F in turn points back still further.
This repeats all the way to the very first commit ever. Being first, it doesn't point back, because it can't; so Git can stop here. Git just needs to be able to find the last commit. Git needs its hash ID. You could type it in yourself, but that would be painful. You could store it in a file somewhere, but that would be annoying. You could have Git store it for you, and that would be convenient—and that's just what a branch name is and does for you:
...--F--G--H <-- main
The name main simply holds the one hash ID, of the last commit in the chain.
This is true no matter how many names and commits we have: each name holds the hash ID of some actual, valid commit. Let's make a new name, feature, that also points to H, like this:
...--F--G--H <-- feature, main
Now we need a way to know which name we're using. Git attaches the special name HEAD to one of the branch names, like this:
...--F--G--H <-- feature, main (HEAD)
We're now "on" main, and using commit H. Let's use git switch or git checkout to switch to the name feature:
...--F--G--H <-- feature (HEAD), main
Nothing else has changed: we're still using commit H. But we're using it because of the name feature.
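A quick way to check this picture is to ask Git which name HEAD is attached to and which hash ID each name holds. This is only a sketch in a throwaway repository; the single commit stands in for H:

```shell
set -e
cd "$(mktemp -d)"                        # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # made-up identity for the demo
git config user.email "author@example.com"

echo hello > file.txt
git add file.txt
git commit -q -m "H"

git branch feature                           # second name for the same commit
on_before=$(git symbolic-ref --short HEAD)   # name HEAD is attached to

git checkout -q feature                      # git switch works too, in Git 2.23 and later
on_after=$(git symbolic-ref --short HEAD)

main_id=$(git rev-parse main)
feature_id=$(git rev-parse feature)
head_id=$(git rev-parse HEAD)

echo "HEAD was attached to $on_before, is now attached to $on_after"
echo "main=$main_id feature=$feature_id HEAD=$head_id"
```

Only the attachment of HEAD changes; all three hash IDs stay the same.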
If we make a new commit (let's call it commit I), commit I will point back to commit H, and Git will write commit I's hash ID into the current name. This will produce:
...--F--G--H <-- main
            \
             I <-- feature (HEAD)
Now if we git checkout main, Git has to swap out our working tree contents and our proposed-next-commit contents. So git checkout main will flip both Git's index and our working-tree contents around so that they match commit H. After that, git checkout feature will flip them back so that they both match commit I.
If we make a new commit J on feature, we get:
...--F--G--H <-- main
            \
             I--J <-- feature (HEAD)
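The drawing can be reproduced with a sketch like the following. The repository is a throwaway, and each hypothetical commit just stores its own letter in file.txt so that we can tell which snapshot is checked out:

```shell
set -e
cd "$(mktemp -d)"                        # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # made-up identity for the demo
git config user.email "author@example.com"

echo "H" > file.txt
git add file.txt
git commit -q -m "H"
h=$(git rev-parse HEAD)

git checkout -q -b feature               # create feature at H, attach HEAD to it
echo "I" > file.txt
git commit -q -a -m "I"
echo "J" > file.txt
git commit -q -a -m "J"

git log --oneline --graph --all          # Git's own drawing of the commits

git checkout -q main                     # index and work-tree now match H
on_main=$(cat file.txt)
git checkout -q feature                  # and now they match J again
on_feature=$(cat file.txt)
echo "on main: $on_main, on feature: $on_feature"
```

The name main still holds H's hash ID; only feature, the name HEAD was attached to, moved forward.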
The reset command: it's complicated!
The git reset command is complicated.5 We'll only look at the "whole commit" varieties of the command here (the ones that take the --hard, --soft, and --mixed options) and not the ones that mostly do things that we can now do with git restore in Git 2.23 and later.
These "whole commit" reset operations take a general form:
git reset [<mode-flag>] [<commit>]
The mode-flag is one of --soft, --mixed, or --hard.6 The commit specifier, which can be a raw hash ID directly or anything else that git rev-parse can convert to a commit hash ID, tells us which commit we'll move to.
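As a small sketch of what "anything that can be converted" means, the following resolves several spellings of the same commit with git rev-parse. The repository and the three hypothetical commits F, G, and H are made up for the demonstration:

```shell
set -e
cd "$(mktemp -d)"                        # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # made-up identity for the demo
git config user.email "author@example.com"

for n in F G H; do
    echo "$n" > file.txt
    git add file.txt
    git commit -q -m "$n"
done

# Four spellings of "the commit before the last one":
via_head=$(git rev-parse --verify 'HEAD~1^{commit}')     # relative to HEAD
via_branch=$(git rev-parse --verify 'main^')             # relative to a branch name
via_full=$(git rev-parse --verify "$via_head")           # a raw hash ID
short=$(git rev-parse --short "$via_head")               # an abbreviated hash ID
via_short=$(git rev-parse --verify "$short^{commit}")

subject=$(git log -1 --format=%s "$via_head")
echo "all of these name commit $subject: $via_head"
```

Any of these spellings could be given to git reset as the commit argument.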
The command does three things, except that you can have it stop early:
1. First, it moves the branch name to which HEAD is attached.7 It does this by just writing a new hash ID into the branch name.
2. Second, it replaces what's in Git's index with what's in the commit you selected.
3. Third and last, it replaces what's in your work-tree with those same files, the ones it just put into Git's index.
The first part, moving HEAD, always happens, but if you pick the current commit as the new hash ID, the "move" is from where you are, to where you are: kind of pointless. This only makes sense if you're having the command go on to steps 2 and 3, or at least to step 2. But it does always happen.
The default for the commit is the current commit. That is, if you don't pick a new commit, git reset will pick the current commit as the place to move HEAD. So if you don't pick a new commit, you're making step 1 do the "stay in place" kind of move. That's fine, as long as you don't make it stop there: if you make git reset stop after step 1, and make it stay in place, you're doing a lot of work to accomplish nothing at all. That's not really wrong, but it is a waste of time.
So, now let's look at the flags:
--soft tells git reset: do the move, but then stop there. Whatever is in Git's index before the move is still in Git's index afterward. Whatever is in your working tree remains untouched.
--mixed tells git reset: do the move and then overwrite your index, but leave my working tree alone.
--hard tells git reset: do the move, then overwrite both your index and my working tree.
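The three flags can be compared side by side with a sketch like this one. It assumes a throwaway repository with two hypothetical commits, I and J, each storing its own letter in file.txt; the report function is a helper invented for this demo, not a Git command:

```shell
set -e
cd "$(mktemp -d)"                        # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # made-up identity for the demo
git config user.email "author@example.com"

echo "I" > file.txt
git add file.txt
git commit -q -m "I"
i=$(git rev-parse HEAD)
echo "J" > file.txt
git commit -q -a -m "J"
j=$(git rev-parse HEAD)

# Start from commit J every time, reset to commit I with the given flag,
# then report which commit's file.txt each of the three places holds.
report() {
    git reset -q --hard "$j"
    git reset -q "$1" "$i"
    printf '%s: branch=%s index=%s worktree=%s\n' "$1" \
        "$(git log -1 --format=%s)" "$(git show :file.txt)" "$(cat file.txt)"
}

soft=$(report --soft)
mixed=$(report --mixed)
hard=$(report --hard)
printf '%s\n' "$soft" "$mixed" "$hard"
```

In every case the branch name moves to I. With --soft the index and work-tree still hold J's file, with --mixed only the work-tree does, and with --hard nothing does.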
So, let's say we start with this:
...--F--G--H <-- main
            \
             I--J <-- feature (HEAD)
and pick commit I as the place that git reset should move feature, so that we end up with:
...--F--G--H <-- main
            \
             I <-- feature (HEAD)
              \
               J
Note how commit J still exists, but we can't find it unless we've saved the hash ID somewhere. We could save J's hash ID on paper, on a whiteboard, in a file, in another branch name, in a tag name, or whatever. Anything that lets us type it in or cut-and-paste it will do. We can then make a new name that finds J. We could do this before we do the git reset, e.g.:
git branch save
git reset --mixed <hash-of-I>
would get us:
...--F--G--H <-- main
            \
             I <-- feature (HEAD)
              \
               J <-- save
where the name save retains J's hash ID.
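Here is the same sequence as a runnable sketch. The repository is a throwaway, the hypothetical commits H, I, and J each store their own letter in file.txt, and HEAD~1 stands in for the hash of I:

```shell
set -e
cd "$(mktemp -d)"                        # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # made-up identity for the demo
git config user.email "author@example.com"

echo "H" > file.txt
git add file.txt
git commit -q -m "H"
git checkout -q -b feature
echo "I" > file.txt
git commit -q -a -m "I"
i=$(git rev-parse HEAD)
echo "J" > file.txt
git commit -q -a -m "J"
j=$(git rev-parse HEAD)

git branch save                          # remember J under another name
git reset -q --mixed HEAD~1              # move feature back to I

echo "feature: $(git log -1 --format=%s feature)"
echo "save:    $(git log -1 --format=%s save)"
echo "index:   $(git show :file.txt)"
echo "file:    $(cat file.txt)"
```

The name save still finds J, feature now names I, the index copy matches I, and the work-tree file is whatever it was before the reset.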
The --mixed, if we use it here, tells Git: don't touch my work-tree files at all! This doesn't mean you'll have, in your work-tree, the exact same files that are in commit J, because maybe you were fiddling with those work-tree files just before you did the git reset. The --mixed means that Git will overwrite its files, in Git's index, with the files from I. But Git won't touch your files here. Only with --hard will git reset touch your files.
(Of course, if you run git checkout or git switch: well, those commands are supposed to touch your files too, so that gets more complicated again. But don't worry about that right now, as we're concentrating on git reset.)
5I personally think that git reset is too complicated, the way git checkout was. Git 2.23 split the old git checkout into git switch and git restore. I think git reset should be similarly split up. But it isn't yet, so there is not much point complaining, other than to write this footnote.
6There are also --merge and --keep modes, but they're just further complications that I intend to ignore as well.
7In detached HEAD mode, which I'm ignoring here, it just writes a new hash ID into HEAD directly.
Summary
The default for git reset is to leave your files alone (--mixed). You can also tell Git to leave its own index alone, with --soft: this is sometimes useful when you want to make a new commit that uses what's in Git's index. Suppose you have:
...--G--H <-- main
         \
          I--J--K--L--M--N--O--P--Q--R <-- feature (HEAD)
where commits I through Q are all just various experiments, and your last commit, commit R, has everything in its final shape.
Suppose, then, that you wish to make a new commit that uses the snapshot from R, but comes after commit H, and you want to call that the last commit on your (updated) feature. You could do this with:
git checkout feature # if necessary - if you're not already there
git status # make sure commit R is healthy, etc
git reset --soft main # move the branch name but leave everything else
git commit
Right after the git reset, we have this picture:
...--G--H <-- feature (HEAD), main
         \
          I--J--K--L--M--N--O--P--Q--R ???
It's now hard to find commits I through R at all. But the right files are in Git's index now, ready to be committed, so the git commit makes a new commit that we can call S (for "squash"):
          S <-- feature (HEAD)
         /
...--G--H <-- main
         \
          I--J--K--L--M--N--O--P--Q--R ???
If you were to compare the snapshot in R to that in S, they would be the same. (Here's another case where Git would just re-use the existing snapshot.) But since we can't see commits I-J-...-R, it now seems as though we've magically squashed all the commits together into one:
          S <-- feature (HEAD)
         /
...--G--H <-- main
Comparing S to its parent H, we see all the same changes as we'd see if we compared H vs R. If we never see I-J-...-R again, that's probably just fine!
So git reset --soft is convenient because we get to move a branch name and preserve everything in both Git's index and our work-tree.
In some other cases, we might want to make, say, two commits out of the files that were in R. Here we could let --mixed reset Git's index:
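The whole squash can be tried out in a throwaway repository. In this sketch the ten hypothetical commits I through R each overwrite file.txt with their own letter, and the commit message for S is made up:

```shell
set -e
cd "$(mktemp -d)"                        # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # made-up identity for the demo
git config user.email "author@example.com"

echo "H" > file.txt
git add file.txt
git commit -q -m "H"

git checkout -q -b feature
for n in I J K L M N O P Q R; do         # ten experimental commits
    echo "$n" > file.txt
    git commit -q -a -m "$n"
done
r_tree=$(git rev-parse 'HEAD^{tree}')    # snapshot saved in commit R

git reset -q --soft main                 # move feature to H; keep index and work-tree
git commit -q -m "S: squashed feature work"

s_tree=$(git rev-parse 'HEAD^{tree}')
count=$(git rev-list --count main..feature)
echo "commits on feature after main: $count"
git log --oneline --graph --all
```

The snapshot hash ID for S is the same as the one for R, S's parent is the commit that main names, and only one commit remains between main and feature.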
git reset main
git add <subset-of-files>
git commit
git add <rest-of-files>
git commit
This would give us:
          S--T <-- feature (HEAD)
         /
...--G--H <-- main
where the snapshot in T matches that in R, and the snapshot in S has just a few changed files. Here, we use the --mixed mode of reset to keep all files intact in our work-tree but reset Git's index. Then we use git add to update Git's index to match part of our work-tree, commit once to make S, and use git add to update Git's index with the rest of our work-tree and commit again to make T.
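As a sketch of this two-commit split, assume a throwaway repository where a single hypothetical commit R changed two files, a.txt and b.txt, which stand in for the subset and the rest:

```shell
set -e
cd "$(mktemp -d)"                        # throwaway directory
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # made-up identity for the demo
git config user.email "author@example.com"

echo "old a" > a.txt
echo "old b" > b.txt
git add a.txt b.txt
git commit -q -m "H"

git checkout -q -b feature
echo "new a" > a.txt
echo "new b" > b.txt
git commit -q -a -m "R"
r_tree=$(git rev-parse 'HEAD^{tree}')    # snapshot saved in commit R

git reset -q main                        # --mixed is the default: index reset, files kept
git add a.txt                            # first subset of files
git commit -q -m "S: first half"
git add b.txt                            # the rest
git commit -q -m "T: second half"

t_tree=$(git rev-parse 'HEAD^{tree}')
echo "b.txt in S: $(git show HEAD~1:b.txt)"
echo "b.txt in T: $(git show HEAD:b.txt)"
```

Commit S carries only the a.txt update, and T ends up with the same snapshot hash ID that R had.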
So all of these modes have their uses, but to understand those uses, you need to understand what Git is doing with Git's index and your work-tree.