Understanding the effect of git reset on the index

Question

I'm a little confused when reading documentation and tutorials about git reset. For git reset --mixed, for example, the documentation says:

The next thing reset will do is to update the Index with the contents of whatever snapshot HEAD now points to

What confuses me is that I expected reset to clear the index rather than update it. Is the index cleared, or is it updated with whatever snapshot HEAD now points to?

Possible duplicate of [What's the difference between git reset --mixed, --soft, and --hard?](https://stackoverflow.com/questions/3528245/whats-the-difference-between-git-reset-mixed-soft-and-hard) — Eelke, Dec 02 '18 at 15:32
The index does not store changes. It stores a full snapshot of the working directory. **Clearing the index** would mean staging the removal of all files. Why are you expecting that? — user4003407, Dec 02 '18 at 17:03

score 3 · Answer 1 · answered Dec 02 '18 at 19:26

TL;DR

The index is always updated. The index holds the next commit you intend to make, so it's never empty. (What, never? Well, hardly ever: it's empty in a new repository you just created, which has no files, and would commit nothing if you ran git commit right now. It's also empty if you git rm everything.)
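To see the "hardly ever" cases for yourself, here is a minimal sketch in a throwaway repository. The temporary directory and the file name README.txt are assumptions made for illustration only.

```shell
set -e
repo=$(mktemp -d)          # throwaway directory, so nothing real is touched
cd "$repo"
git init -q

# A brand-new repository has nothing in its index.
fresh_count=$(git ls-files | wc -l | tr -d ' ')

# Stage one file: the index now holds one entry.
echo hello > README.txt
git add README.txt
staged_count=$(git ls-files | wc -l | tr -d ' ')

# Remove everything from the index again (the work-tree file stays).
git rm -q --cached README.txt
emptied_count=$(git ls-files | wc -l | tr -d ' ')

echo "fresh=$fresh_count staged=$staged_count emptied=$emptied_count"
```

The index is only empty at the two ends of this sketch. In between, it holds a full entry for every file that would go into the next commit.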

Long

Your confusion here is almost certainly related to the comment PetSerAl made. Those new to Git are often told or shown or at least led to believe that commits and/or Git's index contain changes, but this is false! Once you rid yourself of this incorrect belief, some of the mysteries of Git begin to make more sense. (Not all of Git makes sense to anyone, even me. So don't worry if it takes a long time to grok Git.)

In Git, a commit contains a complete snapshot of all of your files. It also contains some metadata—information about the commit itself, such as your name, email address, and a timestamp. Included in the metadata is the hash ID of the commit's parent commit—or, for a merge commit, multiple parents, plural—and it's by comparing commits to their parents that Git shows you changes. Each commit has its own unique hash ID, such as 8858448bb49332d353febc078ce4a3abcc962efe (this is the ID of a commit in the Git repository for Git). That commit is a snapshot, but that commit has a parent (in this case, 67f673aa4a...), so Git can show you 8858448bb4... by extracting both the earlier 67f673aa4a and 8858448bb4, then comparing the two. The git show command does just that, so that what you see is what changed in 8858448bb4, rather than what is in 8858448bb4.

(It's like telling you that it's 5 degrees warmer or cooler today than yesterday, and more or less windy, instead of giving the weather as a bunch of numbers. The database stores absolutes, but mostly we want to know whether it's nicer out.)
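The difference between what a commit stores and what git show displays can be sketched with two small commits. The file names, commit messages, and the inline demo identity below are assumptions for the example, not anything taken from the Git repository mentioned above.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Identity passed inline so the sketch does not depend on global config.
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

echo readme > README.txt
echo 'print("v1")' > main.py
git add README.txt main.py
commit "first"

echo 'print("v2")' > main.py
git add main.py
commit "second"

# The snapshot stored in the second commit: every file, changed or not.
snapshot=$(git ls-tree -r --name-only HEAD)
# What git show reports: only the files that differ from the parent.
changed=$(git show --name-only --format= HEAD)

printf 'snapshot:\n%s\nchanged:\n%s\n' "$snapshot" "$changed"
```

README.txt appears in the snapshot even though the second commit never touched it, while git show lists only main.py.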

The index stores the next commit you can make

You can see Git's commits in various ways, and of course name them by their hash IDs, as I did above. You can see your work-tree—which is where Git lets you view and edit your files—directly: there they are, on your computer, in their normal everyday form. But you can't see the index very well. It's kind of invisible. This is a problem, because it's also critical.

Most version control systems don't have an index at all, or if they have something like it, keep it so well hidden that you never have to know about it. But Git does this odd thing of forcing you to understand Git's index, while also keeping it a little bit hidden.

If you really want to see a list of the files that are in the index right now, you can use git ls-files:

$ git ls-files | head
.clang-format
.editorconfig
.gitattributes
.github/CONTRIBUTING.md
.github/PULL_REQUEST_TEMPLATE.md
.gitignore
.gitmodules
.mailmap
.travis.yml
.tsan-suppressions
$ git ls-files | wc -l
    3454

There are almost 3500 files in the index, in this Git repository for Git. That's a lot of files! This is why Git keeps it mostly-hidden: there's just too much stuff in there to comprehend.
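If you want more than the names, git ls-files --stage also prints the mode, blob hash, and stage number for each entry, and the :path syntax lets you read the index copy of a file. A small sketch, assuming a throwaway repository with a single made-up file:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
echo 'hello index' > README.txt
git add README.txt

# One line per index entry: mode, blob hash, stage number, path.
git ls-files --stage

# ":path" names the index copy of a file; cat-file prints its full content.
index_copy=$(git cat-file -p :README.txt)
echo "$index_copy"
```

The index entry refers to a complete copy of the file's content, not to a change against anything.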

But this is also why Git shows us commits by comparing them to their parents. Showing the whole contents of 8858448bb4 would be too much, so git show 8858448bb4 shows us what changed in 8858448bb4, vs its parent. Git takes the same tack with the index, showing us what we have changed, rather than dumping out the entire thing.

This, I think, is what makes people think that Git is storing changes. Git shows changes, so Git must be storing them ... but it's not! Git stores whole snapshots. Git figures out changes, every time you ask Git to show you something.

With that in mind, let's look at how we see the index.

The index sits between the current commit and the work-tree

We know now that each commit is a full snapshot. If Git made a new copy of every file every time we made a commit, the repository would get very large very fast. So it doesn't do that, and one part of the way it doesn't do that is really simple. While each commit is a full snapshot, the files inside every commit are completely, totally, 100% read-only. None of them can ever change. This means that each commit can share some or all of its files with some earlier commit!
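One way to watch this sharing happen is to compare the blob hash of a file across two commits. This is a sketch in a scratch repository; the file names and the demo identity are assumptions for illustration.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

echo readme > README.txt
echo 'print("v1")' > main.py
git add README.txt main.py
commit "first"

echo 'print("v2")' > main.py      # only main.py changes
git add main.py
commit "second"

# "commit:path" resolves to the hash of the stored file content (a blob).
old_readme=$(git rev-parse HEAD~1:README.txt)
new_readme=$(git rev-parse HEAD:README.txt)
old_main=$(git rev-parse HEAD~1:main.py)
new_main=$(git rev-parse HEAD:main.py)

echo "README.txt: $old_readme -> $new_readme"
echo "main.py:    $old_main -> $new_main"
```

Both commits point at the same blob for the unchanged README.txt, so that content is stored once. Only main.py gets a new blob.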

Git just needs to make sure that every time we run git commit, it freezes all the file content, forever—or if not forever, for at least as long as this new commit continues to exist. So files inside each commit are frozen. They're also compressed into a special Git-only format (which works really well for text files, but often not so great for binary files like images). This compressing takes time, sometimes a lot of time, but it makes the repository stay small.

Obviously, frozen Git-only files are useful only to Git itself, so we need a copy of every file from the current commit taken out, thawed, decompressed, and made useful. These useful copies go into the work-tree, where we do our work.

Other version control systems do much the same thing. In the hypothetical XYZ Version Control system, you run xyz checkout commit and it copies the commit out of the deep-freeze warehouse, thaws it out, decompresses it, and stores it in your work-tree. You do some work, and eventually you run xyz commit. It now scans through your entire work-tree, re-compresses every file, freezes it up, and checks to see if it's already got that frozen version in the warehouse or needs to put this one in there too. Each of these steps takes many seconds or minutes while you go get coffee or whatever.

What Git does, with its index, is very clever: the index is a staging area, between the deep-freeze warehouse (the repository full of commits) and the useful form (thawed-out files in your work-tree). Initially, it contains the same files that were in the deep-freeze. They're thawed (sort of), but are still in the special Git-only form, and they are paired up with the fully-thawed, de-compressed version in your work-tree.

As you change the files in your work-tree, or add and/or remove files, the index copies get out of sync with the work-tree. Now Git can compare the index copy to the work-tree copy, and tell you what you have changed but not yet staged.

Once you have some file the way you want it, you run git add file. This re-compresses the file right then and there, into the special Git-only format, and puts that copy in the index. Now the index copy—which is a complete copy, just compressed—matches the work-tree copy, but is different from the committed copy.
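Here is a small sketch of that moment, reading the index copy before and after git add. The file name file.txt and its contents are made up for the example.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

echo v1 > file.txt
git add file.txt
commit "first"

echo v2 > file.txt                            # edit the work-tree copy only
before=$(git cat-file -p :file.txt)           # index copy, not yet updated
git add file.txt                              # copy the work-tree version into the index
after=$(git cat-file -p :file.txt)            # index copy after git add
committed=$(git cat-file -p HEAD:file.txt)    # frozen copy in the current commit

echo "index before add: $before"
echo "index after add:  $after"
echo "committed (HEAD): $committed"
```

After git add, the index copy matches the work-tree and differs from HEAD, which is exactly the state Git reports as staged.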

At any time, you can have Git compare the committed (HEAD) copy of each file to the index copy:

git diff --cached

For files that are the same, Git says nothing. For files that are different, Git lists the file and shows you the difference.

Similarly, at any time, you can have Git compare the index copy of each file to the work-tree copy:

git diff

For files that are the same, Git says nothing. For files that are different, Git lists the file and shows you the difference.

(Note: adding --name-status makes git diff print only the file names, each prefixed with a status letter: M for modified, A for newly added, D for deleted, and so on. A file is deleted in the index by simply removing it from the index entirely. A file is added in the index if it's in the index but not in HEAD.)

The git status command runs both of these comparisons, with the --name-status limiter. Files that differ between HEAD and the index are listed as staged for commit. Files that differ between the index and the work-tree are listed as not staged for commit.
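The two comparisons can be run side by side in a scratch repository. In this sketch (file names and demo identity are assumptions), one file is changed and staged, and another is changed but left unstaged.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

echo readme > README.txt
echo 'print("v1")' > main.py
git add README.txt main.py
commit "first"

echo 'more text' >> README.txt
git add README.txt                 # README.txt: changed and staged
echo 'print("v2")' >> main.py      # main.py: changed, not staged

staged=$(git diff --cached --name-status)   # HEAD vs index
unstaged=$(git diff --name-status)          # index vs work-tree

echo "staged:   $staged"
echo "unstaged: $unstaged"
git status --short                 # both comparisons in one listing
```

In the short status listing, the first column reflects the HEAD-vs-index comparison and the second column the index-vs-work-tree comparison.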

Pictorially:

   HEAD         index        work-tree
----------    ----------    ----------
README.txt    README.txt    README.txt
main.py       main.py       main.py

The HEAD copy is frozen, because it's in a commit. The index and work-tree copies can change, but initially, all three match. You change the work-tree copy and use git add to copy it back into the index, compressing and en-Git-ing it (if "en-Git-ing" is a word, which it isn't). If you didn't mean to change it in the index after all, you use git reset (with its default --mixed action, or the way it works on any single file) to copy the frozen one back into the index.
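To check directly that git reset updates the index rather than emptying it, you can count index entries after a reset. A minimal sketch, assuming a scratch repository with the two files from the picture above and a demo identity:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

echo readme > README.txt
echo 'print("v1")' > main.py
git add README.txt main.py
commit "first"

echo 'print("v2")' >> main.py
git add main.py                    # stage a change

git reset -q                       # --mixed is the default

entries=$(git ls-files | wc -l | tr -d ' ')   # the index still lists every file
staged=$(git diff --cached --name-only)       # HEAD vs index: no differences
unstaged=$(git diff --name-only)              # index vs work-tree: main.py

echo "entries=$entries staged='$staged' unstaged='$unstaged'"
```

The index still has both entries after the reset. It simply matches HEAD again, and the edit to main.py survives in the work-tree as an unstaged change.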

This is also why `git commit` is so fast, compared to `xyz commit`

When you run git commit, Git already has all of the files that will go in the new commit, in the right form. It does not have to re-compress all the work-tree files and see if they match the frozen committed versions. The index has all of that ready to go: all it has to do is freeze the index copy, and if that's the same as the previous commit, share the file with the previous commit.

Moreover, since the index "knows" which files match the work-tree and which don't,¹ and has extra information about what's in the repository as well, this makes git checkout faster too. Suppose you're on master with its about-3500 files, and you git checkout some other branch with about-3300 of the files all being exactly the same. About 200 files are different between the two commits (maybe a few are new or deleted as well). Git can use the index to know what it might need to touch in the work-tree, and avoid touching those about-3300 files at all.

Hence, instead of the XYZ system scanning and maybe-touching 3500 files, Git scans and maybe-touches 200 files, saving over 94% of the work.
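The numbers above come from the Git repository for Git, but the same kind of count is easy to get for any two commits. This sketch uses a tiny scratch repository with three made-up files and a hypothetical branch named other. It only shows how to count the files that differ; it does not measure what git checkout does internally.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
commit() { git -c user.name=demo -c user.email=demo@example.com commit -q -m "$1"; }

echo a > a.txt
echo b > b.txt
echo c > c.txt
git add a.txt b.txt c.txt
commit "first"
base=$(git rev-parse --abbrev-ref HEAD)   # whatever the initial branch is called

git checkout -q -b other
echo changed > c.txt
git add c.txt
commit "change one file"

total=$(git ls-files | wc -l | tr -d ' ')
differing=$(git diff --name-only "$base" other | wc -l | tr -d ' ')

echo "$differing of $total files differ between $base and other"
```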

¹This often requires a scan of the work-tree. The index keeps copies of (caches) data about the work-tree, so as to speed this up. This is why the index is sometimes called the cache. Other VCSes, such as Mercurial, have a work-tree cache (Mercurial calls this the dirstate), but unlike Git's index, it's properly hidden: you don't have to know about it.