4

I read about The Three States in Git (https://git-scm.com/book/en/v2/Getting-Started-What-is-Git%3F). It says there that Git has three main states that your files can reside in: committed, modified, and staged.

Then I also read about the two states, tracked and untracked (https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository). It says there that each file in your working directory can be in one of two states: tracked or untracked. Tracked files are files that were in the last snapshot; they can be unmodified, modified, or staged.

Are the states mentioned in The Three States the same as the sub-states of tracked files? Are committed and unmodified the same?

Do these images show that they are the same?

[Figure: The lifecycle of the status of your files]

[Figure: The three file states for Git: modified, staged, and committed]

caramba
Ganuelito
  • Yes. You are right. `Committed` and `Unmodified` are the same. – yaho cho Apr 27 '19 at 05:33
  • The Three Stages should really be Four Stages, since a single file can have both staged and unstaged changes simultaneously. – chepner Apr 27 '19 at 05:43
  • To me it sounds like tracked files can be in one of the following states: 1. unmodified / committed; 2. staged; 3. modified; – Ganuelito Apr 27 '19 at 05:56

3 Answers

9

TL;DR

Tracked-ness is not a subset of the listed three states, and the listed three states are not sufficient to describe (or understand, really) how Git works.

Long

This "three states" thing is a bit of a white lie, which is probably why the page says:

Git has three *main* states

(emphasis mine). It's my opinion that the Pro Git book is doing a bit of a disservice here, as I think they are trying (for some good reasons) to hide the existence of Git's index from your initial view of everything. But in the very same paragraph, they introduce the idea of the staging area, which is really just another name for the index.

In fact, what's really going on here is that there are normally three copies of each file. One copy is in the current commit, a middle copy is in the index / staging-area, and a third copy is in your work-tree.

The middle copy (the one in the index) is not necessary, from a version-control-system design point of view. Mercurial is another version control system that is very much like Git, and it has only two copies of each file: the committed one, and the work-tree one. This system is much easier to think about and to explain. But for various reasons,[1] Linus Torvalds decided that you should have a third copy, wedged in between the commit and the work-tree.

It's useful to know that committed copies of files are in a special frozen, read-only, compressed, Git-only file format (which Git calls a blob object though you don't need to know that most of the time). Because such files are frozen / read-only, Git can share them across every commit that uses the same copy of the file. This can save enormous amounts of disk space: one commit of a ten megabyte file takes up to ten megabytes (depending on compression), but make a second commit with the same file and the new copy takes zero extra bytes: it just re-uses the existing copy. No matter how many more commits you make, as long as you keep re-using the old file, it takes no more space to store the file. Git just keeps re-using the original instead.
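To make the sharing idea concrete, here is a minimal Python sketch of a content-addressed store. The `blob_id` function follows the way Git names blobs in a default (SHA-1) repository: the hash covers a short `blob <size>` header, a NUL byte, and then the content. The `ObjectStore` class is purely hypothetical, a toy for illustration; real Git also compresses objects and packs them, which this leaves out.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID a default Git repository gives this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ObjectStore:
    """Hypothetical toy store: one entry per distinct file content."""

    def __init__(self):
        self.objects = {}

    def put(self, content: bytes) -> str:
        oid = blob_id(content)
        # If this content was stored before, the existing entry is re-used.
        self.objects.setdefault(oid, content)
        return oid


store = ObjectStore()
first = store.put(b"the same big file\n")   # file as saved by one commit
second = store.put(b"the same big file\n")  # unchanged file in a later commit
print(first == second, len(store.objects))
```

Both calls return the same ID and the store holds a single entry, which is the "zero extra bytes" effect described above.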

In fact, everything about a commit is frozen forever. No part of any commit—no file, no author information, no spelling error in the log message—can ever be changed. The best you can do is make a new and improved, different commit, that fixes the spelling error or whatever. Then you can use the new and improved commit instead of the old and lousy one, but the new commit is a different commit, with a different hash ID. The hash IDs are the true names of the commits (and, for that matter, of the blob objects that go with the commit snapshot).

So commits are permanent[2] and read-only. The files inside commits are compressed into a read-only, Git-only, freeze-dried format. Since commits are history, this keeps the history around forever, in case you ever want to look back at it to see what someone did, when, and why. But it's no good at all for getting any actual work done. You need files to be malleable, pliable, plastic, tractable, flexible, putty in your hands. You need to work with your files. In short, you need a work tree, where you can do your actual work.

When you git checkout a commit, Git extracts the freeze-dried copies into this work-tree. Now your files are all there where you can use them and change them. You would think that git commit would take the updated files from the work-tree and commit them—that's what Mercurial's hg commit does, for instance—but no, that's not what Git does.

Instead, Git inserts this third copy of each file in between the committed copy and the work-tree copy. This third copy, which is in the entity that Git sometimes calls the index, sometimes calls the staging area, and occasionally calls the cache—three names for one thing—is in the freeze-dried Git format, but importantly, since it's not in a commit, you can overwrite it any time. That's what git add does: it takes an ordinary file you have in your work-tree, freeze-dries it, and stuffs that into the index in place of whatever was in the index under that name before.

If the file wasn't in the index before your git add, well, now it is. And if it was in the index ... well, in either case, Git compressed the work-tree file into the appropriate freeze-dried format and stuffed that into the index, so now the index copy matches the work-tree copy. If the work-tree copy matches the committed copy (modulo any freeze-drying or rehydrating as appropriate), all three copies match. If not, you probably have two copies that match. But these aren't the only possibilities—they're just the main three, as we'll see in a moment.


[1] Most of these reasons come down to performance. Git's git commit is thousands of times faster than Mercurial's hg commit. Some of that is because Mercurial is written mostly in Python, but a lot of it is because of Git's index.

[2] More precisely, commits persist until nobody can find them by hash ID any more. That can happen when you switch from an old and lousy commit to a new and improved copy. After that, the old and lousy commits, if they're truly un-findable (as opposed to merely hidden from casual observation), are eligible to be removed by Git's garbage collector, git gc.


For each file, examine its state in the three copies

You've already picked some commit as the current (HEAD) commit, via git checkout. Git found that this commit has some number of files; it has extracted them all to both the index and the work-tree. Suppose you have just the files README.md and main.py. They are now like this:

  HEAD           index        work-tree
---------      ---------      ---------
README.md      README.md      README.md
main.py        main.py        main.py

It's pretty hard to tell from this table which file has which version, so let's add a version number:

  HEAD           index        work-tree
---------      ---------      ---------
README.md(1)   README.md(1)   README.md(1)
main.py(1)     main.py(1)     main.py(1)

This matches up with the Pro Git book's first state, committed: all three copies of each file are the same.

Now you modify one of the files in your work-tree. (These are the only files you can see and work on with ordinary non-Git commands.) Let's say you put version 2 of README.md into the work-tree:

  HEAD           index        work-tree
---------      ---------      ---------
README.md(1)   README.md(1)   README.md(2)
main.py(1)     main.py(1)     main.py(1)

Git will now say that you have changes not staged for commit to README.md. What this really means is that if we do two comparisons, first HEAD vs index and then index vs work-tree, the first shows no difference and the second shows that README.md differs. This matches up with the Pro Git book's "modified but not staged" state.

If you now run git add README.md, Git will freeze-dry the updated work-tree version-2 README.md and overwrite the one in the index:

  HEAD           index        work-tree
---------      ---------      ---------
README.md(1)   README.md(2)   README.md(2)
main.py(1)     main.py(1)     main.py(1)

The one small subtle change in the table is that now, in the comparison, HEAD-vs-index shows README.md changed, while index-vs-work-tree shows them to be the same. Git calls this situation changes staged for commit. This matches up with the Pro Git book's "modified and staged" state.

If you make a new commit now, Git will package up whatever is in the index right now (that is, the version 1 main.py and the version 2 README.md) and make the new commit using those files. Then it will adjust things so that HEAD means the new commit, instead of the one you had checked out earlier. So even though the old commit still has both files in their version-1 form, you now have:

  HEAD           index        work-tree
---------      ---------      ---------
README.md(2)   README.md(2)   README.md(2)
main.py(1)     main.py(1)     main.py(1)

and now all three copies of README.md match.

But suppose that, instead of making that commit, you had changed README.md in the work-tree again to make a version 3, and then run git add on that:

  HEAD           index        work-tree
---------      ---------      ---------
README.md(1)   README.md(3)   README.md(3)
main.py(1)     main.py(1)     main.py(1)

Then you change README.md some more to make a version 4, different from all three previous versions:

  HEAD           index        work-tree
---------      ---------      ---------
README.md(1)   README.md(3)   README.md(4)
main.py(1)     main.py(1)     main.py(1)

When we now compare HEAD-vs-index, we see that README.md is staged for commit, but when we compare index vs work-tree, we see that it's also not staged for commit. This doesn't match any of the three states—but it's possible!
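The two comparisons are easy to model. The following is only a toy sketch in Python, not how Git is implemented: each of the three copies is a plain dictionary from file name to a version number, matching the tables above, and the function name `two_comparisons` is made up for this illustration.

```python
def two_comparisons(head, index, work_tree):
    """Toy model of the two comparisons: HEAD vs index, then index vs work-tree."""
    staged = sorted(
        name
        for name in head.keys() | index.keys()
        if head.get(name) != index.get(name)
    )
    not_staged = sorted(
        name for name in index if work_tree.get(name) != index[name]
    )
    return {"staged": staged, "not_staged": not_staged}


# The last table above: README.md is at version 1, 3 and 4 in the three copies.
head = {"README.md": 1, "main.py": 1}
index = {"README.md": 3, "main.py": 1}
work_tree = {"README.md": 4, "main.py": 1}

result = two_comparisons(head, index, work_tree)
print(result)  # {'staged': ['README.md'], 'not_staged': ['README.md']}
```

The same file shows up in both lists, which is exactly the case the three-state picture has no name for.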

Tracked vs untracked

Tracked files are files that were in the last snapshot ...

This, unfortunately, is highly misleading. In fact, a tracked file is very simply any file that is in the index right now. Note that the index is malleable. It may have README.md version 3 in it right now—but you can replace that README.md with another version, or even remove that README.md entirely.

If you remove that README.md you get:

  HEAD           index        work-tree
---------      ---------      ---------
README.md(1)                  README.md(4)
main.py(1)     main.py(1)     main.py(1)

Version 3 is just gone now.[3] So now the README.md that's in the work-tree is an untracked file. If you put a version (any version) of README.md back into the index before running git commit, README.md goes back to being tracked, because it's in the index.

Since git checkout fills in the index (and the work-tree) from the commit you check out, it's not wrong to say that files that were in the last commit are probably tracked. But as I say here, it's misleading. The tracked-ness is a function of the file being in the index. How it got there is not relevant to the tracked-ness.
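Here is a small Python sketch of that rule, again using toy dictionaries rather than anything Git really stores. The helper name `untracked` is hypothetical. Note that the HEAD commit is never consulted: only the index decides. Deleting the dictionary entry stands in for removing the file from the index only (what `git rm --cached README.md` does), and assigning it again stands in for `git add README.md`.

```python
def untracked(index, work_tree):
    """Files present in the work-tree that have no entry in the index."""
    return sorted(name for name in work_tree if name not in index)


index = {"README.md": 3, "main.py": 1}
work_tree = {"README.md": 4, "main.py": 1}

before = untracked(index, work_tree)

del index["README.md"]                       # remove the index copy only
after_removal = untracked(index, work_tree)

index["README.md"] = work_tree["README.md"]  # copy the work-tree version back in
after_add = untracked(index, work_tree)

print(before, after_removal, after_add)  # [] ['README.md'] []
```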


[3] Technically, Git still has the freeze-dried copy as a blob object in its object database, but if no one else uses that freeze-dried copy, it's eligible for garbage collection now, and could go away at any time.


Git makes new commits from the index; new commits refer back to older ones

We already mentioned some of this above, but let's go over it again because it's crucial to understanding Git.

Each commit—really, each object of any kind—in Git has a hash ID specific to that one particular commit. If you write down the hash ID, and type it all in again, Git can use that hash ID to find the commit, as long as the commit is still in Git's master database of "all objects ever".

Each commit also has some number of earlier-commit hash IDs stored inside it. Usually that's just one previous hash ID. This one previous hash ID is the commit's parent.

Whenever you (or Git) have one of these hash IDs in hand, we say that you (or Git) have a pointer to the underlying object. So each commit points to its parent. This means that given a small repository with, say, just three commits, we can draw the commits. If we use single uppercase letters to stand in for our commit hash IDs, the result is a lot more useful to humans, though of course we'll run out of IDs pretty fast (so let's not draw more than just a few commits):

A <-B <-C

Here C is the last commit. We have to somehow know its hash ID. If we do, we can have Git fetch the actual commit from the database, and C holds the hash ID of its predecessor commit B. We can have Git use that to fish B out and find the hash ID of A. We can use that to fish out A itself, but this time, there's no previous hash ID. There can't be: A was the very first commit; there was no earlier commit for A to point back to.
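The backwards walk can be sketched in a few lines of Python. This is a toy model under obvious assumptions: the letters A, B and C stand in for real hash IDs as in the drawing, the `Commit` class and `history` function are invented for the example, and the dataclass is frozen only to mirror the fact that a commit cannot change once made.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Commit:
    message: str
    parent: Optional[str]  # hash ID of the parent; None for the very first commit


# Toy "database of all objects", keyed by stand-in hash IDs.
database = {
    "A": Commit("first commit", None),
    "B": Commit("second commit", "A"),
    "C": Commit("third commit", "B"),
}


def history(start, db):
    """Follow parent pointers backwards from `start` until there is no parent."""
    seen = []
    current = start
    while current is not None:
        seen.append(current)
        current = db[current].parent
    return seen


print(history("C", database))  # ['C', 'B', 'A']
```

Starting from B instead would never reach C, since nothing in B points forward.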

All these pointers always point backwards, by necessity. No part of any commit can change after we make it, so B can hold A's ID, but we can't change A to stuff B's ID into A. C can point to B but we can't change B to make it point to C. But all we have to do is remember the real hash ID of C, and this is where branch names come in.

Let's pick the name master and have Git save C's hash ID under that name. Since the name holds a hash ID, the name points to C:

A--B--C   <-- master

(For laziness and/or other reasons, I've stopped drawing the connectors in the commits as arrows. That's OK, because they can't change and we know they point backwards.)

Now let's check out commit C, using git checkout master, which fills in our index and work-tree from the files saved with commit C:

git checkout master

Then we'll modify some files, use git add to copy them back into the index, and last, run git commit. The git commit command will collect our name and email address, get a log message from us or from the -m flag, add the current time, and make a new commit by saving whatever is in the index right now. That's why we had to git add the files to the index first.

This new commit will have commit C's hash ID as the new commit's parent. The act of writing out the commit will compute the hash ID for the new commit, but we'll just call it D. So we now have:

A--B--C   <-- master
       \
        D

But now Git does something extremely clever: it writes D's hash ID into the name master, so that master now points to D:

A--B--C
       \
        D   <-- master

and now commit D is the last commit. All we need to remember is the name master; Git remembers the hash IDs for us.
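As a rough sketch of those steps in Python: the hypothetical `make_commit` below snapshots the index (never the work-tree), records the branch's current commit as the parent, stores the result, and then moves the branch name. The ID here is just a SHA-1 of some JSON, which is an assumption made to keep the toy short and is not Git's real commit format; author, email and timestamp are left out for the same reason.

```python
import hashlib
import json


def make_commit(objects, refs, branch, index, message):
    """Toy commit: snapshot the index, link to the old tip, move the branch name."""
    commit = {
        "snapshot": dict(index),      # a copy of the index, not of the work-tree
        "parent": refs.get(branch),   # previous tip, or None for the first commit
        "message": message,
    }
    payload = json.dumps(commit, sort_keys=True).encode("utf-8")
    commit_id = hashlib.sha1(payload).hexdigest()  # toy ID, not Git's format
    objects[commit_id] = commit
    refs[branch] = commit_id          # the branch name now points to the new commit
    return commit_id


objects, refs = {}, {}
index = {"README.md": 1, "main.py": 1}
c = make_commit(objects, refs, "master", index, "commit C")

index["README.md"] = 2                # the effect of editing, then git add
d = make_commit(objects, refs, "master", index, "commit D")

print(objects[d]["parent"] == c, refs["master"] == d)
```

The older commit is untouched by all this: its snapshot still has README.md at version 1, and only the name master has moved.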

What about git commit -a?

Git does have a way to commit whatever is in your work-tree, using git commit -a. But what this really does is, in effect, to run git add -u right before doing the commit: for every file that's actually, currently, in the index, Git checks to see if the work-tree copy is different, and if so, Git adds that file to the index. Then it makes the new commit from the index.[4]
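The "update only what is already in the index" behaviour can be sketched in Python as follows. This is a toy model with a made-up function name, `add_update`, working on dictionaries of file name to version number. It also covers one case the paragraph above doesn't spell out: git add -u removes index entries for files that have been deleted from the work-tree.

```python
def add_update(index, work_tree):
    """Toy version of `git add -u`: refresh only files already in the index."""
    for name in list(index):
        if name not in work_tree:
            del index[name]                # deleted in the work-tree: drop the entry
        elif work_tree[name] != index[name]:
            index[name] = work_tree[name]  # changed: stage the work-tree copy


index = {"README.md": 1, "main.py": 1, "old.txt": 1}
work_tree = {"README.md": 2, "main.py": 1, "notes.txt": 1}  # old.txt was deleted

add_update(index, work_tree)
print(index)  # {'README.md': 2, 'main.py': 1}
```

The new file notes.txt never makes it into the index, so a commit made this way leaves it untracked.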

This intermediate, third copy of every file—the one in the index—is why you have to git add all the time. As a new user of Git, it mostly gets in your way. It's tempting to work around it with git commit -a, and pretend it doesn't exist. But that eventually leaves you stranded when something fails with a problem with the index, and it leaves tracked-vs-untracked files entirely inexplicable.

Also, the presence of the index allows for all kinds of neat tricks, like git add -p, that are actually pretty useful and practical for some work-flows, so it's not a bad idea to learn about the index. You can leave a lot of this for later, but just remember that there's this intermediate freeze-dried copy, and that git status runs two comparisons—HEAD-vs-index, then index-vs-work-tree—and it all makes much more sense.


[4] This, too, is a white lie: Git actually makes a temporary index for this case. The temporary index starts as a copy of the real index, and then Git adds the files there. However, if all goes well with the commit, the temporary index becomes the index (the real, main index, as it were), so adding to the temporary index has the same effect. The only time this shows up is when the commit fails, or, if you're sneaky enough, when you go in and inspect the repository state while the git commit -a is still in progress.

The picture gets even more complicated if you use git commit --only, which makes two temporary indexes (indices?). But let's just not go there. :-)

torek
  • "Mercurial is another version control system that is very much like Git": No, it is *very much not* like Git: it was built with a Subversion mindset (and most of Subversion commands) without understanding the bigger picture of such a tool, especially when distributed: collaboration. Git, with its index, was made to integrate multiple contributions (patches), as explained here: https://stackoverflow.com/a/6718135/6309. Yes, it has effect on performance, but first it was a powerful mindset shift, to support a new model, which eventually caught on starting with GitHub. – VonC Apr 27 '19 at 09:35
  • @VonC: Mercurial and Git were developed in parallel for some time, with ideas from each crossing over to the other. Fundamentally, the two systems have pretty equal power in terms of managing source code. They both use a DAG of commits. With the addition of bookmarks, you can use Mercurial identically to the way you use Git: just leave everything in the "default" branch. It's true that in the (extremely) early days, people would use Git in ways that Mercurial doesn't support—and you *can* still do that, by using `git read-tree` and `git write-tree` instead of `git commit`, but nobody does. – torek Apr 27 '19 at 16:30
  • Note that in its day, Bitbucket was the Mercurial equivalent of GitHub. These days Bitbucket supports Git, of course... – torek Apr 27 '19 at 16:31
  • don't forget Google Code (https://en.wikipedia.org/wiki/Google_Developers#Google_Code) :) – VonC Apr 27 '19 at 16:49
  • @VonC Well, yes, but it's pretty dead. There was KilnHG but it's also gone. On the topic of things that the index buys you (that are very hard to do in hg) I should also mention subtree split and merge—those make heavy use of `git read-tree`. – torek Apr 27 '19 at 17:09
0

It's easy to grasp* that these two categories are the same thing if you make them a bit more explicit.


"committed" means

just committed (implying "...and no other operations have been made since")


"unmodified" means

unmodified since the last commit


* (to basically answer the title question, but see torek's answer for the precious details)

Romain Valeri
0

Commit c3e7fbc (May 2005, Git v0.99) is the first instance where "unmodified" was used, and illustrates that "unmodified" files are candidates for diff, even for renamed files:

[PATCH] Diff overhaul, adding the other half of copy detection.

This patch extends diff-cache and diff-files to report the unmodified files to diff-core as well when -C (copy detection) is in effect, so that the unmodified files can also be used as the source candidates.

This differs from the first occurrence of the term uncommitted, which shows what "uncommitted" is: commit 219ea3a, Sept. 2006, Git v1.5.3-rc0.

gitk: Show local uncommitted changes as a fake commit

If there are local changes in the repository, i.e., git-diff-index HEAD produces some output, then this optionally displays an extra row in the graph as a child of the HEAD commit (but with a red circle to indicate that it's not a real commit).
There is a checkbox in the preferences window to control whether gitk does this or not.

It included a comment like:

# tree has COPYING.  work tree has the same COPYING and COPYING.1,
# but COPYING is not edited.  
# We say you copy-and-edit COPYING.1;
# this is only possible because -C mode now reports the unmodified
# file to the diff-core.

"Uncommitted" remains the more general term when dealing with tracked elements.
A bit later, commit 6259ac6, Jul. 2008, Git v1.6.0-rc0 mentioned:

Documentation: How to ignore local changes in tracked files

This patch explains more carefully that .gitignore concerns only untracked files and refers the reader to

git update-index --assume-unchanged

in the need of ignoring uncommitted changes in already tracked files.

VonC