For various reasons, people find both the idea of "tracked/untracked files" and the idea of branches quite mysterious. In fact, neither one is.
The first notion to let go of is branches. They don't really mean anything! Well, that is, they don't mean what people tend to think they mean. They have some very specific definitions, and in fact, the word "branch" in Git has two different meanings. For more on this, see What exactly do we mean by "branch"? For now, though, think about what Git is doing purely in terms of moving from commit to commit, because this is where the issue comes from.
Commits, and how they form branches
In Git, the commit is almost everything. It's the overriding goal; it's the glue in the repository, and the reason for Git's existence. There's always[1] a current commit, called HEAD. But what, precisely, is a commit? The answer is that it consists of two or three parts, depending on how you count:
A commit stores a snapshot of a work-tree.
The work-tree or working-tree (or some variant of this spelling) is where you see your files, and edit them, and otherwise use them. The form in which they're stored inside the repository is no good for this, so Git provides you with a work-tree in which to, well, work.
The snapshot in a commit lets you access (as in git checkout) any earlier version you have committed. That is, if you made two commits yesterday, and three on Friday, you can view the entire work-tree as it was either way yesterday, or all three ways on Friday. To do so, you simply git checkout the commit, naming it via its big ugly SHA-1 hash ID, c0ffeeface or whatever. (You'll see these IDs whenever you run git log.)
In addition, a commit stores some metadata. In particular, each commit carries the name and email address of the person who made the commit, and a time-stamp. (In fact, there are two of these name / email / time-stamp triples, one for the "author" and one for the "committer", because of Git's history of emailed patches: this allows someone to email a patch and be the author, while someone else actually does the committing.)
In with this same metadata—though you might want to think of it separately—Git keeps a parent ID. The parent of each commit is the commit that was in place just before you made the new commit. Git is then able to use these parent links to navigate through the history of commits—only, it's backwards, working exclusively from "more recent" to "older". (The reason it is—and must be—backwards is that every internal Git object is read-only: once it goes in, it never, ever changes. It would make more sense to people for commits to remember their children, rather than having them remember their parents; but to do so while being read-only, the children would have to be born first, or at the same time as the parents. So Git has the children record their parents instead of the other way around, since the children are inevitably born later.)
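To make the "children remember their parents" idea concrete, here is a toy model in Python. It is only a sketch: the Commit class and the history function are names I made up for illustration, and the one-letter names stand in for real hash IDs. It also assumes each commit has at most one parent, which is not true of merge commits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: once created, a commit can never change
class Commit:
    name: str                    # stand-in for the big ugly hash ID
    parent: Optional["Commit"]   # None for a root commit


def history(tip):
    """Walk backwards from a commit to the root, newest first."""
    names = []
    commit = tip
    while commit is not None:
        names.append(commit.name)
        commit = commit.parent
    return names


# The parent must already exist before the child can point at it.
a = Commit("A", None)
b = Commit("B", a)
c = Commit("C", b)

print(history(c))  # ['C', 'B', 'A']
```

Note that nothing in A knows about B or C: the only direction you can walk is from newer to older.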
By using these parent links, Git can not only work backwards in history, it can also show you what changed. If the parent commit has a work-tree with a README file that says that apples are purple, and the child commit has a work-tree with a README file that says apples are green, Git can compare these two commits and say: "going from parent to child, you changed apples from purple to green."
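Comparing two snapshots can be sketched like this. It is a toy illustration, not how Git computes diffs: each snapshot is assumed to be a plain dict mapping a path to its full content, and changed_files is a hypothetical helper name.

```python
def changed_files(parent, child):
    """Return {path: (old, new)} for every path whose content differs.

    A path missing from one side shows up with None on that side.
    """
    changes = {}
    for path in sorted(set(parent) | set(child)):
        old, new = parent.get(path), child.get(path)
        if old != new:
            changes[path] = (old, new)
    return changes


parent_snapshot = {"README": "apples are purple", "main.c": "int main(void) {}"}
child_snapshot = {"README": "apples are green", "main.c": "int main(void) {}"}

print(changed_files(parent_snapshot, child_snapshot))
# {'README': ('apples are purple', 'apples are green')}
```

The commits themselves hold whole snapshots; the "change" only appears when you compare a child against its parent.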
This, in fact, is where branches (both the notion itself, and the names like master) come from. Sometimes, you want to "make a branch" so that changes will relate to an older or at least different parent:
A--B--C--E--G   <-- master
       \
        D--F   <-- branch
The name master here refers to commit G, the 7th commit ever made. Commit G's parent is not F, though, but rather E; and E's parent is C, whose parent is B, whose parent is A (and then we hit a so-called root commit that has no parent: obviously the first commit ever made has to be one of these). Meanwhile, the name branch refers to commit F, whose parent is D, whose parent is C. So commit C actually has two children, D and E.
The key here is that the names, master and branch, don't really mean anything to Git. They're just ways to get to the big ugly SHA-1 hashes. Git remembers that master means beadc0de and branch means feedbeef, so that if you say "I'd like to work on master now" Git knows to get commit beadc0de. And then, when you make a new commit, Git automatically updates the current branch so that it has the new commit's ID in it, storing the old ID as the parent of the new commit (this is how branches grow).
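Here is a small sketch of "a branch name is just a way to get to a hash, and committing moves it". Everything in it is an assumption made for illustration: the commits and branches dicts, the make_commit helper, and especially the way the fake ID is computed (real Git hashes the full commit contents, not just these two strings).

```python
import hashlib

commits = {}   # hash ID -> {"parent": hash ID or None, "message": str}
branches = {}  # branch name -> hash ID of the tip commit


def make_commit(branch, message):
    """Create a commit on top of the branch's current tip and move the name."""
    parent = branches.get(branch)  # old tip, or None for the very first commit
    # Fake ID for the toy model only; not Git's real hashing scheme.
    new_id = hashlib.sha1(f"{parent}:{message}".encode()).hexdigest()
    commits[new_id] = {"parent": parent, "message": message}
    branches[branch] = new_id      # the name now points at the new commit
    return new_id


first = make_commit("master", "initial commit")
second = make_commit("master", "apples are green")

print(branches["master"] == second)        # True
print(commits[second]["parent"] == first)  # True
```

The name master never held any commits. It held one ID, and that ID got replaced each time.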
So (as noted in What exactly do we mean by "branch"?), when humans say the word branch, they can mean the branch name (the word master, for instance), which simply locates the tip commit of the branch. Or, they can mean "some or all of the commits that can be found by starting at the branch tip and working backwards through history", so that master means all the commits back to A except for D and F, and branch means all the commits back to A except for E and G. Note that in this case, commits A-B-C are in fact on both branches.
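The second meaning of "branch" (everything reachable from the tip) can be checked against the graph drawn earlier. This is a sketch under the same single-parent assumption as the drawing; parents, tips, and reachable are names invented for the example.

```python
# The graph from the drawing: each commit maps to its one parent.
parents = {"A": None, "B": "A", "C": "B", "D": "C", "E": "C", "F": "D", "G": "E"}
tips = {"master": "G", "branch": "F"}


def reachable(branch_name):
    """List the commits found by starting at the tip and walking backwards."""
    found = []
    commit = tips[branch_name]
    while commit is not None:
        found.append(commit)
        commit = parents[commit]
    return found


on_master = reachable("master")
on_branch = reachable("branch")
shared = sorted(set(on_master) & set(on_branch))

print(on_master)  # ['G', 'E', 'C', 'B', 'A']
print(on_branch)  # ['F', 'D', 'C', 'B', 'A']
print(shared)     # ['A', 'B', 'C']
```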
[1] There's a problem with "always" in a new, fresh, empty repository: there are no commits, so there's no commit to be the current HEAD commit. Git handles this with some special cases, which we can just ignore here.
The index, and what it means to be "tracked"
The first problem we find with a Git snapshot vs a work-tree is that, for various reasons, we need to put extra files into real work-trees. In particular, if we compile code, or have temporary files or local configurations, or for any number of other good reasons, we need to have files that don't get committed, but live in the work-tree anyway. So all version control systems provide some way to have "non-versioned" files as well. Git's approach here, however, is unusual, perhaps even unique. What Git does is to expose something most version control systems keep hidden.
In Git, you build up the next commit in something variously called the index, the staging area, or sometimes (as in git diff --cached) the cache. These are all words for the same thing. The short version of the index is that it's simply "where you build the next commit".
To make a commit, you start with a work-tree, which holds versioned (tracked) files and other (untracked) files. You edit some file(s) in some way and then run git add. What git add does is simply to copy the file into the index. Then, once you have everything staged the way you like, you run git commit, and at this point Git makes the new commit from the index. But: What happens to the index afterward?
The answer is ridiculously simple: nothing. The index continues to hold exactly what went into the commit you just made!
This is therefore what it means for a file to be tracked: it's in the index.
That's it—that's all there is to it. A file is tracked if and only if it is in the index. If it is tracked, it will be in the next commit. If it is not tracked, it will not be in the next commit.
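The whole add / commit / "tracked" cycle fits in a few lines if we pretend the work-tree and the index are plain dicts. This is a toy model with made-up helper names (add, commit, is_tracked); the real index stores more than path-to-content pairs.

```python
work_tree = {"README": "apples are green", "notes.tmp": "scratch"}
index = {"README": "apples are purple"}  # as left behind by the previous commit
commits = []                             # each entry is a frozen snapshot


def add(path):
    """Copy the work-tree version of a file into the index."""
    index[path] = work_tree[path]


def commit():
    """Snapshot the index; the index itself is left exactly as it was."""
    commits.append(dict(index))


def is_tracked(path):
    """A file is tracked if and only if it is in the index."""
    return path in index


add("README")
commit()

print(commits[-1])              # {'README': 'apples are green'}
print(index == commits[-1])     # True: committing did not empty the index
print(is_tracked("README"))     # True
print(is_tracked("notes.tmp"))  # False: it is in the work-tree only
```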
What about .gitignore?
The name .gitignore is misleading: it's not exactly files to ignore. The drawback to having untracked files is that Git constantly complains about them. (Git: "whine! file foo is untracked! are you sure you want that? whine, whine") Putting a file name, or a matching pattern, into .gitignore mainly just shuts Git up about the untracked-ness. It doesn't actually make the file untracked: the file is untracked if and only if it's not in the index. It does make Git automatically skip the file when you say "add everything", though, and that's usually what we want.
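The "add everything" behavior can be sketched as follows. Treat it as a rough illustration only: add_everything is a hypothetical helper, and Python's fnmatch is a stand-in for Git's own pattern rules, which are richer (directory patterns, negation, and so on).

```python
from fnmatch import fnmatch


def add_everything(work_tree, index, ignore_patterns):
    """Copy work-tree files into the index, skipping ignored untracked files.

    A file that is already in the index is tracked, so an ignore
    pattern that happens to match it has no effect on it.
    """
    for path, content in work_tree.items():
        ignored = any(fnmatch(path, pattern) for pattern in ignore_patterns)
        if path in index or not ignored:
            index[path] = content
    return index


work_tree = {"main.c": "v2", "main.o": "object code", "debug.log": "new lines"}
index = {"main.c": "v1", "debug.log": "old lines"}  # debug.log is already tracked

add_everything(work_tree, index, ["*.o", "*.log"])

print(sorted(index))       # ['debug.log', 'main.c']
print(index["debug.log"])  # 'new lines': tracked, so the pattern did not matter
```

Here main.o stays untracked because it matched a pattern and was never in the index, while debug.log keeps being updated because it was in the index all along.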
Putting a file into .gitignore has one bad side effect though: it tells Git that Git should feel free to destroy the file as well, if necessary. There's an interesting side twist here as well, because the .gitignore file itself is usually tracked. So now it's time to consider how git checkout works.
How git checkout really works
I mentioned above that Git mostly cares about moving from commit to commit. This is true for git checkout branchname as well: Git translates the branch name into a raw commit hash, so as to get the files that go with that commit. However, when you check out a branch by name, as we usually do, Git saves that name as the current branch as well, so that it knows which branch name should get the next commit. If you check out a commit by its raw ID, you get what Git calls a "detached HEAD".
All that this "detached HEAD" means is that Git has a commit checked out by its raw ID. (This has consequences if you make new commits, so usually you want to get "back on" a branch, by checking out a name instead of a hash ID.) Meanwhile, though, Git still has the problem of moving from one commit to another, whether or not it's going to store the branch name for the next commit.
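The difference between being "on" a branch and having a detached HEAD can be modeled as HEAD holding either a branch name plus a commit, or a commit alone. This is a sketch with hypothetical names (checkout, record_commit) and made-up IDs for the new commits; it models only the bookkeeping, not the file handling.

```python
branches = {"master": "beadc0de", "branch": "feedbeef"}


def checkout(target):
    """Return the new HEAD: attached for a branch name, detached for a raw ID."""
    if target in branches:
        return {"branch": target, "commit": branches[target]}
    return {"branch": None, "commit": target}


def record_commit(head, new_id):
    """Move HEAD to a new commit, dragging the branch name along if attached."""
    if head["branch"] is not None:
        branches[head["branch"]] = new_id
    return {"branch": head["branch"], "commit": new_id}


attached = checkout("branch")     # by name: the name is remembered
detached = checkout("beadc0de")   # by raw ID: no name is remembered

detached = record_commit(detached, "c0ffee00")
print(branches)  # {'master': 'beadc0de', 'branch': 'feedbeef'}, no name moved

attached = record_commit(attached, "deadbeef")
print(branches["branch"])  # 'deadbeef', the name followed the new commit
```

That first case is the consequence mentioned above: a commit made with a detached HEAD has no branch name pointing at it.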
What Git does here is to use the index again. Again, the index always holds the next commit to make; but when you've just made one, so that the index and work-tree are "clean" and git status says "nothing to commit", the index and work-tree already match the current (HEAD) commit.
Let's say you're currently on master which is beadc0de, and you say git checkout branch which is feedbeef. The index (and work-tree) matches beadc0de, so Git compares beadc0de and feedbeef to see which files are different. It then replaces, in the index and the work-tree, those files. That includes the file .gitignore, if it's different!
Meanwhile (this is where your removed files come in), what if there are files in beadc0de that are not in feedbeef, or vice versa? What Git does here is just as simple as before: it removes files that aren't in the commit we're moving to, and creates files that are in that commit. This involves removing files from the work-tree, or writing new files into the work-tree.
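The commit-to-commit comparison described here can be sketched as a function that works out what a switch would have to do. It is a toy model: snapshots are assumed to be dicts of path to content, checkout_plan is a made-up name, and it ignores all of Git's safety checks about uncommitted or untracked files.

```python
def checkout_plan(current, target):
    """Work out what switching from one snapshot to another has to touch."""
    return {
        # in the commit we are leaving, not in the one we are moving to
        "remove": sorted(p for p in current if p not in target),
        # only in the commit we are moving to
        "create": sorted(p for p in target if p not in current),
        # in both, but with different content
        "replace": sorted(
            p for p in current if p in target and current[p] != target[p]
        ),
        # in both and identical: no need to touch these at all
        "untouched": sorted(
            p for p in current if p in target and current[p] == target[p]
        ),
    }


beadc0de = {"README": "apples are green", ".gitignore": "*.o\n", "master-only.txt": "hello"}
feedbeef = {"README": "apples are purple", ".gitignore": "*.o\n", "branch-only.txt": "hi"}

plan = checkout_plan(beadc0de, feedbeef)
print(plan["remove"])     # ['master-only.txt']
print(plan["create"])     # ['branch-only.txt']
print(plan["replace"])    # ['README']
print(plan["untouched"])  # ['.gitignore']
```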
Removing existing files from the work-tree clobbers them. Git normally tries hard not to clobber files, but (uh oh) if they're listed in .gitignore, Git feels free to clobber them!
So, if branch (i.e., feedbeef) has a .gitignore that ignores some files, and master (beadc0de) has those files tracked, Git can safely remove the files. They're stored in beadc0de, so you'll get them back when you switch back, and they're ignored in feedbeef so it's safe to clobber them. (In fact, I think being stored in beadc0de is sufficient here, although the rules get a bit squirrelly with files like .gitignore and .gitattributes that sometimes switch with checkout.)
This index-and-work-tree comparison, by the way, is also how Git decides when it can (and when it can't) let you switch from one branch to another with uncommitted changes. Git works very hard to do as little work as possible, so if it can switch from one commit to another without touching a file in the index and work-tree, it does so.