This is long, but it's probably a good idea for at least the OP to read it carefully.
Let me repeat TTT's comment, which is itself important so I'll highlight it again here (and add a bit of emphasis):
The misunderstanding comes from this statement, "but at first I do not track A.py because it's identical in both branches". Actually, at that point it's not "in" any branches yet and it's just following you around until you commit it.
Besides this, though, you have another misunderstanding about Git that shows here:
I initialize git and create two branches branch-x and branch-y. They don't track any files yet.
This is literally impossible (but also nonsense, and I'm not sure which of these to give priority). To avoid these misunderstandings, we need to go back to some basics about Git. Let's start with two definitions:
1. A Git repository consists mainly of a big database of Git commit objects plus other supporting objects. There's also a second (normally much smaller) database of names that exist to help you (and Git) find the commits because of what's in definition 2.
2. The commit is the central idea of Git. It's the be-all and end-all, the raison d'être for Git's existence in the first place. So you need to know, at a sort of deep gut understanding level, what a commit is, and this comes in three parts:
Each commit is numbered with a unique value called a hash ID or object ID (OID). This thing is normally expressed in hexadecimal; you'll see the commit numbers spill out when you run git log, for instance. These numbers are huge and ugly, out of necessity: Git requires that every commit get a unique number. Once you make a commit, that number is used up forever, for everyone, in every Git repository everywhere! [1] This trick enables two Git repositories to meet, at some unspecified point in the future, and learn which commits each has by exchanging just the hash IDs. (We won't get to this part here.)
Every commit stores two things: (a) a full snapshot of every file, and (b) some metadata, or information about the commit itself: who made it, when, and why (their log message), for instance.
The files in each commit—the commit's snapshot—are stored in a special, read-only, Git-only, compressed and de-duplicated form. The de-duplication deals with the fact that many, or most, commits mostly re-use most of the files from some earlier commit. If a new commit's snapshot is 100% identical to some older commit's snapshot, the new snapshot literally takes no space at all. You can get this in several ways, including if you change one file and commit, then decide to change it back and commit again. In this case the only space needed in the repository is for the commit's metadata; the two commits literally share all their files.
This file-sharing across and even within commits is possible only because all parts of every commit—every internal Git object, in fact—are completely read-only. Neither you nor Git itself can change any commit, or any file within any commit. Not only that, most of the programs on your computer cannot read any of the files stored inside a commit.
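To see the sharing and the Git-only storage for yourself, here's a small sketch. It assumes a throwaway repository made with mktemp; the file names a.txt and b.txt and the user identity are made up for illustration:

```shell
# Work in a scratch repository so nothing real is touched.
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# Two files with identical contents, committed together.
printf 'hello\n' > a.txt
cp a.txt b.txt
git add a.txt b.txt
git commit -q -m "two identical files"

# Ask Git which stored object each path refers to.
id_a=$(git rev-parse HEAD:a.txt)
id_b=$(git rev-parse HEAD:b.txt)
echo "$id_a"
echo "$id_b"

# The stored form is compressed; Git itself has to expand it for us.
git cat-file -p "$id_a"
```

Both echo lines should show the same object ID, since identical content is stored only once, and git cat-file expands the stored object back into the original text.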
This gives us a dilemma: if you can't use the files in a commit, what good are they?
[1] Git actually relies on a sort of statistical probability of uniqueness, which runs afoul of a couple of mathematical problems. So Git's scheme is doomed to fail someday. The huge size of the hash puts that day off as long as necessary, ideally billions of years or more. The "birthday paradox" or "birthday problem" is in some ways the worst one, and Git "handles" that one by the fact that we don't normally combine unrelated repositories.
See also How does the newly found SHA-1 collision affect Git?
You work with copies
Git's answer to this is actually pretty simple, and is the same as that found in many version control / source-code management systems (VCS or SCM). Put simply, you don't work on a commit at all. The commits are inviolate: pure, abstract, and removed from the programmer. They are off limits. Only the VCS itself can do anything with the commits. Instead, you work with copies.
When you first check out some commit, in Git, Git will read all the stored files that are in the commit. It will expand them out of the Git-ified form they have there, and turn them back into ordinary read/write files and put those somewhere for you. That makes a copy of all the files that were in the commit.
These copies—the files extracted from the commit—go into a work area: a working tree or work-tree, which is simply the place where you do your work. You don't work on the commit directly, but rather with the files that came out of the commit.
When you're done doing work with these files and wish to make a new commit, you tell the VCS "make a new commit". The VCS packages up the files to be committed, adds any necessary metadata such as your name and the current date and time, and uses all of that to make the commit. Most version control systems use this general idea, but Git gives it a twist.
Your working tree, Git's index, and tracked vs untracked files
Let's pause here for a moment and review a few things:
Git's commits are in Git's object database (in the hidden .git folder at the top of your working tree). These have snapshots, but they're not directly usable. The files in the commits aren't ordinary files, and they are not stored in folders. [2]
The files you work on / with are ordinary files, in ordinary folders. Git has no control over these files and folders while you work on and with them. This last part has something to do with why you must run git add so often.
This dichotomy between the version-controlled files and the working tree files means that there are literally two copies of each "current version" file: the one in the commit, and the one you're working on / with. Many other version control systems have this same issue. When you invoke their "commit" action (however they may spell it), they use that particular command or point-in-time as a signal that they should now scan your working tree to figure out what you've done. Git does not work this way.
Instead, Git keeps a third copy of each file! Perhaps we should say third "copy", because this extra copy, that sits between the committed copy and the working copy, is stored in the compressed and de-duplicated form. Since the initial third "copy" of each such file came out of the commit you chose to start with, it must match the committed copy. Since the index copy is already de-duplicated, and it duplicates the committed copy, it takes no extra space: it just shares the committed copy, invisibly.
This third copy of each file is in a thing for which Git also has three names. Git calls this the index, or the staging area, or (rarely these days) the cache. You mostly see that third name in the form of flags, like git rm --cached. The first name, "index", is kind of meaningless (which in some ways makes it the best name), and its synonym "staging area" refers to how you use it:
git add path/to/file.ext tells Git to read the copy of file.ext that lives in folder to of folder path of your working tree. Git will scan the file, compress and de-duplicate its contents, and then add-or-replace the index copy of a file named path/to/file.ext. [3]
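As a rough illustration of what that git add does, the sketch below stages one file in a scratch repository and then lists the index. The directory comes from mktemp and the file contents are made up:

```shell
# Scratch repository; no commit is needed just to stage a file.
repo=$(mktemp -d) && cd "$repo"
git init -q

mkdir -p path/to
printf 'some content\n' > path/to/file.ext

# Copy the working tree file into Git's index.
git add path/to/file.ext

# Each index entry shows: mode, blob object ID, stage number, full path name.
git ls-files --stage

# The index holds one entry whose name is the whole slash-separated path.
staged_path=$(git ls-files)
echo "$staged_path"
```

Note that the listing shows a single entry named path/to/file.ext, not a folder containing a file.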
When you first check out some commit to begin working on or with it, Git will:
- remove, from its index and your working tree, any files that are there from the previous commit; then
- add, to its index and your working tree, any files that must be there for the commit you'll be working on / with.
In this way, the index and your working tree now match the commit you've selected. As you work and run git add on files, Git updates its index copies, so that they're ready for the next commit. Then, when you do run git commit, Git can simply package up all the index files to go into the new commit. As they're already in the compressed and de-duplicated format, it's easy for Git to store them as a new commit.
What this means, in the end, is simple enough: Whatever is in Git's index right now is what Git thinks you plan to put in your next commit. As you work on files in your working tree, you must run git add on them, so as to update the index copy for your next commit. So the index can be described relatively simply as what you plan to put into your next commit. That's why we call it the staging area: you "stage" a file to make it ready for the next snapshot. However, the stage is pre-set based on the last snapshot. (The index takes on an expanded role during git merge, so it's more than just the staging area, but that's its main use here.)
But there's a trick here. Your working tree can contain files that are not in the index. Why are they not in the index? There's one obvious possibility: perhaps you just created that file, just now. In this case the file is not in Git and maybe has never been in Git. It's just there in your working tree. This is the case, initially, for your A.py file.
There's a second possibility, which we should cover now, though you didn't use it. You can tell Git to remove the file path/to/file.ext from its index. You normally do this with git rm, and normally, when you use git rm, you tell Git to remove both the index copy and your working-tree copy. So now it's absent from both places. But you can tell Git not to remove the working tree copy at all, using git rm --cached path/to/file.ext. Now there's a working tree copy, but no index copy.
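Here is a minimal sketch of that second route, again in a scratch repository. The file name notes.txt and the user identity are invented for the example:

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# Commit one file so that it is tracked.
printf 'remember the milk\n' > notes.txt
git add notes.txt
git commit -q -m "add notes.txt"

# Remove only the index copy; the working tree copy stays put.
git rm -q --cached notes.txt

ls notes.txt                 # still there in the working tree
indexed=$(git ls-files)      # empty: nothing is left in the index
status=$(git status --short)
printf '%s\n' "$status"
```

The short status should list notes.txt twice: once with a D in column 1 (the proposed next commit lacks a file that the current commit has) and once as ?? (there is a working tree file with no index copy).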
The important thing to note at this point is this: Regardless of how we arrive at this condition, it is possible to have a file in your working tree that has no corresponding copy in Git's index. When this is the case, Git calls that file an untracked file. So an untracked file is, by definition, a file that isn't in Git's index, and a tracked file is, by definition, a file that is in Git's index. This is all very simple, but there's one hitch: you can't see what's in Git's index directly. [4] Instead, we use git status to see what's in Git's index indirectly.
Running git status produces output like this:
On branch main
Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
        modified:   scanner.go

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   reader.go

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        reader_test.go
This displays a lot of information in a fairly compact form. We can get the same information even more compactly with git status --short (though this leaves out the On branch main part by default):
 M reader.go
M  scanner.go
?? reader_test.go
Note that the position of the M character is different for reader.go (unstaged) and scanner.go (staged) here. They also have different colors (but this doesn't show up in plain Stack Overflow text).
What Git is doing is comparing things:
First, Git compares the current commit's files to the staged files. Most of them are the same, so Git says nothing at all about these. One of them, scanner.go, is different, so Git says that this file is staged for commit. In other words, I edited scanner.go, but then I also ran git add scanner.go, so now the index copy matches the working tree copy, and both differ from the committed copy.
Then, Git compares what's in the index (the staged files) to your working tree. Here, most of them are the same, including scanner.go, but reader.go is different in the index and in the working tree. That's because I modified reader.go but have not yet run git add reader.go. So the committed copy and the index copy match (it wasn't mentioned in the "to be committed" section, in the long output), but the index and working tree copies differ, so it's not staged for commit.
Last, I created reader_test.go and have not run git add on it yet. So it's not in Git's index at all, but it is there in my working tree. The status command therefore complains about it. This fact actually shows up in the second comparison, but Git holds off the complaints until after it finishes listing any "not staged for commit" files.
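The two comparisons can also be run one at a time. The sketch below rebuilds roughly the same situation in a scratch repository; the file contents and the user identity are placeholders, and only the file names come from the example above:

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# Starting commit with two files.
printf 'v1\n' > reader.go
printf 'v1\n' > scanner.go
git add reader.go scanner.go
git commit -q -m "initial"

# scanner.go: edited and staged. reader.go: edited only. reader_test.go: new.
printf 'version two\n' > scanner.go
git add scanner.go
printf 'version two\n' > reader.go
printf 'a test\n' > reader_test.go

# Comparison 1: current commit vs index (what is staged for commit).
staged=$(git diff --cached --name-status)
# Comparison 2: index vs working tree (what is not staged for commit).
unstaged=$(git diff --name-status)
# Working tree files that have no index entry at all (untracked).
untracked=$(git ls-files --others --exclude-standard)

printf 'staged:    %s\n' "$staged"
printf 'unstaged:  %s\n' "$unstaged"
printf 'untracked: %s\n' "$untracked"
```

Each variable should name exactly one file: scanner.go for the first comparison, reader.go for the second, and reader_test.go as the untracked file.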
In the short status form, the staged-for-commit M status shows up in column 1, the not-staged-for-commit M status shows up in column 2, and the untracked status shows up as two question marks, one in each column. You can get both columns 1 and 2 to show an M status: just check out some commit, modify some file, git add that file, and then modify the file again. Now all three copies differ, so you'll get two M (modified) status fields.
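A quick way to try the two-M case, sketched in a scratch repository with a made-up file name and identity:

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# Committed copy.
printf 'one\n' > file.txt
git add file.txt
git commit -q -m "first version"

# Index copy: modify, then stage.
printf 'one\ntwo\n' > file.txt
git add file.txt

# Working tree copy: modify again without staging.
printf 'one\ntwo\nthree\n' > file.txt

# All three copies now differ, so both status columns are filled in.
status=$(git status --short)
printf '%s\n' "$status"
```

This should print MM file.txt: the first M is the commit-vs-index difference, the second is the index-vs-working-tree difference.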
If I were to git add reader_test.go, that would put a new file into the index. Running git status will show this as a newly added file, to be committed, or a letter-A in column 1. Again, this is all based on the state of things when I run git status: at this point, there would be no file named reader_test.go in the commit, but the two other copies, in Git's index and my working tree, would match. So the difference between the current commit and the proposed next commit includes "add this file", and the difference between the proposed next commit and the working tree, for this particular file, is "no difference, so don't mention it".
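The same transition, from untracked to newly added, can be watched in a scratch repository. The README contents and identity below are placeholders:

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# One existing commit, so that there is a "current commit" to compare with.
printf 'just for illustration\n' > README
git add README
git commit -q -m "Initial commit"

# A brand new file: in the working tree, not in the index.
printf 'package main\n' > reader_test.go
before=$(git status --short)

# Stage it: now it is in the index, but still not in the current commit.
git add reader_test.go
after=$(git status --short)

printf 'before add: %s\n' "$before"
printf 'after add:  %s\n' "$after"
```

Before the git add the file shows as ??, and afterwards it shows an A in column 1 with column 2 blank.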
[2] If you want to explore this idea further, look into loose vs packed objects and the difference between a tree object and a blob object in Git. Loose objects are in individual files, but they don't have normal file names, and packed objects are crammed together into a single file with delta compression applied. Your ordinary working tree files do have a correspondence to blob objects, which map 1-to-1 to loose files but many-to-one in pack files. It's relatively easy to read a loose object file (once you know which object name you want), but much trickier to extract the desired object from a pack file.
[3] Note that files in the index are not in folders: they just have path names that include forward slashes. This is true even on Windows: if your file is in path\to\file.ext in your working tree, it's still under the name path/to/file.ext (one file name, no folders, with forward rather than backward slashes) in Git's index.
[4] Actually, you can see what's in Git's index, using git ls-files, typically with --stage. But for a big repository, this produces a huge and therefore largely useless list of files.
The trickiest part: a new repository is empty
When you run git init to create a new .git directory containing a new Git repository, you've told Git to create the two repository databases. Remember those from our first definition:
There's a commits-and-other-objects database. It is empty: no commits exist, and there are no supporting objects for all zero of those (lack of) commits.
There are no branch names, no tag names, no other names at all. We didn't cover this earlier, but each name stores exactly one hash ID. A branch name in particular stores the hash ID of the latest commit for that branch.
Since there are no commits, there cannot be any branch names. You can't create any branch name (not branch-x, not branch-y, not main or master, nothing) until there is at least one commit.
Let's make a new empty Git repository and observe this effect:
$ mkdir tt && cd tt
$ git init
Initialized empty Git repository in .../tt/.git/
$ git branch branch-x
fatal: Not a valid object name: 'master'.
$ git branch
$
There are no branches! That's why git branch printed nothing. And yet:
$ git status
On branch master
No commits yet
nothing to commit (create/copy files and use "git add" to track)
I'm "on" branch master, even though branch master doesn't exist. And if I use git switch --orphan, I can change which non-existent branch I'm on. Note that I need the --orphan flag:
$ git switch main
fatal: invalid reference: main
$ git switch --orphan main
Switched to a new branch 'main'
$ git branch
$ git status
On branch main
No commits yet
nothing to commit (create/copy files and use "git add" to track)
$
Is that weird, or is that weird? But the fact is, you can't have a branch name until you have commits, and yet you can have a current branch name when you have no commits. It's just that this current branch name doesn't exist!
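If you want to poke at this state yourself, here's a sketch. It leans on one thing not covered above, namely that the current branch is recorded as a name inside HEAD, which git symbolic-ref can read; the branch name trunk is just an example:

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q

# Change which not-yet-existing branch we are "on".
git switch -q --orphan trunk

# HEAD records a branch name...
current=$(git symbolic-ref HEAD)
echo "HEAD names: $current"

# ...but the names database is empty...
refs=$(git for-each-ref)
echo "refs found: ${refs:-none}"

# ...so HEAD cannot be turned into a commit hash ID yet.
if git rev-parse --verify -q HEAD >/dev/null; then
    resolves=yes
else
    resolves=no
fi
echo "HEAD resolves to a commit: $resolves"
```

The name is written down, but nothing exists under that name until the first commit gives it a hash ID to store.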
The trick here is that Git will create that branch name when you make the first commit. So let's make a commit:
$ git commit
On branch main
Initial commit
nothing to commit (create/copy files and use "git add" to track)
$
Whoops! It turns out that we can't make an empty commit! [5] So what we'll probably do, and what I often do for instance, is create an initial file, maybe just a README, and commit that:
$ echo just for illustration > README
$ git add README
$ git commit -m "Initial commit"
[main (root-commit) 96800d2] Initial commit
1 file changed, 1 insertion(+)
create mode 100644 README
$
Now that I have one commit, I can create new branch names:
$ git branch branch-x
$ git branch branch-y
$ git branch
branch-x
branch-y
* main
Running git log --all --decorate --oneline --graph, we see that this one commit has all three branch names pointing to it:
$ git log --all --decorate --oneline --graph
* 96800d2 (HEAD -> main, branch-y, branch-x) Initial commit
$
But this commit does have one tracked file, namely README. So to create two new branch names, I had to create one file, make it tracked (git add it to copy it into Git's index), and run git commit to create my first commit and hence my initial branch. My initial branch name is now main, since I changed the name of the nonexistent branch before I made it exist; the git commit created the branch name and that very first commit.
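Since each branch name stores exactly one hash ID, you can also ask Git for those IDs directly. This sketch repeats the steps above in a scratch repository (the identity is a placeholder) and does not depend on what the initial branch is called:

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

echo "just for illustration" > README
git add README
git commit -q -m "Initial commit"

git branch branch-x
git branch branch-y

# Every branch name, with the one commit hash ID it stores.
git for-each-ref --format='%(refname:short) %(objectname)' refs/heads

head_id=$(git rev-parse HEAD)
x_id=$(git rev-parse branch-x)
y_id=$(git rev-parse branch-y)
```

All three names should list the same hash ID, because there is only one commit for them to select.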
The very first commit in any Git repository is at least a little bit special, because of this bootstrapping issue. So if you use GitHub or some other hosting site to create a repository, they will often fill in an initial commit for you, containing some initial file(s) such as a README, LICENSE, Copyright, etc., and maybe the misleadingly-named .gitignore.
Once you have the first commit, and its corresponding branch name, then you can make more branch names. But they'll have tracked files; or rather, the commit they select will have some files, and when you check out that commit, with git switch or git checkout, those files will be tracked.
[5] You can make an empty commit, using git commit --allow-empty. The --allow-empty flag really tells Git: Let me make this commit now even though there's nothing new. When you do this from a commit that has files, you get another commit that just re-uses the existing snapshot. When you do this in this special initial-no-files-no-commits setup here, you get a truly empty commit, one whose snapshot is the empty tree. But people normally don't do that, and the phrase empty commit in Git normally means "a commit that matches the previous one". There is one valid reason for making this kind of "empty" (really, matching) commit, and that's to have a new unique hash ID for a new tag: see my answer to Select Git tag from a "list" once I create it.
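For the curious, the truly empty case can be sketched like this in a scratch repository (placeholder identity). The comparison value is computed with git hash-object on empty input, a common idiom for obtaining the empty tree's ID:

```shell
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Example User"        # hypothetical identity
git config user.email "user@example.com"

# First commit in a new repository, with nothing staged.
git commit -q --allow-empty -m "truly empty root commit"

# The snapshot (tree) this commit stores, and the ID of an empty tree.
tree=$(git rev-parse 'HEAD^{tree}')
empty_tree=$(git hash-object -t tree /dev/null)
echo "commit's tree: $tree"
echo "empty tree:    $empty_tree"

# Listing the snapshot prints nothing: it contains no files.
files=$(git ls-tree -r HEAD)
echo "files in snapshot: ${files:-none}"
```

The two IDs should match, and the branch name now exists even though no file is tracked.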
Summary, or, what you've learned here
In bullet-point form:
A Git repository consists mainly of commits and supporting objects, stored in Git's object database, plus some names—branch names, tag names, and the like. The commits are numbered and hold snapshots and metadata, and the names help you find the commits.
A branch name exists to help you (and Git) find the latest commit on the branch. We have not gone into the mechanism here, but that's what it does. What this means is that the branch name selects one particular commit. For the name to exist, there must be a commit for it to select.
Checking out a commit fills in Git's index / staging-area, and your working tree, from the files that are in that commit's snapshot. It first removes from the index and your working tree any files that are there because of some previously-checked-out commit. It does not disturb files that are sitting around in your working tree, but are not in either the commit you're moving off of, or the commit you're moving to.
An untracked file is one that is not in Git's index right now. The index contents can and do change: you can git add, you can git rm, and you can switch from commit to commit. So the set of files that are tracked also can and does change. But at any time, if some file exists in your working tree right now, and isn't in Git's index right now, that file is untracked. If that file is in Git's index right now, that file is tracked.
Files that are in your working tree (the copies that aren't in Git's index or the current commit) aren't in Git at all! At most, they came out of Git, and if you git add and git commit they'll go into Git. But the copies in your working tree are yours to futz with, up until you choose to switch commits, or decide to use a Git command that uses or changes your working tree contents, such as git add or git rm or git restore.
The index or staging area holds your proposed next commit. Use git add or git rm to update your proposal.
If you want Git to save a file into the next snapshot, use git add on that file. If the file came out of the current snapshot, Git will replace the tracked copy with an updated tracked copy. If the file wasn't tracked before, Git will copy the file into Git's index, and now it's tracked.
If you want Git to omit the file from the next snapshot even though it's tracked, use git rm to remove both the index and working tree copies of that file. Note that when Git compares the previous snapshot (which has the file) to the new snapshot (once you make it), Git will say that this file has been "deleted", even if you used --cached when you ran git rm. It's important to remember that the file in the old commit is stuck there forever: no commit can ever be changed! If you ever check out the old commit, Git will copy the old version of the file out into your working tree.
There is a lot more to learn—in particular, the next thing to learn is that branch names don't really matter to Git; what matter are the commits—but this is a good starting point.