Could you also write what commands exactly should I use and in which order?
It's good to learn what, but it's even better to learn why. That is, knowing the fundamental working model that Git uses will help a whole lot.
The first and most important thing to know is that, while you may be using Git to store files, Git isn't really about files at all. Git is all about commits. A Git repository is a collection of commits. The commit is the unit-of-interaction, mostly. If you're into chemistry (or biochem), think of the commits as the molecules (or proteins): yes, they're made up of smaller pieces (atoms or amino acids), but if you take them apart like this they lose crucial properties. So you deal with them as a unit, most of the time.
If a repository is a collection of commits—this high level picture isn't quite right and we'll refine it in a moment, but it's a good start—then you'll need to know, in detail, what a commit is and does for you. The basic properties of each commit are these:
A commit is numbered, with a unique ID number. This number is very large and random-looking and expressed in hexadecimal. Git calls these things hash IDs or, more formally, object IDs or OIDs.
When I say unique, I really mean just that: the OID (or hash ID) of a commit is unique to that one commit, across all Git repositories everywhere. If any Git repository has a commit with that ID, it has that commit. If it doesn't have that commit, it has no object with that ID. This makes it easy to hook two repositories up to each other: they can just inspect each other's IDs, to determine who has which commits. Then one of them can send, to the other, the commits the other one needs but lacks.
Each commit stores a full snapshot of the entire set of files that go with that commit. You might think that this would make a repository grow enormously fat very quickly. It would, except that Git has a number of clever storage tricks. The most important one—worth knowing as it explains a lot of other things about Git—is that the files are stored in a de-duplicated manner. In particular, the contents of any one file are compressed—sometimes highly compressed, although that happens later—and then Git uses the same hashing trick to see if the same contents have already been stored. If so, Git just re-uses the previous content.
To make this work, all the files in a commit are read-only: they're frozen for all time. This makes it impossible to work on those files; we'll see the result of this in a moment.
Besides the frozen, Git-ified, de-duplicated files, each commit stores some metadata, or information about the commit itself. This includes things like the name and email address of the person who made the commit, from their user.name and user.email settings (which Git does not verify: you can set these to whatever you want; they are not used for authentication).
Crucially for Git itself, one of the items in the metadata is a list of previous commit hash IDs. This list typically contains just one hash ID: it's a list so that it can have no hash ID at all, or two or more, but usually it's just one.
Like the files, the metadata is frozen for all time. This is required by the hashing scheme. It is possible to take a commit out of the objects database, make some changes (to the metadata or to some files or both), and put the result back, but when you do that, you get a new commit with a new unique hash ID. The old commit remains in the database.
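The properties above can be made concrete with a toy model. This is a sketch in Python, not Git's real code: ToyObjectStore and its method names are invented for illustration, compression is left out, and the commit format is simplified. Only the blob ID formula (SHA-1 over a "blob <size>" header, a NUL byte, and the contents) is what Git actually uses in a SHA-1 repository.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Hash file contents the way Git hashes a blob in a SHA-1 repository."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


class ToyObjectStore:
    """Hypothetical key-value store: hash ID -> frozen bytes."""

    def __init__(self):
        self.objects = {}

    def put_blob(self, content: bytes) -> str:
        oid = blob_id(content)
        # Same contents give the same ID, so a duplicate costs nothing extra.
        self.objects.setdefault(oid, content)
        return oid

    def put_commit(self, snapshot: dict, parents: list, author: str, message: str) -> str:
        # Simplified text form; real Git uses tree objects and a stricter layout.
        lines = [f"file {name} {oid}" for name, oid in sorted(snapshot.items())]
        lines += [f"parent {p}" for p in parents]
        lines += [f"author {author}", "", message]
        body = "\n".join(lines).encode()
        header = b"commit " + str(len(body)).encode() + b"\0"
        oid = hashlib.sha1(header + body).hexdigest()
        self.objects.setdefault(oid, body)
        return oid


store = ToyObjectStore()
readme = store.put_blob(b"hello world\n")
again = store.put_blob(b"hello world\n")   # de-duplicated: same ID, stored once
first = store.put_commit({"README": readme}, [], "A U Thor <a@example.com>", "initial")
# "Changing" the metadata yields a different ID; the old commit is untouched.
edited = store.put_commit({"README": readme}, [], "A U Thor <a@example.com>", "initial, reworded")
print(readme == again, first != edited, len(store.objects))
```

The store ends up with three objects: one blob (shared by both commits) and two commits. Nothing was modified in place, which is the whole point.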
Commits form a backwards chain
Each commit holds the hash ID of a previous commit (except for those special cases where the list is shorter or longer, which we'll ignore for a moment). We say that this means the commit points to the earlier, or parent, commit. We can draw this: let's pick some uppercase letters to stand in for commit hash IDs. We'll start with H for "hash":
<-H
The "arrow" sticking out of H, pointing backwards (leftwards), represents the stored hash ID in H's metadata. That's the hash ID of some commit that existed when we made H. Let's call this earlier commit G and draw it in:
<-G <-H
Of course, commit G has a parent too; let's call it F and draw it in:
... <-F <-G <-H
This just keeps on going and going, until we run out of previous commits because we get to the very first commit, which is a little bit special: its list of previous commits is empty. It just doesn't point backwards! This lets Git stop going backwards. If we get a little lazy about the arrows, we can draw the full sequence of eight commits:
A--B--C--D--E--F--G--H
Now, we didn't really cover this earlier, but to find anything in the objects database, Git needs the hash ID of the thing, whatever it is. (This database holds the commit objects plus all the supporting objects, like "blobs" that hold file data, that make the commits actually useful.) So for a commit, Git desperately needs the hash ID of H to find H. We can write it down or something. (There's obviously a better way, but let's start with "we wrote it on paper / whiteboard / whatever".) We run, say, git log to look at our commits, and give Git the hash ID of commit H.
Git will use this hash ID to fish H out of the objects-database. That gets Git the metadata, and (indirectly) the full snapshot of all files, though git log by default doesn't use the snapshot here: it just prints the log message along with the author and/or committer and so on, from the metadata.
But then, with H in hand, Git uses the parent hash ID to find G. The git log command then shows us commit G. And of course G points back to F, so git log proceeds backwards to F and shows F, and so on. When git log has shown B and moves back to A and shows it, there's no "moving back" any more (A has no parent) and git log can finally stop.
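Here is that backwards walk as a small Python sketch. The commits dictionary and the walk_back function are made up for illustration, with single letters standing in for hash IDs as in the drawings. It follows only the first parent, so it ignores the two-or-more-parents case that real git log also has to handle.

```python
# Toy commit database: each "hash ID" maps to its metadata.
letters = "ABCDEFGH"
commits = {}
for position, name in enumerate(letters):
    parents = [letters[position - 1]] if position > 0 else []  # A has no parent
    commits[name] = {"parents": parents, "message": f"commit {name}"}


def walk_back(database, start):
    """Yield commit IDs from `start` back to the root, one parent at a time."""
    current = start
    while current is not None:
        yield current
        parents = database[current]["parents"]
        current = parents[0] if parents else None  # empty list: time to stop


for commit_id in walk_back(commits, "H"):
    print(commit_id, commits[commit_id]["message"])
```

Note that the walk needs a starting ID handed to it: there is nothing in the commits themselves that says H is the last one.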
Branch names
This sequence of commits, ending at H, can be called a branch. People (and Git) do call it a branch. But there's an obvious problem: we had to write down some big, ugly, random-looking hash ID. These are hard to type in. Why are we doing this when we have a computer? Let's store the hash ID in, say, a file! Let's call this file refs/heads/main. We'll stuff the hash ID of H into this file.
Then (here's an idea) let's call refs/heads/main a "branch"! Wait, don't we already have something called a branch? Yes, we do, but we're sloppy humans, who get stuff wrong, so we call this a branch too. We call it "branch main". It contains a hash ID, so it points to a commit, just like commits point to their parents; so let's draw it that way:
...--G--H <-- main
If and when we choose to make a new commit, however we do that, this new commit will use H as its parent. Let's call the new commit I and draw it in:
...--G--H   <-- main
         \
          I
In order to find commit I easily, let's have Git stuff whatever unique hash ID I got into the name main:
...--G--H   main
         \  ↙︎
          I
There's no real reason to put I on a separate line, so we can improve our drawing by just shoving I in between H and main:
...--G--H--I <-- main
So a Git repository is really two databases. One is a simple key-value store where the keys are hash IDs, and it contains commits and other supporting objects. The other is also a simple key-value database, but its keys are branch names, tag names, and all other kinds of names (all of which resemble file names and which Git sometimes stores in files, but not always, so that makes them kind of weird: this particular database is badly implemented, at least for really big projects).
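A minimal sketch of the two databases, in Python. Everything here is a toy: objects, refs, and make_commit are names I made up, the commit contents are reduced to a parent and a message, and the hashing is only there to produce unique-looking keys.

```python
import hashlib

objects = {}  # database 1: hash ID -> object
refs = {}     # database 2: name (e.g. refs/heads/main) -> hash ID


def make_commit(branch, message):
    """Create a commit whose parent is the branch's current commit, then move the branch."""
    parent = refs.get(branch)  # None when the branch has no commits yet
    body = f"parent {parent}\n\n{message}".encode()
    oid = hashlib.sha1(body).hexdigest()
    objects[oid] = {"parent": parent, "message": message}
    refs[branch] = oid  # the name now points to the new commit
    return oid


h = make_commit("refs/heads/main", "commit H")
i = make_commit("refs/heads/main", "commit I")
print(refs["refs/heads/main"] == i, objects[i]["parent"] == h)
```

The name only ever holds one hash ID, that of the latest commit; everything earlier is found by following parents out of the objects database.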
That's the bare minimum for any repository: the two databases. A "non-bare" repository comes with a third area, which ... well:
With the above in mind, let's work on / with a commit
As I said earlier, the files in any given commit are stored in a weird, Git-only fashion, being compressed and de-duplicated and frozen for all time. Literally nothing can write them, and only Git can read them. That makes them really hard to work with. They're almost useless! Except—well, they work just like an archive, like a tar or WinRAR or zip archive of files. What we have to do, then, is extract the archived files.
We do this with git switch (since Git 2.23) or git checkout (older versions; it still works, it's just that it has some, um, issues, so it's best to use the new commands if possible). We tell git switch that we'd like the latest commit from main, for instance:
git switch main
and Git will find whatever commit hash ID is stored in the name main and use that hash ID to extract all the files. Note: If you have not run git switch or git checkout yourself (ever), that's probably OK, because git clone ends by running this for you. We're going to skip all the git clone details but it's really just a fancy wrapper that runs five or six commands for you (one non-Git command, and four or five Git commands), with the last one being git switch.
The extracted files go into a work area, which Git calls your working tree or work-tree. This is the third part of a normal repository. (A "bare" repository omits this work-tree; we won't talk about why, here.)
Note that having been extracted from the objects database, these files are not in Git. They're in your working tree (which Git made for you), but that's just an ordinary folder with ordinary files in it. Git does not manage these files at this point: you manage them, with ordinary commands from your ordinary command line.
Every once in a while, though, you'll use a Git command that tells Git: look at some or all of my working tree and do something with it. This is where the Git commands you're asking about come in.
Before we get too involved with what those commands are, we need to talk about one other thing that Git does, that—if you're used to other version control systems—is rather weird. Most version control systems use this same pattern: you "check out" something to work on it, then you work on it, then you "check in" or "commit" the work. Git adds a special wrinkle.
When Git copies all the files out of a commit during git switch, Git makes a third (or 1.5th?) copy, or "copy" perhaps, of each file. That is, each file is stored in the commit as a frozen-for-all-time copy that's de-duplicated against other copies. But before Git extracts this frozen copy to your working tree, Git first puts a pre-de-duplicated copy in what Git calls its index or staging area (two words for the same thing).
Since this "copy" just came out of a commit, it's automatically a duplicate, and hence it takes no space. It does take very roughly 100 bytes for the entry in the index / staging-area, which holds the file's name and some cache data, leading to a third name for the index / staging-area: the cache. But cache is the worst name, and you mostly see the name "cache" in flags now, like git rm --cached.
The difference between the index copy and the committed copy is that you can replace the index copy, any time you like, with an updated (but still pre-de-duplicated) version of that file. You can also remove the index copy, with git rm for instance, and you can put into the index a totally-new file.
Having made the index copy (or "copy" since it's a duplicate) of the file, Git then de-compresses / de-Git-ifies the frozen-format file, turning it into an ordinary file in your ordinary working tree where you can use ordinary (non-Git) commands on that file. So now you finally have something you can see and, if you like, edit.
You work on these files, to any extent you like. Then you get to the Git commands:
git add file tells Git that it should read the working tree copy of the file, and make the index copy match. Git will read the file, compress and Git-ify it, and check for duplicates. If there is a duplicate, Git switches the index copy around to use the other duplicate. If not, Git has the file pre-compressed now, ready for committing, and puts that in the index.1
This is also how you add a totally-new-to-Git file. Git reads and compresses the file, checks for duplicates, and then updates the index with the new file.
git add . tells Git that it should scan the current directory and put all updated files in, similar to git add file for every file in this directory.
git add directory tells Git that it should read the entire given directory and add all its files, recursively adding any sub-directories (and in fact this is how git add . works).
git add -u, which is a command I use a lot, tells Git that it should find all modified files on its own and add all of those.
Note that in most cases, Git is merely updating the index copies. The index holds all the files: they all came out during the git switch, and went into the index. So they're already there and we are just updating the pre-de-duplicated, ready-to-commit copies of the added files, except when we git add a totally-new file name. That one wasn't in Git's index. Now it is.
Note too that git add . or git add dir is a kind of en-masse "add everything" operation. So this, too, can add all-new-files to Git. However, git add -u isn't this kind of operation: instead, it tells Git to scan through the index and check to see if some of the already catalogued files are modified. So git add -u will never add a new file, it will only update existing files. So: how do you know what these commands are going to do? The key here is to use git status, but git status is a little bit complicated.
The main thing to remember at this point about Git's index is that it acts as your proposed next commit. We'll see more about this in a moment.
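To see the difference between the various git add forms, here is a toy Python model. The index maps file names to content hashes and the working tree maps file names to contents; add_file, add_update, and add_all are hypothetical stand-ins for git add file, git add -u, and git add . run at the top level. The sketch assumes current Git behavior where these operations also drop index entries for files you have deleted, and it ignores directories and .gitignore entirely.

```python
import hashlib


def oid(content: str) -> str:
    """Stand-in for Git's content hash (shortened for readability)."""
    return hashlib.sha1(content.encode()).hexdigest()[:7]


def add_file(index, worktree, name):
    """Like `git add name` for a file that exists: make the index entry match."""
    index[name] = oid(worktree[name])


def add_update(index, worktree):
    """Like `git add -u`: refresh entries already in the index, never add new names."""
    for name in list(index):
        if name in worktree:
            index[name] = oid(worktree[name])
        else:
            del index[name]  # file was deleted from the working tree


def add_all(index, worktree):
    """Like `git add .` at the top level: refresh everything and add new names too."""
    add_update(index, worktree)
    for name, content in worktree.items():
        index[name] = oid(content)


index = {"a.txt": oid("old"), "b.txt": oid("same"), "gone.txt": oid("bye")}
worktree = {"a.txt": "new", "b.txt": "same", "c.txt": "brand new"}

after_u = dict(index)
add_update(after_u, worktree)
after_dot = dict(index)
add_all(after_dot, worktree)
print(sorted(after_u))    # c.txt is absent: -u never adds new files
print(sorted(after_dot))  # c.txt is present
```

In both cases a.txt gets the new hash and gone.txt drops out of the index; only the en-masse form picks up c.txt.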
1Technically, Git always puts just the hash ID (and cache data and staging number) into the index. You can, if you like, imagine a system where Git puts the new-to-the-repository contents in a separate area, hashed and ready to go, and then moves them into the database during git commit: this would work fine and Git actually does something like that in certain other cases. But in fact Git just cheats and dumps the contents directly into the objects database, relying on cleanup maintenance tasks to rip them back out if you wind up not using them after all.
git diff
Let's take a brief (very brief, hardly touching on anything here) side trip over git diff. We can run git diff on any two commits. Git will extract, to a temporary area in memory, all the files from both commits and will then compare those files. Since they're pre-de-duplicated, Git can cheat and only compare the ones that are actually different: Git knows "in advance" that the duplicates are identical. So that makes this kind of git diff pretty fast.
Git will then compare two files that have the same names, but different contents, and show us Git's own idea of what changed. Git is basically playing a game of Spot the Difference here. If we use this on the parent and child commits, we see what we changed (or some approximation of that, depending on how clever Git is with its spot-the-difference game).
We can run git diff --name-only or git diff --name-status to suppress the actual comparison of the contents, and just tell us which files changed. Why does this matter? Well, let's move on to git status.
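The "only compare the ones that are actually different" trick can be sketched in a few lines of Python. Each snapshot here is a plain dictionary from file name to content hash ID; the name_status function and the sample data are invented for illustration, and rename detection (which real Git can also do) is left out. The status letters A, M, and D are the ones git diff --name-status prints for added, modified, and deleted files.

```python
def name_status(old, new):
    """Compare two snapshots (file name -> hash ID) without looking at contents."""
    result = []
    for name in sorted(set(old) | set(new)):
        if name not in old:
            result.append(("A", name))   # added: only in the new snapshot
        elif name not in new:
            result.append(("D", name))   # deleted: only in the old snapshot
        elif old[name] != new[name]:
            result.append(("M", name))   # modified: same name, different hash ID
        # equal hash IDs mean identical contents, so there is nothing to report
    return result


parent = {"README": "3b18e51", "main.py": "aaa1111", "old.cfg": "bbb2222"}
child = {"README": "3b18e51", "main.py": "ccc3333", "new.cfg": "ddd4444"}
for status, name in name_status(parent, child):
    print(status, name)
```

README never gets opened at all: matching hash IDs are enough to know it is unchanged.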
git status
What git status does is try to tell you about the state of your working tree and anything else Git thinks might be important:
- Git will tell you which branch name you're using (we won't cover multiple branch names here though).
- Git will give you "ahead" and/or "behind" counts for this branch compared to its upstream (we won't describe the upstream of a branch here; I'll just note that it's optional, so not all branches have one). These are counts of commits and to describe them well, we would have to talk about multiple branch names.
Now git status goes on to tell you about changes not staged for commit and changes staged for commit (and also untracked files).
The first thing to keep in mind here is that Git does not store changes. So what could Git be talking about?
Well, we already saw that we can run git diff on any two commits. If we add --name-only or --name-status, Git tells us which files are different. So: what if we had Git run git diff, but this time, compare the current commit to the files in Git's index? We'll do this with --name-status so that we get:
- file mod.ext is modified (in the index, because you ran git add)
- file new.txt is new (new in the index, because you ran git add)
- file vanished has been removed (from the index, with git rm vanished for instance)
These will be changes staged for commit. Git is telling us that if we turned the index, with the files it has in it right now, into a new commit, these are what would be different in the new commit. For two files that match (are duplicates that are de-duplicated), Git says nothing.
We can now run git add or git rm or whatever, if this list isn't right, to change what's in the index. Running git status again will give us a new comparison of current-commit-vs-the-index. So git status lets us know what's changed in our proposed commit.
Next, having compared the current commit to the index, Git will do a second (and rather harder) comparison: it will run a git diff that compares what's in Git's index to what's in your working tree. This git diff also gets the --name-status treatment, because git status is not going to show you the spot-the-difference differences, just the file names and such.
If some file is changed—that is, is there in both the index and your working tree, but the index contents don't match the working tree contents2—Git will call that file not staged for commit. If the contents match, Git will say nothing at all.
As before, if you've deleted a file (rm vanished) but it's still there in Git's index, Git will say that a deletion is "not staged for commit". But in another special case, if a file is there in your working tree, but not in Git's index, this is a separate category: Git calls this an untracked file.
2Note that this requires either de-Git-ifying the index copy, or Git-ifying the working tree copy. These could be equivalent, but if you configure Git to mess with line endings, they could be different! This is where things get really complicated.
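Putting the two comparisons together gives a rough Python model of what git status reports. The three dictionaries (current commit, index, working tree) map file names to stand-in content IDs; status and name_status are hypothetical helper names, and the line-ending complication from the footnote is ignored completely.

```python
def name_status(old, new):
    """Return (letter, name) pairs: A = added, D = deleted, M = modified."""
    result = []
    for name in sorted(set(old) | set(new)):
        if name not in old:
            result.append(("A", name))
        elif name not in new:
            result.append(("D", name))
        elif old[name] != new[name]:
            result.append(("M", name))
    return result


def status(head, index, worktree):
    """Model the three lists that git status prints."""
    staged = name_status(head, index)                  # current commit vs index
    second = name_status(index, worktree)              # index vs working tree
    unstaged = [(s, n) for s, n in second if s != "A"]
    untracked = [n for s, n in second if s == "A"]     # in the work tree, not in the index
    return staged, unstaged, untracked


head = {"mod.ext": "v1", "vanished": "v1", "keep.txt": "v1"}
index = {"mod.ext": "v2", "new.txt": "v1", "keep.txt": "v1"}
worktree = {"mod.ext": "v2", "new.txt": "v1", "keep.txt": "v1-edited", "scratch.tmp": "x"}

staged, unstaged, untracked = status(head, index, worktree)
print("staged:", staged)
print("not staged:", unstaged)
print("untracked:", untracked)
```

Here mod.ext, new.txt, and vanished show up as staged, keep.txt (edited but not yet added) shows up as not staged, and scratch.tmp is untracked.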
Untracked files and .gitignore
I'm going to be as brief as I can here again, so this won't cover everything, but:
An untracked file is a file that is in your working tree, but not in Git's index right now. That's all there is to it.
git status usually whines about untracked files.
But, some files not only should be untracked right now, they should stay that way forever. An example is the compiled .pyc or .pyo files that Python makes: those should never be committed.
You can use the .gitignore file to (a) make Git stop whining about an untracked file and (b) make en-masse "add everything" operations not add such files. Once a file is tracked, however, listing it in .gitignore stops having any effect. You must manually remove that file from Git's index (so that it's now untracked again) and now the listing in .gitignore has an effect again.
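The "only affects untracked files" rule is easy to get wrong, so here is a toy Python illustration. It uses simple shell-style patterns matched against the file's base name; real .gitignore syntax has more rules (negation, directory-only patterns, anchoring) that this does not attempt. The function names and sample paths are invented.

```python
from fnmatch import fnmatchcase
from posixpath import basename


def is_ignored(path, patterns):
    """True if the file's base name matches any pattern (a simplification)."""
    return any(fnmatchcase(basename(path), pattern) for pattern in patterns)


def untracked_to_report(index, worktree, patterns):
    """Untracked files that git status would still complain about."""
    return sorted(p for p in worktree if p not in index and not is_ignored(p, patterns))


def paths_to_add(index, worktree, patterns):
    """What an en-masse add would stage: tracked files always, untracked only if not ignored."""
    return sorted(p for p in worktree if p in index or not is_ignored(p, patterns))


patterns = ["*.pyc", "*.pyo"]
index = {"app.py", "old/cache.pyc"}          # cache.pyc was committed by mistake
worktree = {"app.py", "old/cache.pyc", "app.pyc", "notes.txt"}

print(untracked_to_report(index, worktree, patterns))
print(paths_to_add(index, worktree, patterns))
```

The already-tracked old/cache.pyc keeps getting added despite matching *.pyc; it has to leave the index (git rm --cached) before the pattern applies to it.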
Proper use of .gitignore makes git status useful. You'll see changes staged and/or not-staged for commit, and you won't have thousands of untracked files that obscure the useful information. If you don't have a lot of files that you need to keep untracked, you don't need .gitignore. If you're willing to put up with a lot of whining, and are willing to git add every file individually, you don't need .gitignore. But having .gitignore makes use of Git much more pleasant, as it stops the whining and enables en-masse "add ." operations.
git commit
The last command here is git commit. We've actually already described what it does:
- It makes a new snapshot from whatever is in Git's index right now.
- It collects metadata from you (e.g., a log message and your user.name).
- It uses all of this stuff to produce a new commit object, which gets a new, unique hash ID.
- It writes this new commit's hash ID into the current branch name.
Since the parent of the new commit is the commit that was current until git commit finished, and since Git needs to write the new commit's hash ID into the current branch name in the repository's names database, Git needs to know, at all times, both the current branch name and the current commit hash ID.
The mechanism Git uses for this is the special (magic) name HEAD, written in all uppercase like this. If you want a shortcut for HEAD, use @: don't use lowercase head. That (lowercase head) sometimes works, particularly on Windows and macOS, but as soon as you start using git worktree add, it stops working right, so don't get into a bad habit here.
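Here is a small Python sketch of how HEAD ties the four git commit steps together. When HEAD is attached to a branch, Git stores it as the text "ref: refs/heads/main" (for a branch named main); the rest of the sketch (the names and objects dictionaries, the toy index, and the commit function) is made up for illustration and skips almost everything real Git does.

```python
import hashlib

names = {"HEAD": "ref: refs/heads/main"}  # attached HEAD: it names a branch
objects = {}
index = {"README": "contents v1"}         # toy index: file name -> contents


def current_branch():
    value = names["HEAD"]
    return value[len("ref: "):] if value.startswith("ref: ") else None


def current_commit():
    branch = current_branch()
    # No branch means a detached HEAD holding a raw hash ID.
    return names.get(branch) if branch is not None else names["HEAD"]


def commit(author, message):
    parent = current_commit()             # the commit that is current right now
    snapshot = dict(index)                # step 1: freeze what is in the index
    record = repr((sorted(snapshot.items()), parent, author, message)).encode()
    new_id = hashlib.sha1(record).hexdigest()  # step 3: new object, new hash ID
    objects[new_id] = {"snapshot": snapshot, "parent": parent,
                       "author": author, "message": message}  # step 2: metadata
    branch = current_branch()
    if branch is not None:
        names[branch] = new_id            # step 4: the branch name moves forward
    else:
        names["HEAD"] = new_id
    return new_id


first = commit("A U Thor <a@example.com>", "initial")
index["README"] = "contents v2"
second = commit("A U Thor <a@example.com>", "update README")
print(names["HEAD"], names["refs/heads/main"] == second)
```

HEAD itself never changes while you stay on main; it is the branch name underneath it that moves with each commit.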
Some people like to run git commit -a instead of git add -u and then git commit. This is OK, but I'd call that a bad habit too (though not as bad as lowercase head). The main problem here is that git add -u does not add new files, so git commit -a won't either. (A secondary problem is that this kind of commit uses multiple active index files, that is, multiple staging areas internally, and some Git hooks get all discombobulated. That's a bug in such hooks, but it is a real problem.)
Summary
To make a new commit, you:
- switch to the branch you want (or create a new one and switch to it), which fills in Git's index and your working tree: git switch;
- create, modify, and/or delete files as desired (possibly including .gitignore);
- use git status and git add and git rm / git rm --cached as needed to make the git status output right, looping through these steps until git status looks good;
- optionally run git diff --staged to view the changes you're about to commit as changes, rather than as a snapshot; and
- run git commit to commit what's in Git's index.
The optional git diff --staged is the same git diff that git status ran, only without the --name-status part. So this not only shows you which files are modified, but also what Git spots with its spot-the-difference code.