Note: this is quite long, but you really need to know these things. I've run out of space (there's a 30k limit on characters), so I'll break this into three separate answers. Part 2 is here; part 3 here.
While "pull requests" are not part of Git (they're specific to GitHub 1), there are some things we can say about them even without referring specifically to GitHub. Then we can plug in GitHub-specific items later. So let's start with this:
Git is all about commits. While Git commits contain files, Git isn't really about the files, but rather about the commits. And, while we use branch names to find commits, Git isn't really about branch names either: it's really just about the commits.
This means you need to know all about commits: what one is and what each commit, and a string of commits in a row, can do for you.
So we'll start with a quick overview of a commit, and then look at a string of them in a row.
1Bitbucket also has "pull requests", but they're very slightly different, and GitLab has "merge requests", which are again same-but-different. All of these build on the same base support in Git proper.
Commits
Each Git commit is numbered. The numbers are not simple sequential counting numbers, though: we don't have commit #1 followed by #2 and #3 and so on. Instead, each commit gets a unique hash ID—unique across all repositories everywhere, even if they're not related to your repository at all2—that seems random, but isn't.3 A hash ID is big, ugly, and impossible for humans to work with: computers can handle them, but our feeble brains become confused. So, below, I'll use fake hash IDs where I just use a single uppercase letter to stand in for a real hash ID. Note that for these hash IDs to work, every part of a commit has to be entirely read-only. That is, once you make a new commit, that commit is frozen in time forever. That particular hash ID, whatever hash ID it got, is for that commit, and no other commit—past, present, or future—can ever use that hash ID.
In any case, each Git commit stores two things:
- A commit stores a full snapshot of every file (that Git knew about at the time you, or whoever, made it, anyway). To keep the repository from becoming hugely fat, these files are (a) compressed and (b) de-duplicated. As such, they're stored in a format that only Git can read, and nothing, not even Git itself, can overwrite. As we'll see, this solves some problems but creates one big one.
- A commit also stores some metadata, or information about the commit itself. This includes, for instance, the name and email address of the person who made the commit (from their user.name and user.email settings, which they can change any time they like, so it's not reliable without verification, but it is still useful). It includes a log message: when you supply one for your own commits, you should write up an explanation of why you made the commit. What you did (such as change one instance of 7 to 14) is something Git can show on its own, but why did you change 7 to 14? Was it to go from weeks to fortnights, or was it because the 7 Dwarfs were all cloned?
Inside the metadata for a commit, Git adds, for its own purposes, a list of raw hash IDs for previous commits. This list is usually just one element long: for a merge commit (which we won't cover here) it's two elements long, and at least one commit in any non-empty repository is the very first commit, where there aren't any previous commits, so that this list is empty.
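If you'd like to see this metadata for yourself, here is a minimal sketch using git cat-file -p, which prints a commit's raw contents. The throwaway directory, the file name settings.txt, and the user identity are all made up for illustration.

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"    # illustrative identity only
git config user.email "user@example.com"

echo "timeout = 7" > settings.txt
git add settings.txt
git commit -q -m "Set timeout to one week"
root=$(git rev-parse HEAD)             # hash ID of the very first commit

echo "timeout = 14" > settings.txt
git add settings.txt
git commit -q -m "Use fortnights: the reporting cycle is now two weeks"

git cat-file -p HEAD       # tree, one parent line, author, committer, message
git cat-file -p "$root"    # the first commit has no parent line at all
parent=$(git rev-parse HEAD^)
echo "parent of the second commit: $parent"
```

The parent line in the second commit holds the raw hash ID of the first one, which is the "list of previous commits" described above.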
2This is why the hash IDs have to be so big and ugly. They don't, strictly speaking, have to be unique across two repositories that won't ever meet, but Git does not know whether or when two repositories might meet each other in the future, and if two different commits have the same hash ID at that time, bad things happen. I call such a commit a Doppelgänger, a sort of evil twin that's a harbinger of disaster. The actual disaster is (or at least should be) just that the meeting of those two Git repositories fails. In some very old versions of Git, worse things actually did happen, due to bugs. In any case it's just not supposed to happen at all, and the size of the hash helps avoid that.
3Current hashes are SHA-1 checksums of all the data in the commit, which includes data about the commits leading up to the commit, hence it's a checksum of the entire history leading up to that point. SHA-1 is no longer cryptographically secure. Though this does not break Git by itself, Git is moving to SHA-256.
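The "seems random, but isn't" property is easy to check. This is a small sketch using git hash-object, which hashes file contents (a blob) rather than a whole commit, but the same idea applies: the hash ID is computed from the data, so identical data always gets the identical ID. The sample strings are arbitrary.

```shell
set -e
# --stdin makes git hash-object hash whatever it reads; no repository is needed.
first=$(printf 'seven days\n' | git hash-object --stdin)
again=$(printf 'seven days\n' | git hash-object --stdin)
other=$(printf 'fourteen days\n' | git hash-object --stdin)

echo "content:            $first"
echo "same content again: $again"
echo "different content:  $other"
```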
Chains of commits
Given the above, we can draw the three commits in a tiny little three-commit repository like this:
A <-B <-C
Commit C is our third and latest-so-far commit. It has some random-looking hash ID, and a snapshot of all the files. One or two files in C differ, probably, from all the files in earlier commit B, and the rest are the same as in B and are therefore literally shared with earlier commit B. So they don't take any actual space. The modified files do take some space, but they're compressed (sometimes very compressed) and might take hardly any space. There's a little space for the commit metadata (which is also compressed, by the way), but overall, this full-snapshot-of-every-file probably doesn't take much space.
Meanwhile, commit C contains the raw hash ID of earlier commit B. We say that C points to B. This means that if Git can find C (we'll see how it can do that in a moment), Git can use the hash ID in C to find B too. Git can then extract, from both commits, all the files in the two snapshots, and compare them. The result of comparing the files is a diff: instructions for changing the files in B into the files in C (or vice versa, if you have the diff done in the other order).
Git, and sites like GitHub, will generally show a commit as a diff, as that's often more useful than showing the raw snapshot. But you can easily get the snapshot instead, if you like: that's sometimes easier for Git than getting the diff. (Because of the de-duplication trick, git diff can quickly skip over files that are the same, but it still has to look at two commits, not just one. So it's kind of mixed as to which is easier.)
Commit B, being a commit, has both snapshot and metadata, and points backwards to still-earlier commit A. But commit A is the first commit, so its metadata doesn't list any earlier commit. That means that all the files in its snapshot are new, by definition. (They'd be compressed and de-duplicated against any files in any other commit, but back then, it was the first commit, so they're only compressed and de-duplicated against themselves. This last means that if the first commit contains 100 identical copies of a big file, there's really only one copy in commit A.)
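Here is a rough sketch of both ideas, the sharing of unchanged files and the diff between two snapshots, in a scratch repository. The file names README and app.txt and the identity are invented for the example; the syntax commit:path asks Git for the hash ID of the stored file inside that commit.

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"    # illustrative identity only
git config user.email "user@example.com"

echo "unchanging text" > README
echo "version 1" > app.txt
git add README app.txt
git commit -q -m "B: first version"

echo "version 2" > app.txt             # only this file changes
git add app.txt
git commit -q -m "C: second version"

# An unchanged file is stored once: both snapshots name the same object.
readme_old=$(git rev-parse HEAD^:README)
readme_new=$(git rev-parse HEAD:README)
app_old=$(git rev-parse HEAD^:app.txt)
app_new=$(git rev-parse HEAD:app.txt)
echo "README:  $readme_old -> $readme_new"
echo "app.txt: $app_old -> $app_new"

# Comparing the two snapshots is what produces the diff.
changed=$(git diff --name-only HEAD^ HEAD)
echo "changed files: $changed"
```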
Branch names and other names
Git needs a fast way to find the last commit in some chain. Git could force us (the humans using Git) to write down the hash ID of the last commit, in this case C. We could save that on paper, or a whiteboard, or something. But that's silly: we have a computer. Why not have the computer save these hash IDs in a file or something? In fact, why not have Git save the most recent hash ID for us?
That's exactly what a branch name is: a place to save the hash ID of the latest commit. Git only needs the latest one, because the latest points back to the second-latest, which points back to a still-earlier one, and so on. This goes on as long as possible, ending only when there is no earlier commit, and that's how Git works: it starts from a commit we tell it about—usually by branch name—and works backwards.
Let's draw a simple chain of commits ending in hash ID H (for Hash), and have the branch name main point to (contain the hash ID of) H:
...--G--H <-- main
Now let's add a new name, like feature. This name has to point to some existing commit. We could pick G, or H, or some earlier commit, but it seems kind of natural to pick H as it's our latest:
...--G--H <-- feature, main
Note that Git has lots of kinds of names (not just branch names) and they all do this sort of thing, i.e., point to a commit. So we can make a tag that points to commit H, for instance:
...--G--H <-- feature, main, tag: v1.0
Mostly, though, we'll just use branch names, and that's all I'll show here for now.
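A quick sketch of the drawing above, in a scratch repository: three different names, all holding the same hash ID. The directory, the file, and the identity are placeholders, and git symbolic-ref is used here only to make sure the first branch is called main regardless of local defaults.

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main  # name the initial branch main
git config user.name "Example User"    # illustrative identity only
git config user.email "user@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "H"

git branch feature    # a new branch name pointing to the current commit
git tag v1.0          # a (lightweight) tag pointing to the same commit

main_id=$(git rev-parse main)
feature_id=$(git rev-parse feature)
tag_id=$(git rev-parse v1.0)
git log --oneline --decorate -1        # shows the one commit with all its names
```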
Doing work on a branch
Git has its own special features for letting us do work. The contents of a commit snapshot are, as we noted earlier, frozen for all time, and only readable by Git itself. So we can't actually work on / with these files, contained in the commit. We have to get Git to extract the files somewhere. That "somewhere" is our working tree or work-tree.
Git also has a very important thing, which Git gives three names: the index, the staging area, and sometimes the cache. We won't cover that here, except to note that when you run git commit, Git actually makes the new commit from the files in Git's index / the-staging-area, not from the files in your working tree. All the files to be committed must be in the staging area: these are the files that Git knows about. Extracting a commit copies the commit's files to the staging area, as well as to the working tree, so that they are there to start with.
In any case, once the files are in your working tree, they are just ordinary files on your computer. They aren't in Git any more. They came out of Git (out of a commit), and you can put them back into Git in a new commit later, but while you do your work, you work on and with files that are not in Git. Only the committed files are in Git.
You do your work with your working-tree files and run git add as usual. (This copies the working tree version of the files you list back into the index, so that they're ready to be committed. It's during the git add stage that Git does the initial compression and de-duplication. The files as seen in Git's index are pre-de-duplicated, in other words. This means the index's copies mostly take no space, except for any files you've changed-and-added. You can add an unchanged file: this is just a mild waste of time as Git will discover that it's a duplicate and just retain the original. It's a waste of cheap computer time, not valuable human time, so feel free to waste it! But if you know some file is enormous and that this will waste your time too, feel free to skip it.)
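To watch git add copy a file into the index, here is a small sketch. git ls-files --stage lists what the index currently holds, and git hash-object computes the hash ID a working-tree file would get. The file name and identity are invented for the example.

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git config user.name "Example User"    # illustrative identity only
git config user.email "user@example.com"

echo "timeout = 7" > settings.txt
git add settings.txt
git commit -q -m "initial"

echo "timeout = 14" > settings.txt     # edit the working-tree copy only
before=$(git ls-files --stage settings.txt | cut -d' ' -f2)

git add settings.txt                   # copy the updated file into the index
after=$(git ls-files --stage settings.txt | cut -d' ' -f2)
worktree=$(git hash-object settings.txt)

echo "index before add: $before"
echo "index after add:  $after"
echo "working tree:     $worktree"
```

Before the git add, the index still holds the committed version; afterwards it matches the working tree, and that is what the next git commit would freeze.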
In any case, now that your new commit is ready, you run git commit. This:
- gathers any necessary metadata, such as your name and email address and the current date and time;
- gets the hash ID of the current commit—the one you checked out to fill your working tree (and Git's index) earlier;
- freezes the index's snapshot; and
- writes all this out as a new commit, which gets a new, unique hash ID.
If you had:
...--G--H <-- feature, main
just a moment ago, then your current commit was H, so your new commit (which we'll call I) points back to H:
          I
         /
...--G--H
Git does, however, need to know which branch name you were using to find H. So one of those two names has the special name HEAD "attached to it". Let's say that this name was and still is feature. Then our drawing now looks like this:
          I   <-- feature (HEAD)
         /
...--G--H   <-- main
That is, Git used HEAD to find the name feature, first to find hash ID H, and now to write new hash ID I into feature.
The effect of this is that the current branch name, whatever it is, now points to the new commit you just made. (Note that the snapshot in I used the index / staging-area, which you updated to match your working tree, so all three match now, just like they did when you started with a "clean" checkout or git switch.) If you make another new commit with the usual modify-files-add-and-commit process, you get:
          I--J   <-- feature (HEAD)
         /
...--G--H   <-- main
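The drawing above can be reproduced with a short sketch like this one (scratch directory, file name, and identity are all placeholders). git symbolic-ref --short HEAD reports which branch name HEAD is attached to.

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main  # name the initial branch main
git config user.name "Example User"    # illustrative identity only
git config user.email "user@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "H"
h=$(git rev-parse HEAD)

git switch -q -c feature               # create feature, attach HEAD to it
echo "two" > file.txt
git add file.txt
git commit -q -m "I"
echo "three" > file.txt
git add file.txt
git commit -q -m "J"

current=$(git symbolic-ref --short HEAD)
echo "HEAD is attached to: $current"
git log --oneline --graph --decorate --all
```

Only feature moved: main still names commit H, which is now two steps behind the tip of feature.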
If you now git switch main or git checkout main, what Git does is:
- rip out all the commit-J files and replace them with the commit-H files; and
- attach the special name HEAD to main.
You now have:
          I--J   <-- feature
         /
...--G--H   <-- main (HEAD)
You are on branch main, as git status will say, and your working tree and staging area are "clean" (match the H commit), with your updated files safely saved forever (or for as long as the commit itself lasts) in commit J, which you can find using the name feature.
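As a sketch of what switching does (with a made-up file and identity): the working-tree file changes back to the version from H, HEAD moves to main, and the feature version is still retrievable from its commit with git show.

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main  # name the initial branch main
git config user.name "Example User"    # illustrative identity only
git config user.email "user@example.com"

echo "main version" > file.txt
git add file.txt
git commit -q -m "H"

git switch -q -c feature
echo "feature version" > file.txt
git add file.txt
git commit -q -m "J"

git switch -q main                     # swap in H's files, attach HEAD to main
on_main=$(cat file.txt)
branch=$(git symbolic-ref --short HEAD)
saved=$(git show feature:file.txt)     # the committed copy is still in J

git status --short --branch
echo "working tree now has: $on_main"
echo "still saved on feature: $saved"
```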
If you like, you can now create a new branch, such as feature2, and switch to it (using git branch and git switch, or the combined git switch -c to do it all at once):
          I--J   <-- feature
         /
...--G--H   <-- feature2 (HEAD), main
As you make new commits on this new branch, the branch name automatically updates to point to the latest commit:
          I--J   <-- feature
         /
...--G--H   <-- main
         \
          K--L   <-- feature2 (HEAD)
Note that commits up through and including H are, in Git's terms, on all three branches. Commits I-J are currently only on feature and commits K-L are only on feature2. Commit H is the latest commit on main, though it's not the latest commit ever (that's commit L in your repository, at this point). Moreover, there's no direct relationship between commits J and L: they're just cousins, as it were. They are children of children of a common grandparent, H.
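Git can answer these "which commits are on which branch" questions directly. This sketch rebuilds the drawing with empty commits (--allow-empty, purely to keep the example short) and then asks which branches contain H and what the common ancestor of the two feature tips is. All names other than the Git commands are illustrative.

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main  # name the initial branch main
git config user.name "Example User"    # illustrative identity only
git config user.email "user@example.com"

git commit -q --allow-empty -m "H"
h=$(git rev-parse HEAD)

git switch -q -c feature
git commit -q --allow-empty -m "I"
git commit -q --allow-empty -m "J"

git switch -q -c feature2 main         # start feature2 at main, i.e. at H
git commit -q --allow-empty -m "K"
git commit -q --allow-empty -m "L"

git branch --contains "$h"             # lists feature, feature2, and main
base=$(git merge-base feature feature2)
only_feature=$(git rev-list --count feature2..feature)
echo "common ancestor: $base"
echo "commits only on feature: $only_feature"
```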
Merging
To understand what's going to happen, we now need to look at the usual harder case for merging. Git has a shortcut for an easy case, but for various reasons (some good, some less good), GitHub in particular never uses this shortcut. The easy case is easier to see once you understand the more general case anyway.
In Git, using git merge is about combining work. Let's draw the two feature branches without drawing in the name main (it may still exist, it's just in the way of what I want to draw). Let's switch to branch feature first:
          I--J   <-- feature (HEAD)
         /
...--G--H
         \
          K--L   <-- feature2
Our current commit is now J, and we'll find J's files in our working tree right now. We now run git merge feature2, and git merge:
- locates commit J (easy: just read HEAD and then feature);
- locates commit L (also easy: feature2 contains the right hash ID);
- locates the best common starting point commit.
That last part can be hard, although here it's really easy to see that this is commit H: the grandfather of both J and L. If Git now compares the snapshot in H to the snapshot in J, Git will produce a recipe that contains all the work you did on feature:
git diff --find-renames <hash-of-H> <hash-of-J> # what "we" did
By running a second diff from H to L, Git will produce a recipe that contains all the work done on feature2:
git diff --find-renames <hash-of-H> <hash-of-L> # what "they" did
It doesn't really matter who did which work, at this point: the only things that matter are which files "we" changed, which ones "they" changed, and what changes we made to each of these files. The two git diffs figure this out.
If Git can combine these two sets of changes on its own, it can then apply the combined changes to the snapshot from H. However you like to look at it, this either preserves our changes and adds theirs, or adds together both changes, or whatever. The end result, Git assumes, is the correct snapshot to store in a new commit.
If Git can't combine these changes on its own, Git will stop in the middle of the merge with a merge conflict. The programmer must now come up with the correct result. We'll skip right over this part. We'll just assume that Git came up with the right result all on its own. In that case git merge goes on to run git commit for you.
Normally, the resulting commit M would have commit J as its parent. Our new merge commit does in fact have J as a parent (the first parent) but also has commit L, the commit we named on the git merge command line, as its second parent, like this:
          I--J
         /    \
...--G--H      M   <-- feature (HEAD)
         \    /
          K--L   <-- feature2
The name feature, to which HEAD is attached, moves as usual to point to new commit M. But since M points backwards to both J and L, commits K-L are now also "on" branch feature. This means all commits up through M are on feature, while feature2 still ends at L and does not contain commits I-J.
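Here is a sketch of that whole merge in a scratch repository, using one commit per branch to keep it short. The files a.txt and b.txt and their contents are invented; "we" change one file and "they" change the other, so Git can combine the work without a conflict. HEAD^1 and HEAD^2 name the first and second parents of the merge commit.

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main  # name the initial branch main
git config user.name "Example User"    # illustrative identity only
git config user.email "user@example.com"

echo "base" > a.txt
echo "base" > b.txt
git add a.txt b.txt
git commit -q -m "H"

git switch -q -c feature
echo "ours" > a.txt
git commit -q -am "J: change a.txt"

git switch -q -c feature2 main
echo "theirs" > b.txt
git commit -q -am "L: change b.txt"

git switch -q feature
j=$(git rev-parse feature)
l=$(git rev-parse feature2)
base=$(git merge-base feature feature2)
we_changed=$(git diff --name-only "$base" feature)      # what "we" did
they_changed=$(git diff --name-only "$base" feature2)   # what "they" did

git merge -q -m "M: merge feature2" feature2            # makes merge commit M
first=$(git rev-parse HEAD^1)
second=$(git rev-parse HEAD^2)
git log --oneline --graph --decorate --all
```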
We can, if we want, delete the name feature2 now: it's only useful to find L directly, and if we don't feel the need to find L directly, we can find it by looking at the second parent of M, whenever we care. If we'd like to add more commits to feature2 now, we should hang on to the name and do that:
          I--J
         /    \
...--G--H      M   <-- feature
         \    /
          K--L--N--O   <-- feature2 (HEAD)
We can now merge feature2 into feature again if we like:
          I--J
         /    \
...--G--H      M-----P   <-- feature (HEAD)
         \    /     /
          K--L--N--O   <-- feature2
making a sort of duck's head picture, though we could redraw this without the lump along the top row too:
...--G--H--I--J--M-----P   <-- feature (HEAD)
         \      /     /
          K----L--N--O   <-- feature2
(not sure what this one looks like).
Fast-forwarding
The special short-cut case Git has for git merge applies in cases like this one:
...--D--E   <-- main (HEAD)
         \
          F--G   <-- bugfix
If we run git merge bugfix, Git will locate commits E and G, and then find the merge base of E and G: the best commit that's on both branches. But that's commit E itself, i.e., the current commit.
Git could go ahead and diff E against itself, to find no changes. Then it could diff E against G to find their changes. Then it would apply those changes to E and come up with a new commit H, and give it two parents:
...--D--E------H   <-- main (HEAD)
         \    /
          F--G   <-- bugfix
Commit H would be a merge commit, with two parents, just like the "real merge" case. But obviously diffing E against itself is silly, and adding their changes just gets us a commit H whose snapshot exactly matches the snapshot in their commit G. So Git will, for this case, not bother merging at all unless we tell it to.
Instead, Git will do what it calls a fast-forward merge. What that means is that Git simply checks out commit G directly, while dragging the current branch name forward:
...--D--E
         \
          F--G   <-- bugfix, main (HEAD)
There's now no reason to draw the kink in the graph at all:
...--D--E--F--G <-- bugfix, main (HEAD)
and deleting the name bugfix is obviously safe enough, though presumably main will advance further later.
To suppress the fast-forward-instead-of-merge thing, we would run git merge --no-ff. GitHub effectively always does this, so you won't see fast-forward merges occur on GitHub; but it's good to know about them.
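To compare the two behaviours side by side, this sketch merges the same bugfix branch twice: once into main (fast-forward) and once, with --no-ff, into a second name that also starts at E. That second name, no-ff-demo, is purely hypothetical and exists only so both results can be inspected; the file and identity are made up too.

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main  # name the initial branch main
git config user.name "Example User"    # illustrative identity only
git config user.email "user@example.com"

echo "base" > file.txt
git add file.txt
git commit -q -m "E"

git switch -q -c bugfix
echo "fix, part 1" > file.txt
git commit -q -am "F"
echo "fix, part 2" > file.txt
git commit -q -am "G"
g=$(git rev-parse bugfix)

git switch -q main
git branch no-ff-demo                  # second name at E, for comparison only
git merge -q bugfix                    # fast-forward: main just moves to G
ff_tip=$(git rev-parse main)

git switch -q no-ff-demo
git merge -q --no-ff -m "H: merge bugfix" bugfix   # force a real merge commit
noff_tip=$(git rev-parse no-ff-demo)
noff_second=$(git rev-parse no-ff-demo^2)
git log --oneline --graph --decorate --all
```

After the fast-forward, main and bugfix hold the same hash ID. The forced merge commit is a different commit, but its snapshot is the same as the one in G.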
When to delete a name
When and whether to delete the other branch name is up to the user. Note that deleting the name does not delete the commits: it only makes it harder to find them. But there is another thing to know. Suppose we have:
...--G--H   <-- main
         \
          I--J   <-- bugfix (HEAD)
where commits I and J simply don't actually work. You'll run:
git switch main
git branch -d --force bugfix
to discard your attempt to fix the bug. This leaves you with:
...--G--H   <-- main
         \
          I--J   ???
Commits I-J still exist, but unless you wrote down J's hash ID, you may never be able to find commit J again.
Git will (eventually) detect that commit J is unreachable (that there's no way for you to find it) and will delete it for real. The same goes for commit I once J is gone. You get a grace period, normally at least 30 days, during which Git won't do this, and various Git commands to help find accidentally-lost commits. But if you don't bother finding them and adding a name back, the "reflog entries" by which Git keeps track of "lost" commits like this eventually expire, and then, when Git gets around to doing its maintenance and janitorial work, the "lost" commits will really go away from this repository. So, while commits are read-only, they are only "mostly permanent". They remain in your repository as long as you can find them (and then a little bit longer).
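As a sketch of that grace period: after deleting the name, the commit object is still there, no branch contains it, and the reflog for HEAD still remembers it. The name rescued is a hypothetical branch name used here to bring the commit back; everything else about the scratch repository is illustrative too.

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main  # name the initial branch main
git config user.name "Example User"    # illustrative identity only
git config user.email "user@example.com"

git commit -q --allow-empty -m "H"
git switch -q -c bugfix
git commit -q --allow-empty -m "I"
git commit -q --allow-empty -m "J"
j=$(git rev-parse HEAD)

git switch -q main
git branch -q -d --force bugfix        # the name is gone, the commits are not

kind=$(git cat-file -t "$j")           # the object still exists
names=$(git branch --contains "$j")    # but no branch name finds it
found=$(git rev-parse 'HEAD@{1}')      # HEAD's reflog: where HEAD was before the switch
git reflog

git branch rescued "$j"                # add a name back to keep the commits
echo "object type: $kind; recovered via reflog: $found"
```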
Clones, remotes, and multiple repositories
Git is not just a Version Control System (VCS); it's a Distributed VCS (DVCS). The way Git does this distribution is to allow for—or rather, strongly encourage—many copies of a repository to exist. As such, a Git repository is:
- a collection of commits and other Git objects, some or all of which may be in other repositories too; and
- a collection of names, such as branch and tag names, that help you (and Git) find the commits and other internal objects.
These are stored as two simple key-value databases. The keys in the names database are branch names like refs/heads/main, tag names like refs/tags/v1.2, and many other kinds of names. Each name lives in a namespace under refs/. Each name stores exactly one hash ID.
The keys in the objects database are hash IDs. Each object in this database has some Git internal object type (commit, tree, blob, or annotated tag). The commit objects, along with supporting tree and blob objects, wind up storing your files; and you will mostly just work with the commits and don't normally have to care much at all about these details.
Since commit hash IDs are globally unique, the object database keys in your clone of some repository are the same as the keys in every other clone of that same repository. When you clone a repository, you get all, or almost all, of their commits and supporting objects. But the names database in your clone is entirely separate from theirs.
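You can dump the names database with git for-each-ref. This sketch, in a scratch repository with invented names, prints each full name next to the one hash ID it stores.

```shell
set -e
repo=$(mktemp -d)                      # hypothetical scratch repository
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main  # name the initial branch main
git config user.name "Example User"    # illustrative identity only
git config user.email "user@example.com"

git commit -q --allow-empty -m "first"
git branch feature
git tag v1.2

# One line per name: the full name under refs/, then the hash ID it stores.
git for-each-ref --format='%(refname) %(objectname)'
refs=$(git for-each-ref --format='%(refname)')
```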
What this means is that a clone of a repository starts out with no branch names at all. You run:
git clone <url>
or:
git clone -b <branch> <url>
and your Git software creates a new, totally-empty Git repository to start. Your Git software, using your Git repository (I like to shorten this to "your Git") calls up their Git software and points it to their Git repository ("their Git"). Their Git lists out all their branch and tag and other names and the hash IDs that go with them, and your Git then asks for the objects it would like to copy (normally, all of them). For each commit you're going to get, their Git is obligated to offer all of that commit's parents, and the parents' parents, and so on. So you end up copying every commit into your Git.
Now that you have all the commits (and supporting objects), your Git takes each of their branch names and renames them. This renaming process makes use of the concept of a "remote".
A remote, in Git, is just a short name that stores at least a URL (you can have it store various extra features later). The URL is the one you type into git clone, and the name of the first "remote" is always origin.4 So origin from now on means the URL I cloned from, unless and until you change something.
Git uses this name (the origin string) to make up new names for their branch names. Their main becomes your origin/main; their debug becomes your origin/debug; if they have a feature/tall, you get an origin/feature/tall; and so on. These names are not actually branch names; I like to call them remote-tracking names.5 Their function is to remember, for your Git repository, what their branch names are, and what commit each of those names selected, the last time your Git got an update from their Git.
Once this renaming is done, your Git has created remote-tracking names for every branch name they have. You have all of their commits, and can find all of them because your remote-tracking names hold the same hash IDs as their branch names, that they're using to find their commits.
Now, shortly before your git clone finishes and returns control to you so that you can begin working, your Git:
- Creates one new branch name in your repository, from the -b argument you gave: if you said -b bugfix, your Git finds your origin/bugfix which corresponds to their bugfix and creates your own bugfix, pointing to the same commit.
- Checks out (switches to) this new branch.
So now your clone has one branch in it, matching one of their branches. If you don't use -b, your Git asks their Git what name they recommend. The usual standard recommendation is their main branch (now normally main; in the past this was master).
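Here is a sketch of the whole cloning process using two local scratch directories instead of a real URL. The directory names theirs and mine, and the branches in "their" repository, are stand-ins made up for this example; a real clone from GitHub works the same way.

```shell
set -e
work=$(mktemp -d)                      # hypothetical scratch area

# "Their" repository, with three branch names.
git init -q "$work/theirs"
cd "$work/theirs"
git symbolic-ref HEAD refs/heads/main  # name the initial branch main
git config user.name "Example User"    # illustrative identity only
git config user.email "user@example.com"
git commit -q --allow-empty -m "first"
git branch debug
git branch feature/tall

# "Your" clone: one branch of your own, plus remote-tracking names.
cd "$work"
git clone -q "$work/theirs" mine
cd mine
local_branches=$(git for-each-ref --format='%(refname:short)' refs/heads)
echo "your branches: $local_branches"
git branch -r                          # origin/main, origin/debug, ...
git remote get-url origin              # the URL stored under the name origin
```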
Once you have a clone, you can add more remotes, using git remote add. This needs a name for the remote, and a URL; it sets up the remote but does not yet run git fetch. It's time now to talk about fetching and pushing; see the other answer.
4You can choose some other name, but there's almost never any point to doing so. Use origin as the name of the "main remote". You can rename a remote at any point, so even if you don't intend to keep the starting URL, it works fine to let git clone default to origin here.
5Git calls them remote-tracking branch names, beating the poor overloaded word branch from bloody, misshapen beast to barely-recognizable-splotch. Seriously, just drop the word branch here, it doesn't help any.