OK, assuming you're good with graph theory and have read through Git for Computer Scientists:
- All graph nodes are identified by hash ID. (The hashes are currently SHA-1s over the object contents prefixed by type-plus-size-plus-a-NUL-byte, not that this matters too much except for correctness.)
- Except for so-called shallow clones, which we'll ignore here, every Git repository has every reachable commit and every reachable object underneath that commit.
- Commits become reachable by having a branch name, or tag name, or any other externally-sourced name, from a second database of reference names. Branch names must point to commit objects. (Other names, especially tag names, can point to other object types, but again that's not important for our purposes here.) Branch names are the most interesting case here since that's how we'll build commits and transfer them from one repository to another—but there's a second kind of name, the remote-tracking name, that is also key.
Hence we can draw the commit graph like this for a simple linear case:
A <-B <-C <--master
The external name master contains the hash ID of commit C. This commit object itself contains the hash ID of commit B; B contains the hash ID of A. A has no outgoing arcs (no parents, as Git puts it), so it is a root commit and the action all stops here.
- All objects are read-only at all times. Unreachable objects can be garbage-collected.
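As a small illustration of the hashing mentioned above, here is a sketch in Python that computes an object ID the way the first bullet describes: SHA-1 over a type-plus-size-plus-NUL header followed by the contents. The function name git_object_id is my own invention for this example; only blob contents are shown, since commit and tree contents have their own internal formats.

```python
import hashlib

def git_object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 over "<type> <size>" + NUL byte, followed by the raw content."""
    header = f"{obj_type} {len(content)}".encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Two blobs: an empty file, and a file containing "hello" plus a newline.
empty_blob = git_object_id("blob", b"")
hello_blob = git_object_id("blob", b"hello\n")
print(empty_blob)
print(hello_blob)
```

Because the ID is computed from the contents, changing even one byte gives a different ID, which is why objects have to be read-only.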
Note that no internal object stores the hash ID of C, so C is reachable, and hence the whole chain is retained, only because of the external name refs/heads/master (branch master).
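Reachability is easy to model. The following is a toy sketch, not Git's actual implementation: commits are single letters rather than hash IDs, the parents dictionary stands in for the object database, and commit X is a made-up stray commit that no name leads to.

```python
# Toy object database: each commit maps to the list of its parents.
parents = {"A": [], "B": ["A"], "C": ["B"], "X": ["A"]}

# Toy name database: reference name -> commit.
refs = {"refs/heads/master": "C"}

def reachable(refs, parents):
    """Walk backwards from every name, collecting each commit we can get to."""
    seen = set()
    stack = list(refs.values())
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen

live = reachable(refs, parents)
garbage = set(parents) - live  # candidates for garbage collection
print(sorted(live), sorted(garbage))
```

Delete the one name from refs and everything becomes garbage, which is exactly why the external names matter.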
If we add a new branch name, such as dev, we get:
A--B--C <-- master, dev
(the internal arrows in the graph are all still backwards because of the read-only nature of the commit objects, but this gets too painful to draw in text, so I don't bother). Now Git needs a way to know which name to adjust when making new commits, so it attaches the label HEAD to some branch name. HEAD can only be attached to a branch name! Let's draw that in:
A--B--C <-- master, dev (HEAD)
To make a new commit, Git packages up all the blob hashes stored in the index (which we've skipped over here) into a new tree object and then creates a new commit object pointing to the tree, as discussed in the "Git for CSists" link above. The new commit will point back to whichever commit HEAD indirectly points to:
A--B--C
       \
        D
and then Git will just overwrite whichever name HEAD is attached to, so that the names are:
A--B--C <-- master
       \
        D <-- dev (HEAD)
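The name-updating step can be sketched in a few lines of Python. This is a toy model under my own assumptions: the Repo class and its commit method are hypothetical, commit IDs are letters, and the index and tree objects are left out entirely.

```python
class Repo:
    def __init__(self):
        # Commit -> parents, and name -> commit, as in the drawings.
        self.parents = {"A": [], "B": ["A"], "C": ["B"]}
        self.refs = {"refs/heads/master": "C", "refs/heads/dev": "C"}
        # HEAD holds a branch *name*, not a hash ID.
        self.head = "refs/heads/dev"

    def commit(self, new_id):
        """Create new_id with the current tip as parent, then move the branch."""
        tip = self.refs[self.head]       # the commit HEAD indirectly points to
        self.parents[new_id] = [tip]     # new commit points back to it
        self.refs[self.head] = new_id    # overwrite only the attached name

repo = Repo()
repo.commit("D")
print(repo.refs)
```

Only dev moves; master still points to C because HEAD was not attached to it.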
Push and fetch transfer objects, then set names
We now have almost everything we need to understand both git push and git fetch. Let's look at push first since you're more concerned with it.
Your Git will call up some other Git and hand over a hash ID, such as that for your new commit D. The other Git has no name for this ID yet; it just checks to see if it has the hash ID. If not, it needs the commit (and probably the tree and blob objects as well), so it says "send me those objects". Your Git packs them up and sends them over. They put the commit D and its sub-objects into their graph, but as yet they have no name for this object.
Now your Git sends a name, such as refs/heads/dev. Their Git now looks to see if they can set this name. There are two cases:
- They don't have a refs/heads/dev branch: it's pretty safe to just create it, so they probably will. (You can set up fancy rules on the receiving side about what to allow or refuse, hence we can only say "probably" here.)
- They do have a refs/heads/dev: they'll check to see if changing it from whatever hash it has now, to point to commit D, will keep all reachable commit objects still reachable. That's easy to do: is the commit to which their dev points now an ancestor of D? If so, the push is OK. If not, the push gets rejected as a non-fast-forward.
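The two cases on the receiving side can be sketched like this. It is a toy model, not Git's real code: is_ancestor and receive_push are hypothetical names, commits are letters, and any "fancy rules" the receiver might have are ignored.

```python
def is_ancestor(parents, old, new):
    """True if old can be reached by walking parent links back from new."""
    stack, seen = [new], set()
    while stack:
        commit = stack.pop()
        if commit == old:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return False

def receive_push(refs, parents, name, new):
    """Decide whether the receiving side may set name to new."""
    old = refs.get(name)
    if old is not None and not is_ancestor(parents, old, new):
        return "rejected (non-fast-forward)"
    refs[name] = new
    return "created" if old is None else "fast-forward"

# Their graph after receiving the objects: D builds on C, E builds on B.
parents = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["B"]}
their_refs = {"refs/heads/dev": "C"}

first = receive_push(their_refs, parents, "refs/heads/dev", "D")
second = receive_push(their_refs, parents, "refs/heads/dev", "E")
third = receive_push(their_refs, parents, "refs/heads/topic", "E")
print(first, second, third, sep="\n")
```

The second push is refused because moving dev from D to E would leave C and D with no name reaching them.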
Using git fetch is almost, but not quite, symmetric. When you git fetch from some other repository, your Git has their Git list all their branch names and hash IDs, by default. Your Git then asks for every commit they have that you don't, along with any of those commits' history that you are also missing. At the end, though, instead of creating or adjusting local branch names, your Git sets up remote-tracking names such as origin/master and origin/dev.
(Technically these are refs/remotes/origin/master and so on, in a name space separate from branch names, so that there's no chance of collision. In practice, as long as you don't name your own branches origin/whatever, that's not a problem anyway.)
The last bit that's the most confusing initially
If your repository is initially created by cloning (which internally does a git fetch) some existing repository, you start out with:
...--o--...--o <-- origin/master
      \
       o--...--o <-- origin/dev
and the like. These remote-tracking names make all the commits reachable, so that they're not all garbage collected. But then where does your local branch master come from?
The trick is this: git checkout will create a new local branch name by reversing the renaming that git fetch did with the other Git's branch names. We know that our origin/master must have been their master, so git checkout master, when we don't have a master, will search for origin/master. If that exists, Git just creates a new label, in the branch name space, giving us:
...--o--...--o <-- master (HEAD), origin/master
      \
       o--...--o <-- origin/dev
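Here is a rough Python sketch of that lookup. It is simplified on purpose: the checkout function is hypothetical, it assumes a single remote named origin, and it only models the name handling, not the updating of the work-tree.

```python
def checkout(refs, name, remote="origin"):
    """Return the branch HEAD should attach to, creating it if needed."""
    branch = f"refs/heads/{name}"
    if branch not in refs:
        # No local branch: undo the fetch renaming to find their branch.
        tracking = f"refs/remotes/{remote}/{name}"
        if tracking not in refs:
            raise KeyError(f"no branch or remote-tracking name for {name!r}")
        refs[branch] = refs[tracking]  # new label on the same commit
    return branch

# Fresh clone: only remote-tracking names exist (commit IDs are made up).
refs = {"refs/remotes/origin/master": "M", "refs/remotes/origin/dev": "V"}
head = checkout(refs, "master")
print(head, refs[head])
```

No commits are created or copied here; the new name simply points to a commit that was already in the repository.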
To create your own new labels, use git branch <name> <hash-id-or-other-commit-specifier>, which creates the refs/heads/<name> name pointing to the given commit. You can also use git checkout -b <name> <specifier>, which creates the name and immediately does a git checkout of the new name to attach HEAD to it.
Cleanup
There's one more important bit to know, often not covered right away because this graph stuff overwhelms everyone. This is refspec syntax. Both git push and git fetch use refspecs, although they treat them differently.
A refspec is mostly just a pair of reference names separated by a colon. The name on the left is the source name and the name on the right is the destination name. There's also an optional leading plus sign, which means "force": update the name even if the update is not a fast-forward.
When you use git fetch origin, the default refspec is:
+refs/heads/*:refs/remotes/origin/*
This means that your Git matches the other Git's branch names (refs/heads/*), but turns all those names into your own remote-tracking names (under refs/remotes/ and furthermore under origin/, which leaves room for additional remotes).
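The renaming that a refspec performs can be sketched as below. This is an illustration only: parse_refspec and map_ref are hypothetical helpers, they assume the refspec has both a source and a destination, and they support at most one * on each side.

```python
def parse_refspec(spec):
    """Split a refspec into (force, source, destination)."""
    force = spec.startswith("+")
    if force:
        spec = spec[1:]
    src, _, dst = spec.partition(":")
    return force, src, dst

def map_ref(spec, name):
    """Return the destination name for name, or None if it does not match."""
    _, src, dst = parse_refspec(spec)
    if "*" not in src:
        return dst if name == src else None
    prefix, suffix = src.split("*", 1)
    if len(name) < len(prefix) + len(suffix):
        return None
    if not (name.startswith(prefix) and name.endswith(suffix)):
        return None
    stem = name[len(prefix):len(name) - len(suffix)]  # what the * matched
    return dst.replace("*", stem, 1)

default = "+refs/heads/*:refs/remotes/origin/*"
print(parse_refspec(default))
print(map_ref(default, "refs/heads/master"))
print(map_ref(default, "refs/tags/v1"))
```

Their tag names do not match the source pattern, so this particular refspec leaves them alone.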
If you omit the destination name in a git fetch command, Git doesn't write any names. This leaves the fetched commits subject to garbage collection, but there's a delay, because Git first does record the hash IDs into .git/FETCH_HEAD, where you can retrieve them and where they act as temporary retainers. The FETCH_HEAD contents get overwritten by the next fetch, so they are not as good as real names in the name-to-ID database.
For git push, however, the default is generally to push a branch name to the same name on the other Git. That is, git push origin master really means git push origin master:master (and Git fills in the refs/heads/ part when it discovers that master is a branch name). For more on how Git looks up these names, see the gitrevisions documentation.
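As a rough sketch of that expansion, the following Python treats the push argument as a refspec and fills in the missing pieces. The expand_push_arg function is hypothetical, and the rule it uses (qualify both sides with refs/heads/ when the source is a local branch) is a simplification of what Git really does.

```python
def expand_push_arg(arg, local_refs):
    """Expand a short push argument into a full source:destination refspec."""
    src, sep, dst = arg.partition(":")
    if not sep:
        dst = src  # no colon: push to the same name on the other Git
    if not src.startswith("refs/") and f"refs/heads/{src}" in local_refs:
        # The source turned out to be a branch name, so qualify both sides.
        src = f"refs/heads/{src}"
        if not dst.startswith("refs/"):
            dst = f"refs/heads/{dst}"
    return f"{src}:{dst}"

# Commit IDs here are made-up placeholders.
local_refs = {"refs/heads/master": "C", "refs/heads/dev": "D"}
short = expand_push_arg("master", local_refs)
renamed = expand_push_arg("dev:review", local_refs)
print(short)
print(renamed)
```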