Git is different from most other version control systems (VCS).
Most VCS-es store "deltas" of various forms. For instance, if the tip-most commit in the entire repository is C9, as identified by master, and you extract that, you might get all the files in the repository as is. If you instead extract C5 (an earlier commit than C9), you'd start with all the latest files, and then C5 says "undo this, undo that, undo the other thing"; the version-control system undoes those changes, and that gets you the state as of commit C5.
Again, git does not do this.
Instead, git's repository stores what git calls "objects". There are four types of objects: "commits", "annotated tags", "trees", and "blobs". We'll ignore annotated tags (they are not needed for this purpose) and just consider the other three.
Each object has a unique 160-bit name, represented as an SHA-1 hash. The hash is computed over the object's contents, prefixed by a short header giving the object's type and size. Git assumes that no two different objects in the repository will ever compute the same SHA-1 (if they do, git explodes messily; but this has never happened). (But note that the same object, e.g., the same foo.c file appearing in many commits, has one single unique SHA-1.)
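As a rough illustration of how such a name is derived, here is a minimal Python sketch. The function name git_object_id is made up for this example; the header layout it assumes is the object type, a space, the content length in bytes, and a NUL byte, followed by the raw contents.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Return the hex SHA-1 name for an object of the given type and contents.

    git_object_id is a hypothetical helper name used only in this sketch.
    """
    # The hashed data is a small header ("<type> <size>" plus a NUL byte)
    # followed by the object's raw contents.
    header = f"{obj_type} {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


if __name__ == "__main__":
    # Identical contents always produce the identical name,
    # no matter which commit or path they appear under.
    print(git_object_id("blob", b"hello world\n"))
    print(git_object_id("blob", b""))
```

If the assumptions hold, the first value should match what `git hash-object` reports for a file containing the line "hello world", and the second is the name of the empty blob.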
A commit object looks like this:
$ git cat-file -p 5f95c9f850b19b368c43ae399cc831b17a26a5ac
tree 972825cf23ba10bc49e81289f628e06ad44044ff
parent 9c8ce7397bac108f83d77dfd96786edb28937511
author Junio C Hamano <gitster@pobox.com> 1392406504 -0800
committer Junio C Hamano <gitster@pobox.com> 1392406504 -0800

Git 1.9.0

Signed-off-by: Junio C Hamano <gitster@pobox.com>
That is, it has a tree, a list of parents, an author-and-date, a committer-and-date, and a text message. That's all it has, too. Each parent line gives the SHA-1 of one parent commit; a root commit has no parents, and a merge has multiple parents, but most commits have just one parent, which is what gives you the arrows in the diagram you posted.
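To make that layout concrete, here is a small Python sketch that splits the text of a commit object into its fields. The names parse_commit and SAMPLE are invented for this example, and the sketch assumes the header lines are separated from the message by the first blank line. It deliberately ignores any header other than tree, parent, author, and committer.

```python
def parse_commit(text: str) -> dict:
    """Split a commit object's text into header fields and the message.

    parse_commit is a hypothetical helper, not part of git.
    """
    # Assumption: headers come first, then a blank line, then the message.
    head, _, message = text.partition("\n\n")
    commit = {
        "tree": None,
        "parents": [],  # zero for a root commit, two or more for a merge
        "author": None,
        "committer": None,
        "message": message,
    }
    for line in head.splitlines():
        key, _, value = line.partition(" ")
        if key == "parent":
            commit["parents"].append(value)
        elif key in ("tree", "author", "committer"):
            commit[key] = value
        # Any other header is skipped in this simplified sketch.
    return commit


# The commit shown above, written out as a string for the example.
SAMPLE = (
    "tree 972825cf23ba10bc49e81289f628e06ad44044ff\n"
    "parent 9c8ce7397bac108f83d77dfd96786edb28937511\n"
    "author Junio C Hamano <gitster@pobox.com> 1392406504 -0800\n"
    "committer Junio C Hamano <gitster@pobox.com> 1392406504 -0800\n"
    "\n"
    "Git 1.9.0\n"
    "\n"
    "Signed-off-by: Junio C Hamano <gitster@pobox.com>\n"
)

if __name__ == "__main__":
    info = parse_commit(SAMPLE)
    print(info["tree"], info["parents"])
```

Following the parents list from commit to commit is all it takes to walk the history backwards.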
A tree object looks like this:
$ git cat-file -p 972825cf23ba10bc49e81289f628e06ad44044ff
100644 blob 5e98806c6cc246acef5f539ae191710a0c06ad3f .gitattributes
100644 blob b5f9defed37c43b2c6075d7065c8cbae2b1797e1 .gitignore
100644 blob 11057cbcdf4c9f814189bdbf0a17980825da194c .mailmap
100644 blob 536e55524db72bd2acf175208aef4f3dfc148d42 COPYING
040000 tree 47fca99809b19aeac94aed024d64e6e6d759207d Documentation
100755 blob 2b97352dd3b113b46bbd53248315ab91f0a9356b GIT-VERSION-GEN
[snip lots more]
The tree gives you the top-level directory that goes with that commit. Most tree entries are blobs; subdirectories are more trees. The mode of a blob gives you the executable bit (these look like Unix file modes, but git really uses only the one executable bit, so the mode is always 100644 or 100755). There are a few more modes for special cases (e.g., symlinks) but we can ignore them for now. In any case, each entry has yet another unique SHA-1, which is how git finds the next item (sub-tree or blob).
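Here is a short Python sketch of how a listing like the one above could be read into structured entries. TreeEntry, parse_tree_listing, and SAMPLE_TREE are names made up for this example. It works on the pretty-printed text, not on git's binary tree format, and it handles only the blob and tree entries discussed here.

```python
from typing import List, NamedTuple


class TreeEntry(NamedTuple):
    """One line of a pretty-printed tree (hypothetical type for this sketch)."""
    mode: str
    kind: str  # "blob" or "tree"
    sha: str
    name: str

    @property
    def is_executable(self) -> bool:
        # For blobs, the only mode difference that matters is the executable bit.
        return self.kind == "blob" and self.mode == "100755"


def parse_tree_listing(text: str) -> List[TreeEntry]:
    """Parse lines of the form '<mode> <type> <sha> <name>'."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        # Split into at most four fields so names containing spaces survive.
        mode, kind, sha, name = line.split(None, 3)
        entries.append(TreeEntry(mode, kind, sha, name))
    return entries


# A few of the entries shown above.
SAMPLE_TREE = """\
100644 blob 5e98806c6cc246acef5f539ae191710a0c06ad3f .gitattributes
100644 blob 536e55524db72bd2acf175208aef4f3dfc148d42 COPYING
040000 tree 47fca99809b19aeac94aed024d64e6e6d759207d Documentation
100755 blob 2b97352dd3b113b46bbd53248315ab91f0a9356b GIT-VERSION-GEN
"""

if __name__ == "__main__":
    for entry in parse_tree_listing(SAMPLE_TREE):
        print(entry.kind, entry.name, entry.is_executable)
```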
Each blob object contains the actual file. For instance, the blob for GIT-VERSION-GEN is the git version generator script:
$ git cat-file -p 2b97352dd3b113b46bbd53248315ab91f0a9356b
#!/bin/sh
GVF=GIT-VERSION-FILE
DEF_VER=v1.9.0
[snip]
So, to extract a commit, git needs only to:
- translate a symbolic name like HEAD or master to the commit's SHA-1
- extract the commit object to find the top-level tree
- extract the top-level tree object to find all the files and sub-trees
- for each file, extract the file object; and for each sub-tree, recursively extract that tree and its objects.
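The steps above can be sketched with a toy in-memory object store in Python. Everything here is hypothetical: put, checkout_tree, checkout, and demo are names invented for this example, the store is a plain dictionary, and the keys come from a simplified hash rather than git's real object encoding. The point is only the shape of the lookup: name to commit, commit to tree, tree to blobs and sub-trees.

```python
import hashlib


def put(store: dict, obj_type: str, payload) -> str:
    """Save an object under a content-derived key (toy hashing, not git's format)."""
    key = hashlib.sha1(f"{obj_type}:{payload!r}".encode()).hexdigest()
    store[key] = (obj_type, payload)
    return key


def checkout_tree(store: dict, tree_sha: str, prefix: str = "") -> dict:
    """Recursively expand a tree into a {path: file contents} mapping."""
    obj_type, entries = store[tree_sha]
    if obj_type != "tree":
        raise ValueError(f"{tree_sha} is a {obj_type}, not a tree")
    files = {}
    for name, sha in entries:
        kind, payload = store[sha]
        if kind == "tree":
            # Sub-directory: recurse with a longer path prefix.
            files.update(checkout_tree(store, sha, prefix + name + "/"))
        else:
            # Blob: the payload is the file's contents.
            files[prefix + name] = payload
    return files


def checkout(store: dict, refs: dict, name: str) -> dict:
    """Resolve a name to a commit, then extract that commit's files."""
    commit_sha = refs.get(name, name)  # symbolic name -> commit key
    kind, commit = store[commit_sha]   # commit object -> top-level tree
    if kind != "commit":
        raise ValueError(f"{name} does not name a commit")
    return checkout_tree(store, commit["tree"])


def demo() -> dict:
    """Build a two-file repository and check out its only commit."""
    store = {}
    readme = put(store, "blob", b"hello\n")
    main_c = put(store, "blob", b"int main(void) { return 0; }\n")
    src = put(store, "tree", (("main.c", main_c),))
    root = put(store, "tree", (("README", readme), ("src", src)))
    commit = put(store, "commit", {"tree": root, "parents": []})
    return checkout(store, {"master": commit}, "master")


if __name__ == "__main__":
    for path, data in sorted(demo().items()):
        print(path, len(data))
```

Note that nothing in checkout looks at parent commits: extracting a commit never requires replaying history.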
(Git objects are stored compressed, and are eventually further compressed into "pack files", which do use deltas, but in a very different way from other VCS-es. There's no need to delta-compress a file foo.c against a previous version of foo.c; git can delta-compress trees against each other, for instance, or some C code against some documentation. The exact pack file format has undergone several revisions as well: if some future version has an even better way to compress things, the pack format can be updated from version 4 to version 5, for instance. In any case, "loose" objects are just zlib-compressed rather than delta-compressed. This makes accessing and updating them quite fast. Pack files are used for more-static items, that is, files that have not been modified, and for network transmission. They are built during git gc, and also on push and fetch operations [which use a variant called a "thin" pack, when possible].)
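As a rough picture of what "just zlib-compressed" means for a loose object, here is a Python sketch of the round trip. The names encode_loose and decode_loose are invented for this example. It assumes the stored bytes are the type-and-size header plus contents, compressed with zlib, and that the file lives under .git/objects in a directory named after the first two hex digits of the SHA-1. Nothing here touches a real repository.

```python
import hashlib
import zlib


def encode_loose(obj_type: str, body: bytes):
    """Return (sha, path, compressed bytes) for a loose object.

    Hypothetical helper: it only computes values, it writes nothing to disk.
    """
    raw = f"{obj_type} {len(body)}".encode("ascii") + b"\x00" + body
    sha = hashlib.sha1(raw).hexdigest()
    # Assumed layout: two hex digits for the directory, the rest for the file.
    path = f".git/objects/{sha[:2]}/{sha[2:]}"
    return sha, path, zlib.compress(raw)


def decode_loose(data: bytes):
    """Inflate a loose object and split it back into (type, contents)."""
    raw = zlib.decompress(data)
    header, _, body = raw.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match contents")
    return obj_type, body


if __name__ == "__main__":
    sha, path, data = encode_loose("blob", b"hello world\n")
    print(path)
    print(decode_loose(data))
```

Reading an object this way is a single decompress with no delta chain to follow, which is one reason loose objects are cheap to access.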
For more of the git "plumbing" commands that allow you to read and write individual objects, see the Pro Git book (I was reminded of it by gatkin's answer).