So, I’m going to expand on the topic a bit and explain what Git stores and how. That shows which information is kept and what actually matters for the size of the repository. Fair warning: this answer is rather long :)
Git objects
Git is essentially a database of objects. Those objects come in four different types and are all identified by a SHA1 hash of their contents. The four types are blobs, trees, commits and tags.
Blob
A blob is the simplest type of object. It stores the content of a file. So for each distinct file content you store within your Git repository, a single blob object exists in the object database. As it stores only the file content, and not metadata like file names, this is also the mechanism that prevents files with identical content from being stored multiple times.
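To make the "identical content, single blob" point concrete, here is a minimal sketch of how a blob id can be computed. It assumes the usual object layout, in which Git hashes a short header ("blob", the size, a NUL byte) followed by the raw content; the function name blob_id is just an illustrative choice.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Return the SHA-1 id Git would assign to a blob with this content."""
    # Git hashes a short header ("blob <size>\0") followed by the raw bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# The file name is not part of the input, so two files with the same
# content always end up as the same blob.
first = blob_id(b"hello world\n")
second = blob_id(b"hello world\n")
print(first == second)
```

On a real repository, git hash-object should report the same id for a file with that content.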
Tree
Going one level up, the tree is the object that puts the blobs into a directory structure. A single tree corresponds to a single directory. It is essentially a list of files and subdirectories, with each entry containing a file mode, a file or directory name, and a reference to the Git object that belongs to the entry. For subdirectories, this reference points to the tree object that describes the subdirectory; for files, this reference points to the blob object storing the file contents.
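The entry layout described above can be sketched roughly as follows. This is a simplified illustration, not Git's own code: the helper names are made up, and the sorting is plain name order, whereas real Git sorts directory names as if they ended in a slash.

```python
import hashlib

def object_id(kind: bytes, body: bytes) -> str:
    """Hash an object body together with its '<type> <size>\\0' header."""
    header = kind + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

def tree_body(entries) -> bytes:
    """Serialize (mode, name, hex object id) entries into a tree body."""
    body = b""
    # Simplified ordering: plain sort by name (an assumption of this sketch).
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1]):
        # Each entry is "<mode> <name>\0" followed by the 20 raw id bytes.
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hex_id)
    return body

# A directory holding one regular file (mode 100644) that points to a blob.
readme_blob = "3b18e512dba79e4c8300dd08aeb37f8e728b8dad"
root = tree_body([("100644", "README", readme_blob)])
print(object_id(b"tree", root))
```

Because the tree stores the ids of its entries, changing a single file changes the blob id, which changes the tree that contains it, and so on up to the root tree.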
Commit
Blobs and trees are already enough to represent a complete file system. To add the versioning on top of that, we have commit objects. Commit objects are created whenever you commit something in Git. Each commit represents a snapshot in the history of revisions.
It contains a reference to the tree object describing the root directory of the repository. This also means that every commit that actually introduces some changes at least requires a new tree object (likely more).
A commit also contains a reference to its parent commits. While there is usually just a single parent (for a linear history), a commit can have more than one parent, in which case it’s usually called a merge commit. Most workflows will only ever make you do merges with two parents, but you can really have any other number too.
And finally, a commit also contains the metadata you expect a commit to have: author and committer (each with name, email and time) and of course the commit message.
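Putting the pieces of a commit together, a commit object can be sketched as a small text document. The layout below follows what git cat-file -p prints for a commit; the identity, timestamp and parent ids are made-up placeholders, and the helper names are only illustrative.

```python
import hashlib

def commit_body(tree, parents, author, committer, message) -> bytes:
    """Build the text of a commit object: tree, parents, identities, message."""
    lines = ["tree " + tree]
    # One "parent" line per parent: none for a root commit, several for a merge.
    lines += ["parent " + p for p in parents]
    lines.append("author " + author)
    lines.append("committer " + committer)
    return ("\n".join(lines) + "\n\n" + message).encode("utf-8")

def commit_id(body: bytes) -> str:
    """Hash the commit body together with its 'commit <size>\\0' header."""
    header = b"commit " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()

# Placeholder values, purely for illustration.
who = "Jane Doe <jane@example.com> 1700000000 +0000"
empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
parent_a = "0123456789abcdef0123456789abcdef01234567"
parent_b = "89abcdef0123456789abcdef0123456789abcdef"

merge = commit_body(empty_tree, [parent_a, parent_b], who, who, "Merge topic\n")
print(merge.decode("utf-8"))
print(commit_id(merge))
```

Since the parent ids are part of the hashed text, a commit id pins down its entire ancestry, not just its own snapshot.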
That is all that is necessary to have a full version control system; but of course there is one more object type:
Tag
Tag objects are one way to store tags. To be precise, tag objects store annotated tags, which are tags that have, similar to commits, some meta information. They are created by git tag -a (or when creating a signed tag) and require a tag message. They also contain a reference to the commit object they are pointing at, and a tagger (name and time).
References
Up until now, we have a full versioning system, with annotated tags, but all our objects are identified by their SHA1 hash. That’s of course a bit annoying to use, so there is one more mechanism to make things easier: references.
References come in different flavors, but the most important thing about them is this: they are simple text files containing 40 characters, namely the SHA1 hash of the object they are pointing to. Because they are this simple, they are very cheap, so working with many references is no problem at all. They create practically no overhead and there is no reason not to use them.
There are usually three “types” of references: branches, tags and remote branches. They all work the same way and point to commit objects, except for annotated tags, which point to tag objects (normal tags are just commit references too, though). The difference between them is how you create them, and in which subpath of /refs/ they are stored. I won’t cover this now, as it is explained in nearly every Git tutorial; just remember: references, e.g. branches, are extremely cheap, so don’t hesitate to create them for just about everything.
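To show how little a reference really is, here is a small sketch that creates and reads one by hand in a throwaway directory. The helper names write_ref and read_ref and the object id are made up; in a real repository you would use git branch or git update-ref rather than writing these files yourself, and this sketch ignores symbolic references like HEAD.

```python
import tempfile
from pathlib import Path

def write_ref(git_dir: Path, ref_name: str, object_id: str) -> None:
    """Create or move a reference by writing the object id into its file."""
    path = git_dir / ref_name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(object_id + "\n")

def read_ref(git_dir: Path, ref_name: str) -> str:
    """Return the object id a loose reference points to."""
    return (git_dir / ref_name).read_text().strip()

with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp) / ".git"
    commit = "0123456789abcdef0123456789abcdef01234567"  # placeholder id
    # "Creating a branch" amounts to writing one tiny file.
    write_ref(git_dir, "refs/heads/feature", commit)
    print(read_ref(git_dir, "refs/heads/feature"))
```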
Compression
Now because torek mentioned something about Git’s compression in his answer, I want to clarify this a bit. Unfortunately he mixed a few things up.
So, usually for new repositories, all Git objects are stored in .git/objects as files identified by their SHA1 hash. The first two characters of the hash are split off the file name and used as a folder name instead, which partitions the files into multiple folders, just so it gets a bit easier to navigate.
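The folder layout can be sketched like this. The example writes a blob into a temporary objects directory and reads it back; it assumes that loose objects are stored zlib-compressed with their header in front, and the function names are only illustrative.

```python
import hashlib
import tempfile
import zlib
from pathlib import Path

def loose_object_path(objects_dir: Path, object_id: str) -> Path:
    """First two hex characters name the folder, the remaining 38 the file."""
    return objects_dir / object_id[:2] / object_id[2:]

def write_loose_blob(objects_dir: Path, content: bytes) -> str:
    """Store content as a loose blob and return its object id."""
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    object_id = hashlib.sha1(store).hexdigest()
    path = loose_object_path(objects_dir, object_id)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(zlib.compress(store))  # assumption: zlib-deflated file
    return object_id

def read_loose_object(objects_dir: Path, object_id: str) -> bytes:
    """Return the object body, i.e. everything after the header's NUL byte."""
    raw = zlib.decompress(loose_object_path(objects_dir, object_id).read_bytes())
    return raw.split(b"\0", 1)[1]

with tempfile.TemporaryDirectory() as tmp:
    objects = Path(tmp) / ".git" / "objects"
    oid = write_loose_blob(objects, b"hello world\n")
    print(loose_object_path(objects, oid).relative_to(tmp))
    print(read_loose_object(objects, oid))
```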
At some point, when the history gets bigger or when it is triggered by something else, Git will start to pack objects. Each loose object is already compressed on its own; packing goes further by combining multiple objects into a single pack file. How exactly this works is not really that important; it reduces the number of individual object files and stores the objects efficiently in single, indexed archives (at this point, Git will also use delta compression, btw.). The pack files are then stored in .git/objects/pack and can easily get a few hundred MiB in size.
For references, the situation is somewhat similar, although a lot simpler. All current references are stored in .git/refs, e.g. branches in .git/refs/heads, tags in .git/refs/tags and remote branches in .git/refs/remotes/<remote>. As mentioned above, they are simple text files containing only the 40-character identifier of the object they are pointing at.
At some point, Git will move older references, of any type, into a single lookup file: .git/packed-refs. That file is just a long list of hashes and reference names, one entry per line. References that are kept in there are removed from the refs directory.
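Because the packed-refs file is just a list of "hash, space, name" lines, reading it takes only a few lines of code. The sketch below is a simplified parser with made-up sample data; it skips the comment header and the lines starting with "^", which record the commit an annotated tag ultimately points to.

```python
def parse_packed_refs(text: str) -> dict:
    """Map reference names to object ids from packed-refs content."""
    refs = {}
    for line in text.splitlines():
        # Skip blank lines, the "# pack-refs with: ..." header and the
        # "^<id>" lines that follow annotated tags.
        if not line or line[0] in "#^":
            continue
        object_id, name = line.split(" ", 1)
        refs[name] = object_id
    return refs

# Made-up sample content, for illustration only.
sample = (
    "# pack-refs with: peeled fully-peeled sorted\n"
    "0123456789abcdef0123456789abcdef01234567 refs/heads/main\n"
    "89abcdef0123456789abcdef0123456789abcdef refs/tags/v1.0\n"
    "^0123456789abcdef0123456789abcdef01234567\n"
)
for name, object_id in parse_packed_refs(sample).items():
    print(name, "->", object_id)
```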
Reflogs
Torek mentioned those as well. Reflogs are essentially just logs for references: they keep track of what happens to references. If you do anything that affects a reference (commit, checkout, reset, etc.), a new log entry is added simply to record what happened. This also provides a way to go back after you did something wrong. A common use case, for example, is to access the reflog after accidentally resetting a branch to somewhere it wasn’t supposed to go. You can then use git reflog to look at the log and see where the reference was pointing before. As Git objects that are no longer referenced are not immediately deleted (and objects that are part of the history are never deleted), you can usually restore the previous situation easily.
Reflogs are, however, local: they only keep track of what happens in your local repository. They are not shared with remotes, and are never transferred. A freshly cloned repository will have a reflog with a single entry, namely the clone action. Entries also expire after a while (90 days by default), after which older actions are pruned, so reflogs won’t become a storage problem.
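For the curious, a reflog is itself a plain text file (under .git/logs), with one line per change: old id, new id, who did it and when, then a tab and a short message. The following sketch parses one such line; the sample entry is made up and the function name is only illustrative.

```python
def parse_reflog_line(line: str) -> dict:
    """Split one reflog line into its fields."""
    # Everything before the tab is metadata, everything after is the message.
    meta, _, message = line.rstrip("\n").partition("\t")
    old_id, new_id, rest = meta.split(" ", 2)
    # The identity may contain spaces, so split timestamp and zone from the right.
    identity, timestamp, zone = rest.rsplit(" ", 2)
    return {
        "old": old_id,
        "new": new_id,
        "who": identity,
        "time": int(timestamp),
        "zone": zone,
        "message": message,
    }

# Made-up entry, shaped like the single line a fresh clone would have.
entry = parse_reflog_line(
    "0000000000000000000000000000000000000000 "
    "0123456789abcdef0123456789abcdef01234567 "
    "Jane Doe <jane@example.com> 1700000000 +0000"
    "\tclone: from https://example.com/repo.git\n"
)
print(entry["old"], "->", entry["new"], "|", entry["message"])
```

The "old" id is what makes recovery possible: after a bad reset, it still names the commit the branch pointed to before.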
Some final words
So, getting back to your actual question. When you clone a repository, Git will usually already receive the repository in a packed format, which saves transfer time. References are very cheap, so they are never the cause of big repositories. However, because of Git’s nature, a single current commit object is the entry point to a whole acyclic graph that eventually reaches the very first commit, the very first tree, and the very first blob. So a repository will always contain all the information for all revisions. That is what makes repositories with a long history big. Unfortunately, there is not really much you can do about it. Well, you could cut off older history at some point (you do this by cloning with the --depth parameter), but that will leave you with a shallow repository whose history is incomplete.
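The "one commit drags the whole history along" point can be illustrated with a toy graph. The sketch below uses a hand-written dictionary of made-up commit names instead of a real repository and simply follows parent links from a tip.

```python
def reachable_commits(tip: str, parents: dict) -> set:
    """Collect every commit reachable from tip by following parent links."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen

# Toy history: A is the root, C and D both branch off B, E merges them.
history = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "D": ["B"],
    "E": ["C", "D"],
}
print(sorted(reachable_commits("E", history)))
```

Whatever commit a branch points to, the walk ends up at the root commit A, which is why a normal clone needs every object along the way.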
And as for your second question: as I explained above, branches are just references to commits, and references are only pointers to Git objects. So no, there is not really any metadata about branches you can get from them. The only thing that might give you an idea is the first commit you made when branching off in your history. But having branches does not automatically mean that there is actually a branch kept in the history (fast-forward merging and rebasing work against it), and just because there is some branching-off in the history, that does not mean that the branch (the reference, the pointer) still exists.