I have been reading the git book. In this book I learned that git functions through taking snapshots of the files you work with, instead of deltas like other VCSs. This has some excellent benefits.

However, this leaves me wondering: over time, shouldn't the .git/ folder containing these snapshots blow up to be too large? There are repositories with 10,000 or more commits and hundreds of files. Why doesn't git blow up in size?

  • Not precisely a duplicate, but explains where Git sneaks in delta-compression: https://stackoverflow.com/q/28222703/1256452 – torek Aug 16 '18 at 18:24
  • Possible duplicate of [How does git store files?](https://stackoverflow.com/questions/8198105/how-does-git-store-files) – phd Aug 16 '18 at 20:33
  • https://stackoverflow.com/questions/33455666/git-why-exactly-is-the-claim-git-is-based-on-differences-between-files-wrong – phd Aug 16 '18 at 20:34

1 Answer

The trick here is that this claim:

git functions through taking snapshots of the files you work with, instead of deltas like other VCSs

is both true and false!

Git's main object database—a key-value store—stores four object types. We don't need to go into all the details here; we can just note that files—or more precisely, files' contents—are stored in blob objects. Commit objects then refer (indirectly) to the blob objects, so if you have some file content named bigfile.txt and store it in 1000 different commits, there's only one object in all of those commits, re-used 1000 times. (In fact, if you rename it to hugefile.txt without changing its content, new commits continue to re-use the original object—the name is stored separately, in tree objects.)
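To make the re-use concrete, here is a minimal sketch of how a blob's key is derived, assuming a repository that uses SHA-1 object IDs (the default; SHA-256 repositories differ). The helper name `blob_id` and the sample contents are made up for illustration. The point is that the key depends only on the content, never on the file name:

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID for a blob holding `content`.

    The hash covers a small header ("blob <size>\\0") plus the raw bytes.
    The file name is not part of the input at all.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical contents, for illustration only.
bigfile = b"the same bytes in every commit\n"
hugefile = bigfile  # "renamed" file: different name, identical content

print(blob_id(bigfile) == blob_id(hugefile))             # same key, one blob
print(blob_id(bigfile) == blob_id(bigfile + b"edit\n"))  # changed content, new blob
```

Since the key is a function of the content alone, storing the same content in 1000 commits just means 1000 tree entries pointing at one key.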

That's all fine, but over time, most files in most projects do accumulate changes. Other VCSes will, instead of storing a whole new copy of each file, make use of delta encoding to avoid storing every version of every file separately. If a blob object is a complete, intact (albeit zlib-deflated) file, your question boils down to this: wouldn't the accumulation of separate blob objects make the object database grow much faster than a VCS that uses delta compression?

The answer is that it would, but Git does use delta compression. It just does it below the level of the object database. Objects are logically independent. You give Git the key—the hash ID—for some object, and you get the entire object back. But only so-called loose objects are stored as a simple zlib-deflated file.
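As a rough sketch of what "a simple zlib-deflated file" means for a loose object, the following round-trips a blob in memory. It assumes SHA-1 object IDs; the function names are hypothetical, and nothing here touches a real repository:

```python
import hashlib
import zlib
from typing import Tuple


def encode_loose(content: bytes) -> Tuple[str, bytes]:
    """Return (object ID, bytes as they would sit on disk) for a blob."""
    store = b"blob " + str(len(content)).encode("ascii") + b"\0" + content
    return hashlib.sha1(store).hexdigest(), zlib.compress(store)


def decode_loose(data: bytes) -> bytes:
    """Inflate a loose blob and strip its header, returning the content."""
    store = zlib.decompress(data)
    header, _, content = store.partition(b"\0")
    kind, size = header.split(b" ")
    if kind != b"blob" or int(size) != len(content):
        raise ValueError("not a well-formed blob")
    return content


def loose_path(object_id: str) -> str:
    """Loose objects live under a directory named by the first two hex digits."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


oid, on_disk = encode_loose(b"hello\n")
print(loose_path(oid))
print(decode_loose(on_disk))
```

Every loose object is a complete copy, which is exactly why a repository made only of loose objects would grow the way the question fears.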

As Jonathan Brink noted, git gc cleans up unused objects. This does not help with retained objects, such as older versions of hugefile.txt or whatever. But git gc—which Git runs automatically whenever Git thinks it might be appropriate—does more than just prune unreferenced objects. It also runs git repack, which builds or re-builds pack files.

A pack file stores multiple objects, and inside a pack file, objects are delta-compressed. Git pores over the collection of all objects that will go into a single pack file, and out of all N objects, picks some number B of them to use as delta bases. These objects are merely zlib-deflated. The remaining N-B objects are encoded as deltas, against either the bases, or against earlier delta-encoded objects that use those bases. Hence, given a key for an object stored in a pack file, Git can find the stored object or delta, and if what is stored is a delta, Git can also find the underlying objects, all the way down to the delta bases, and hence extract the complete object.
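The lookup described above can be sketched with a toy model. This is not Git's on-disk pack format: the dictionary layout, the `("copy", ...)` / `("insert", ...)` instruction tuples, and the keys `v1`, `v2`, `v3` are all assumptions made for illustration (real keys are hash IDs). What it does show is the idea of following a delta chain down to a base:

```python
import zlib
from typing import Dict, List

# Toy pack: key -> ("base", deflated_bytes) or ("delta", base_key, ops).
# A delta is a list of instructions applied to the base's bytes:
#   ("copy", offset, length)  copies a slice of the base
#   ("insert", data)          emits literal bytes
Pack = Dict[str, tuple]


def apply_delta(base: bytes, ops: List[tuple]) -> bytes:
    """Rebuild an object from its base and a list of copy/insert instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:  # "insert"
            out += op[1]
    return bytes(out)


def read_object(pack: Pack, key: str) -> bytes:
    """Return the complete object for `key`, resolving deltas recursively."""
    entry = pack[key]
    if entry[0] == "base":
        return zlib.decompress(entry[1])
    _, base_key, ops = entry
    return apply_delta(read_object(pack, base_key), ops)


pack: Pack = {
    "v1": ("base", zlib.compress(b"line one\nline two\n")),
    # v2 keeps the first 9 bytes of v1 and replaces the second line.
    "v2": ("delta", "v1", [("copy", 0, 9), ("insert", b"line 2\n")]),
    # v3 is a delta against v2, which is itself a delta against v1.
    "v3": ("delta", "v2", [("copy", 0, 16), ("insert", b"line three\n")]),
}

print(read_object(pack, "v3"))
```

The caller still asks for a key and gets a whole object back; the chain of deltas stays hidden below that interface.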

Hence, Git does use delta encoding, but only within a pack file. It's also based not on files but rather on objects, so (at least in theory) if you have huge trees, or long texts inside commit messages, those can be compressed against each other as well.

Even this is not quite the whole story though: for transmission over networks, Git will build so-called thin packs. The key difference between a regular pack and a thin pack has to do with those delta bases. Given a regular pack file and a hash ID, Git can always retrieve the complete object from that file alone. With a thin pack, however, Git is allowed to use objects that are not in that pack file (as long as the other Git, to which the thin pack is being transported, has claimed that it has those objects). The receiver is required to "fix" the thin pack on receipt, but this allows git fetch and git push to send deltas rather than complete snapshots.
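As a very rough sketch of that "fix" step, continuing with a toy model rather than Git's real pack format: a thin pack is one whose deltas name bases it does not contain, and fixing it means adding those bases from what the receiver already has. The names `missing_bases`, `fix_thin_pack`, and `local_store`, and the entry layout, are all hypothetical:

```python
from typing import Dict, Set

# Toy entries: ("base", raw_bytes) or ("delta", base_key, ops).
Pack = Dict[str, tuple]


def missing_bases(pack: Pack) -> Set[str]:
    """Keys that some delta in the pack depends on but the pack lacks."""
    return {
        entry[1]
        for entry in pack.values()
        if entry[0] == "delta" and entry[1] not in pack
    }


def fix_thin_pack(pack: Pack, local_store: Dict[str, bytes]) -> Pack:
    """Return a self-contained copy of `pack`.

    Each missing base is copied in as a full object from `local_store`,
    the receiver's existing objects. A KeyError here would mean the sender
    assumed an object the receiver does not actually have.
    """
    fixed = dict(pack)
    for key in missing_bases(pack):
        fixed[key] = ("base", local_store[key])
    return fixed


# The sender transmits only a delta, because the receiver said it has "v1".
thin = {"v2": ("delta", "v1", [("copy", 0, 9), ("insert", b"line 2\n")])}
local_store = {"v1": b"line one\nline two\n"}

fixed = fix_thin_pack(thin, local_store)
print(sorted(missing_bases(thin)), sorted(missing_bases(fixed)))
```

After the fix, the pack again satisfies the regular-pack rule: every object in it can be rebuilt from that pack alone.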

torek