It's my understanding that the loose objects stored in a [bare] git repository are compressed...
They are. But they are zlib-deflate compressed.
...so why does git pack-objects (and all the related repack and gc commands) have a really long Compressing objects stage?
These commands (git pack-objects and git repack, anyway; git gc just runs git repack for you) combine many objects into one pack file.
A pack file is a different way of compressing objects. A loose object is standalone: Git needs only to read the loose object and run a zlib inflate pass over it to get the uncompressed data for that object. A pack file, by contrast, contains many objects, and some to many of those objects are delta-compressed first.
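To make the loose-object half of that concrete, here is a minimal Python sketch. It assumes the usual loose-object layout: a short header of the form "type size", a NUL byte, and then the payload, with the whole thing zlib-deflated. The function names encode_loose and decode_loose are my own, not anything in Git.

```python
import zlib


def encode_loose(obj_type: str, payload: bytes) -> bytes:
    """Build the deflated bytes of a loose object: header, NUL, payload."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return zlib.compress(header + payload)


def decode_loose(raw: bytes) -> tuple[str, bytes]:
    """One zlib inflate pass recovers the object's type and its full data."""
    inflated = zlib.decompress(raw)
    # The header ends at the first NUL byte; the payload may contain more NULs.
    header, _, payload = inflated.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(payload):
        raise ValueError("size in header does not match payload length")
    return obj_type, payload


if __name__ == "__main__":
    raw = encode_loose("blob", b"hello\n")
    print(decode_loose(raw))
```

Nothing here depends on any other object, which is the point: reading a loose object is one inflate and you are done.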
Delta compression works by saying, in effect: To produce this object, first produce that other object. Then add these bytes here and/or remove N bytes here. Repeat this adding and/or removing until I'm done with a list of deltas. (The delta instructions themselves can then be zlib-deflated as well.) You might recognize this as a sort of diff, and in fact, some non-Git version control systems really use diff, or their own internal diff engine, to produce their delta-compressed files.
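Here is a small sketch of that idea in Python. It is not Git's delta encoder or its on-disk format. It leans on difflib from the standard library to find the common parts, and the two instruction kinds ("copy" a range from the base, "insert" some literal bytes) and the function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as a list of copy-from-base and insert-literal steps."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # These bytes already exist in the base: record where and how many.
            ops.append(("copy", i1, i2 - i1))
        elif tag in ("replace", "insert"):
            # These bytes are new: they have to be stored literally.
            ops.append(("insert", target[j1:j2]))
        # "delete" needs no instruction: those base bytes are just never copied.
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """First produce the base, then replay the instructions to get the target."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


if __name__ == "__main__":
    old = b"line one\nline two\nline three\n"
    new = b"line one\nline 2\nline three\nline four\n"
    delta = make_delta(old, new)
    print(delta)
    print(apply_delta(old, delta) == new)
```

Only the "insert" payloads cost real space; the "copy" steps are a couple of numbers each, which is where the savings come from when two objects are mostly the same.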
Traditionally, this uses the observation that some file (say, foo.cc or foo.py) tends to change over time by adding and/or removing a few lines somewhere in the file, but keeping the bulk of it the same. If we can say: take all of the previous version, but then add and/or remove these lines, we can store both versions in not much more space than it takes to store one of them.
We can, of course, build a delta-compressed file atop a previous delta-compressed file: Take the result of expanding the previous delta-compressed file, and apply these deltas. These make delta chains, which can be as long as you like, perhaps going all the way back to the point at which the file is first created.
Some (non-Git) systems stop here, in one of two ways. Either each file version is stored as a change to the previous version, or, every time you store a file, the system stores the latest version in full and turns the previous full copy (which used to be the latest, and hence was the full copy) into the delta needed to convert "latest" to "previous". The first method is called forward delta storage, while the second is of course reverse delta storage. Forward deltas tend to be at a terrible disadvantage in that extracting the latest version of a file requires extracting the first version, then applying a very long sequence of deltas, which takes ages.
RCS therefore uses reverse deltas, which means that getting the latest version is fast; it's getting a very old version that's slow. (However, for technical reasons, this only works on what RCS calls the trunk. RCS's "branches" use forward deltas instead.) Mercurial uses forward deltas, but occasionally stores a new full copy of the file, so as to keep the delta chain length short. One system, SCCS, uses a technique that SCCS calls interleaved deltas, which gives linear time for extracting any file (but is harder to generate).
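The difference between these schemes is easy to see by counting how many deltas must be applied to extract a given version. The sketch below is a deliberately simplified model, not how any of these systems is implemented: versions are numbered from 0, and the "periodic full copy" variant assumes a full copy every snapshot_every versions, a made-up fixed interval that does not model Mercurial's real policy.

```python
def forward_steps(version: int) -> int:
    """Forward deltas: start at version 0 and apply one delta per later version."""
    return version


def reverse_steps(version: int, latest: int) -> int:
    """Reverse deltas: start at the latest version and walk backwards."""
    return latest - version


def snapshot_steps(version: int, snapshot_every: int) -> int:
    """Forward deltas with a full copy stored every snapshot_every versions
    (hypothetical fixed interval): start at the nearest earlier full copy."""
    return version % snapshot_every


if __name__ == "__main__":
    latest = 999
    for v in (0, 500, latest):
        print(v, forward_steps(v), reverse_steps(v, latest), snapshot_steps(v, 100))
```

With a thousand versions, the forward scheme needs 999 delta applications for the newest version, the reverse scheme needs none, and the periodic-full-copy scheme never needs more than 99 for any version.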
Git, however, doesn't store files as files. You already know that file data is stored as a blob object, which is initially just zlib-deflated, and otherwise intact. Given a collection of objects, some of which are file data and some of which aren't (they are commits, trees, or annotated tag objects), it's not at all obvious which data belong to which file(s). So what Git does is to find a likely candidate: an object that closely resembles some other object is probably best expressed by saying start with the other object, then make these delta changes.
Much of the CPU time spent in compression lies in finding good chains. If the version control system picks files (or objects) poorly, the compression will not be very good. Comparing every object against every other object would make the time complexity really crazy, so Git uses a bunch of heuristics, including peeking into tree objects to reconstruct file names (base names only, not full path names). But even with the heuristics, finding good delta chains is expensive. Exactly how expensive is tunable, via "window" and "depth" settings: the --window and --depth options to git pack-objects and git repack, or the pack.window and pack.depth configuration values.
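To see why the window and depth matter, here is a toy version of such a search in Python. It is only loosely modeled on what Git does and is not Git's algorithm: the sort key (base name, then size, largest first), the cost function built on difflib, and the rule that a delta must save at least half the object's size are all assumptions of mine, as are the names delta_cost and choose_bases.

```python
from difflib import SequenceMatcher


def delta_cost(base: bytes, target: bytes) -> int:
    """Rough stand-in for delta size: target bytes with no match in base."""
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return len(target) - matched


def choose_bases(objects, window=10, depth=50):
    """objects: list of (object_id, path_hint, data) tuples.

    Returns {object_id: base_object_id or None (store in full)}.
    """
    # Sort so that objects likely to resemble each other end up adjacent.
    ordered = sorted(objects, key=lambda o: (o[1].rsplit("/", 1)[-1], -len(o[2])))
    base_of, chain_len = {}, {}
    for i, (oid, _path, data) in enumerate(ordered):
        # Assumed rule: a delta is only worth it if it saves at least half.
        best, best_cost = None, len(data) // 2
        # Only the previous `window` objects are tried as bases, so the work
        # is roughly (number of objects) x window delta attempts.
        for cand_id, _cand_path, cand_data in ordered[max(0, i - window):i]:
            if chain_len[cand_id] >= depth:
                continue  # using this base would make the chain too long
            cost = delta_cost(cand_data, data)
            if cost < best_cost:
                best, best_cost = cand_id, cost
        base_of[oid] = best
        chain_len[oid] = 0 if best is None else chain_len[best] + 1
    return base_of
```

A bigger window means more candidates tried per object, so better odds of a good base but more CPU time. A bigger depth allows longer chains, so smaller packs but more deltas to apply when an object is later extracted.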
For (much) more about pack files, which have undergone several revisions over time, see the Documentation/technical directory in Git.