
It's my understanding that the loose objects stored in a [bare] git repository are compressed...

...so why does git pack-objects (and all the related repack and gc commands) have a really long Compressing objects stage? Shouldn't it just be copying them?

For example:

objects/75/f0debd8e421ab3f9cc8b6aeb539796ae86b705 is already compressed. In the pack file, this file should be copied byte-for-byte into the spot immediately after its header, because the pack-file format specifies that the compressed data goes there... So why does it need to be re-compressed if it's already compressed?

If it's perhaps trying to use a different compression... how can I tell it not to, and instead just use the file as-is?

Updated notes:

  • I have configured the relevant settings and options so that delta compression effectively does not happen. Delta compression is not useful for storing 2 TB of .NEF images.
iAdjunct

2 Answers


It's my understanding that the loose objects stored in a [bare] git repository are compressed...

They are. But they are zlib-deflate compressed.

...so why does git pack-objects (and all the related repack and gc commands) have a really long Compressing objects stage?

These commands—git pack-objects and git repack, anyway; git gc just runs git repack for you—combine many objects into one pack file.

A pack file is a different way of compressing objects. A loose object is standalone: Git needs only to read the loose object and run a zlib inflate pass over it to get the uncompressed data for that object. A pack file, by contrast, contains many objects, with some (often many) of those objects first being delta-compressed.
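
To make that concrete, here's a minimal sketch (Python, with a hypothetical object path borrowed from the question) of what reading a loose object involves: a single zlib inflate, after which the data begins with an ASCII header such as `blob 12345\0` followed by the raw content.

```python
import zlib

# Hypothetical path, borrowed from the question; any .git/objects/xx/... file works.
path = ".git/objects/75/f0debd8e421ab3f9cc8b6aeb539796ae86b705"

with open(path, "rb") as f:
    data = zlib.decompress(f.read())          # one inflate pass, nothing else

# The inflated data starts with an ASCII header "<type> <size>\0", then the content.
header, _, body = data.partition(b"\0")
obj_type, size = header.split()
print(obj_type, int(size), len(body))         # e.g. b'blob' 12345 12345
```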

Delta compression works by saying, in effect: To produce this object, first produce that other object. Then add these bytes here and/or remove N bytes here. Repeat this adding and/or removing until I'm done with a list of deltas. (The delta instructions themselves can then be zlib-deflated as well.) You might recognize this as a sort of diff, and in fact, some non-Git version control systems really use diff, or their own internal diff engine, to produce their delta-compressed files.
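
As an illustration only (a simplified toy model, not Git's actual binary delta encoding), a delta can be thought of as a list of "copy this range from the base" and "insert these literal bytes" instructions:

```python
# Simplified model of delta application: NOT Git's real pack delta format,
# just the idea of "copy from base" and "insert literal bytes" instructions.

def apply_delta(base: bytes, delta: list) -> bytes:
    out = bytearray()
    for op, *args in delta:
        if op == "copy":            # copy a byte range out of the base object
            offset, length = args
            out += base[offset:offset + length]
        elif op == "insert":        # add new literal bytes
            out += args[0]
    return bytes(out)

base = b"hello world\n"
delta = [("copy", 0, 6), ("insert", b"there\n")]   # keep "hello ", replace the rest
print(apply_delta(base, delta))                    # b'hello there\n'
```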

Traditionally, this uses the observation that some file (say, foo.cc or foo.py) tends to change over time by adding and/or removing a few lines somewhere in the file, but keeping the bulk of it the same. If we can say: take all of the previous version, but then add and/or remove these lines, we can store both versions in much less space than it takes to store one of them.

We can, of course, build a delta-compressed file atop a previous delta-compressed file: Take the result of expanding the previous delta-compressed file, and apply these deltas. These make delta chains, which can be as long as you like, perhaps going all the way back to the point at which the file is first created.
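
Continuing the same toy model (again, just a sketch, not Git's real data structures): resolving an object at the end of a delta chain means walking back to a full copy and replaying each delta in turn:

```python
# Toy delta-chain resolution (same simplified model as the previous sketch;
# not Git's real format). An entry is ("full", data) or ("delta", base_key, ops).

def apply_delta(base, ops):
    out = bytearray()
    for op, *args in ops:
        if op == "copy":
            offset, length = args
            out += base[offset:offset + length]
        else:                                   # "insert"
            out += args[0]
    return bytes(out)

def resolve(store, key):
    kind, *rest = store[key]
    if kind == "full":
        return rest[0]
    base_key, ops = rest
    return apply_delta(resolve(store, base_key), ops)   # expand the base first

store = {
    "v1": ("full",  b"hello world\n"),
    "v2": ("delta", "v1", [("copy", 0, 6), ("insert", b"there\n")]),
    "v3": ("delta", "v2", [("copy", 0, 12), ("insert", b"again\n")]),
}
print(resolve(store, "v3"))    # b'hello there\nagain\n'
```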

Some (non-Git) systems stop here: either each new version of a file is stored as a change to the previous version, or, every time you store a file, the system stores the new version in full and turns the previous full copy (which used to be the latest, and hence was the full copy) into the delta needed to convert "latest" to "previous". The first method is called forward delta storage, while the second is of course reverse delta storage. Forward deltas tend to be at a terrible disadvantage in that extracting the latest version of a file requires extracting the first version, then applying a very long sequence of deltas, which takes ages. RCS therefore uses reverse deltas, which means that getting the latest version is fast; it's getting a very old version that's slow. (However, for technical reasons, this only works on what RCS calls the trunk. RCS's "branches" use forward deltas instead.) Mercurial uses forward deltas, but occasionally stores a new full copy of the file, so as to keep the delta chain length short. One system, SCCS, uses a technique that SCCS calls interleaved deltas, which gives linear time for extracting any file (but is harder to generate).

Git, however, doesn't store files as files. You already know that file data is stored as a blob object, which is initially just zlib-deflated, and otherwise intact. Given a collection of objects, some of which are file data and some of which aren't (are commits, trees, or annotated tag objects), it's not at all obvious which data belong to which file(s). So what Git does is to find a likely candidate: some object that seems to resemble some other object a lot is probably best expressed by saying start with the other object, then make these delta changes.

Much of the CPU time spent in compression lies in finding good chains. If the version control system picks files (or objects) poorly, the compression will not be very good. Git uses a bunch of heuristics, including peeking into tree objects to reconstruct file names (base names only—not full path names), since otherwise the time complexity gets really crazy. But even with the heuristics, finding good delta chains is expensive. Exactly how expensive it is can be tuned via the "window" and "depth" settings.
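
To give a rough idea of what "window" and "depth" control (a heavily simplified sketch that bears no resemblance to Git's actual heuristics or cost function), the search is roughly: for each object, consider only the last `window` candidates as possible delta bases, and reject any base whose chain is already `depth` long:

```python
# Heavily simplified sketch of a windowed delta search (NOT Git's real code):
# only the last `window` objects are considered as bases, and a candidate is
# rejected if using it would exceed the chain `depth` limit.

def delta_size(base: bytes, target: bytes) -> int:
    # Stand-in cost function: pretend the delta costs the number of differing bytes.
    return sum(a != b for a, b in zip(base, target)) + abs(len(base) - len(target))

def plan_deltas(objects, window=10, depth=50):
    plan = []                       # index of chosen base, or None, per object
    chain_len = []                  # current chain length of each stored object
    for i, obj in enumerate(objects):
        best, best_cost = None, len(obj)          # storing it whole costs len(obj)
        for j in range(max(0, i - window), i):    # only look back `window` objects
            if chain_len[j] + 1 > depth:          # would make the chain too long
                continue
            cost = delta_size(objects[j], obj)
            if cost < best_cost:
                best, best_cost = j, cost
        plan.append(best)
        chain_len.append(0 if best is None else chain_len[best] + 1)
    return plan

objs = [b"aaaa", b"aaab", b"aaabcc", b"zzzz"]
print(plan_deltas(objs, window=2, depth=1))   # [None, 0, 0, None]
```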

For (much) more about pack files, which have undergone several revisions over time, see the Documentation/technical directory in Git.

torek
  • Thank you for the very verbose writeup. I updated the question with a rather important detail: I've completely disabled delta compression. Also, the pack file stores each file individually compressed (i.e. the headers in between are not within the compression). So, if `objects/00/112233....` is compressed, and that file is *verbatim* put into the correct spot in the pack file (after its headers), then why does it need to be re-compressed? – iAdjunct Jan 12 '19 at 16:48
  • Aha. How, exactly, did you disable all delta-compression? (Git probably—I haven't checked the source, which is pretty twisty—still runs a separate zlib deflate over the object that it first read from the loose object and inflated, since the "don't deltify" decision is—at least normally—based on the uncompressed size, which Git doesn't know until it's uncompressed the whole thing.) – torek Jan 12 '19 at 16:54
  • By setting the `depth` to 1, it makes it so the longest chain is of length 1... which is no delta. However, while it's doing the 'compressing objects' phase, it appears that no new files are edited. It can run for literally hours and make no difference in what files exist or how much space the repository takes, and it is definitely not loading them into RAM. – iAdjunct Jan 12 '19 at 17:06
  • It *does* have to inflate them to calculate their size, because the size field in the .pack file is the uncompressed size, but it shouldn't have to re-compress them. Ideally, while writing a file, it should decompress the file to calculate its size, write the header to the pack file, then byte-wise copy the loose object into the pack, then move on to the next object. But it has this 'compressing objects' phase first that seems completely unnecessary and doesn't appear to do anything useful at all, but takes days. – iAdjunct Jan 12 '19 at 17:08
  • OK. I think (speculation, because again I haven't gone diving into the source) that Git is still spending a lot of time looking for suitable delta chains, doing all the computation and only then deciding that, nope, this pair of objects can't be deltified because the chain just got too long. If that's the case, try setting `core.bigFileThreshold` to 1. Yes, there's no need to re-deflate, and that would probably be a worthwhile optimization. – torek Jan 12 '19 at 17:10
  • Also, but this would be a less general optimization, Git doesn't have to *fully* reinflate an object just to get its size: the size is in the header, so getting just a chunk would suffice. (Git also likes to recalculate the hash to verify data integrity, though, so it might want to do a full inflate anyway.) – torek Jan 12 '19 at 17:13
  • One thing to note is that there is a separate 'performing delta compression' step which no longer occurs. It used to spend a long time calculating delta compression, *then* say 'compressing objects' but now it skips the delta. This suggests that delta compression is *not* part of the 'compressing objects' status it reports. However, I believe you're right that setting `core.bigFileThreshold` would give it an even stronger hint to never, ever, ever try to delta-compress .nef files (or anything else). – iAdjunct Jan 12 '19 at 17:13
  • The size is in the header of the pack file? Or are you saying the size is also inside the zlib-deflated loose object? I know the first one; I was saying it has to inflate the loose object to get the size to put in the pack header. – iAdjunct Jan 12 '19 at 17:17
  • It's at the front of the loose object, as an ASCII text string: `b'blob 12345\0....'` for instance is a 12345-byte long blob object. Pack files encode the object type differently, probably because pack files were added after the initial loose object support, which was presumably more flexible in case more object types were to become useful. – torek Jan 12 '19 at 17:37

Note: regarding the `--depth` argument of `git pack-objects`, this is described by torek in "How to reduce the depth of an existing git clone?" as:

the maximum length of a delta chain, when Git uses its modified xdelta compression on Git objects stored in each pack-file.
This has nothing to do with the depth of particular parts of the commit DAG (as computed from each branch head).

As such, Git 2.32 (Q2 2021) is clearer:

Options to "git pack-objects"(man) that take numeric values like --window and --depth should not accept negative values; the input validation has been tightened.

See commit 6d52b6a, commit 49ac1d3, commit 953aa54, commit 9535678, commit 5489899 (01 May 2021) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 1af57f5, 11 May 2021)

pack-objects: clamp negative depth to 0

Signed-off-by: Jeff King

A negative delta depth makes no sense, and the code is not prepared to handle it.
If passed "--depth=-1" on the command line, then this line from break_delta_chains():

cur->depth = (total_depth--) % (depth + 1);

triggers a divide-by-zero.
This is undefined behavior according to the C standard, but on POSIX systems results in SIGFPE killing the process.
This is certainly one way to inform the user that the command was invalid, but it's a bit friendlier to just treat it as "don't allow any deltas", which we already do for --depth=0.
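
To see the arithmetic problem itself, here is the expression mimicked in Python (Python raises ZeroDivisionError rather than delivering SIGFPE, but the cause, depth + 1 == 0, is the same), along with the kind of clamp the commit title describes; this is an illustration, not the actual patch:

```python
# Mimicking the problematic expression in Python (illustration only; the real
# code is C, where the division by zero is undefined behavior / SIGFPE).
depth = -1
total_depth = 7

try:
    cur_depth = total_depth % (depth + 1)   # depth + 1 == 0  ->  division by zero
except ZeroDivisionError as e:
    print("boom:", e)

# The fix described by the commit title: clamp a negative depth to 0,
# which Git already treats as "don't allow any deltas".
depth = max(depth, 0)
cur_depth = total_depth % (depth + 1)       # now well-defined (anything % 1 == 0)
print(cur_depth)
```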

VonC