3

I'm trying to convert a large history from Perforce to Git, and one folder (now git branch) contains a significant number of large binary files. My problem is that I'm running out of memory while running git gc --aggressive.

My primary question here is whether repacking the repository is likely to have any meaningful effect on large binaries. Compressing them another 20% would be great. 0.2% isn't worth my effort. If not, I'll have them skipped over as suggested here.

For background, I successfully used git p4 to create the repository in a state I'm happy with. Because git p4 uses git fast-import behind the scenes, I want to optimize the repository before making it official; indeed, making any commit automatically triggered a slow gc --auto. It's currently ~35GB in a bare state.

The binaries in question seem to be, conceptually, the vendor firmware used in embedded devices. I think there are approximately 25 in the 400-700MB range and maybe a couple hundred more in the 20-50MB range. They might be disk images, but I'm unsure of that. There's a variety of versions and file types over time, and I see .zip, tgz, and .simg files frequently. As such, I'd expect the raw code to have significant overlap, but I'm not sure how similar the actual files appear at this point, as I believe these formats have already been compressed, right?

These binaries are contained in one (old) branch that will be used extremely rarely (to the point that questioning whether it belongs in version control at all is valid, but out of scope). Certainly the performance of that branch does not need to be great. But I'd like the rest of the repository to be reasonable.

Other suggestions for optimal packing or memory management are welcome. I admit I don't really understand the various git options being discussed on the linked question. Nor do I really understand what the --window and --depth flags are doing in git repack. But the primary question is whether the repacking of the binaries themselves is doing anything meaningful.

ojchase
  • Git 2.20 (Q4 2018) should optimize pack-files, making the repo more robust. See [my answer below](https://stackoverflow.com/a/52452349/6309). – VonC Sep 22 '18 at 00:09
  • See also with Git 2.38 (Q3 2022) the new setting [`git -c push.useBitmaps=false push`](https://stackoverflow.com/a/73012939/6309), to disable the use of bitmaps for `git push`. – VonC Jul 17 '22 at 15:20

4 Answers

4

My primary question here is whether repacking the repository is likely to have any meaningful effect on large binaries.

That depends on their contents. For the files you've outlined specifically:

I see .zip, tgz, and .simg files frequently.

Zipfiles and tgz (gzipped tar archive) files are already compressed and have terrible (i.e., high) Shannon entropy values—terrible for Git that is—and will not compress against each other. The .simg files are probably (I have to guess here) Singularity disk image files; whether and how they are compressed, I don't know, but I would assume they are. (An easy test is to feed one to a compressor, e.g., gzip, and see if it shrinks.)
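The "feed one to a compressor" test can be sketched in a few lines of Python. This is only an illustration: the helper name `compression_ratio` and the two sample inputs are made up for the example, and zlib (the same deflate family as gzip) stands in for whatever compressor you prefer.

```python
import os
import zlib


def compression_ratio(data: bytes, level: int = 6) -> float:
    """Return compressed size / original size (lower means more compressible)."""
    if not data:
        return 1.0
    return len(zlib.compress(data, level)) / len(data)


# Illustrative inputs: repetitive text versus random bytes.
# Random bytes are a stand-in for already-compressed payloads such as .zip or .tgz.
repetitive = b"firmware header block\n" * 50_000
random_like = os.urandom(1_000_000)

print(f"repetitive:  {compression_ratio(repetitive):.3f}")
print(f"random-like: {compression_ratio(random_like):.3f}")
```

A ratio close to 1.0 suggests the file is already compressed, so Git's own zlib and delta passes are unlikely to gain much on it. For real files you would read a sample of the file rather than generate the bytes.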

As such, I'd expect the raw code to have significant overlap, but I'm not sure how similar the actual files appear at this point, as I believe these formats have already been compressed, right?

Precisely. Storing them uncompressed in Git would thus, paradoxically, result in far greater compression in the end. (But the packing could require significant amounts of memory.)

If [this is probably futile], I'll have them skipped over as suggested here.

That would be my first impulse here. :-)

I admit I don't really understand the various git options being discussed on the linked question. Nor do I really understand what the --window and --depth flags are doing in git repack.

The various limits are confusing (and profuse). It's also important to realize that they don't get copied on clone, since they are in .git/config which is not a committed file, so new clones won't pick them up. The .gitattributes file is copied on clone and new clones will continue to avoid packing unpackable files, so it's the better approach here.

(If you care to dive into the details, you will find some in the Git technical documentation. This does not discuss precisely what the window sizes are about, but it has to do with how much memory Git uses to memory-map object data when selecting objects that might compress well against each other. There are two: one for each individual mmap on one pack file, and one for the total aggregate mmap on all pack files. Not mentioned on your link: core.deltaBaseCacheLimit, which is how much memory will be used to hold delta bases, but to understand this you need to grok delta compression and delta chains,[1] and read that same technical documentation. Note that Git will default to not attempting to delta-compress any file object whose size exceeds core.bigFileThreshold. The various pack.* controls are a bit more complex: the packing is done multi-threaded to take advantage of all your CPUs if possible, and each thread can use a lot of memory. Limiting the number of threads limits total memory use: if one thread is going to use 256 MB, 8 threads are likely to use 8*256 = 2048 MB or 2 GB. The bitmaps mainly speed up fetching from busy servers.)


[1] They're not that complicated: a delta chain occurs when one object says "take object XYZ and apply these changes", but object XYZ itself says "take object PreXYZ and apply these changes". Object PreXYZ can also take another object, and so on. The delta base is the object at the bottom of this list.
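To make the delta-chain idea concrete, here is a toy model in Python. It is a deliberate simplification: the `StoredObject` type and the "append these bytes" delta are invented for illustration, whereas real Git deltas are copy/insert instructions against the base object.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class StoredObject:
    base: Optional[str]  # object this one is a delta against, or None for a delta base
    payload: bytes       # full content for a base, bytes to append for a toy delta


def chain(store: Dict[str, StoredObject], name: str) -> List[str]:
    """Return the names from `name` down to its delta base."""
    names = []
    current: Optional[str] = name
    while current is not None:
        names.append(current)
        current = store[current].base
    return names


def resolve(store: Dict[str, StoredObject], name: str) -> bytes:
    """Rebuild an object: start from the delta base and apply each delta in turn."""
    content = b""
    for obj_name in reversed(chain(store, name)):  # base first, requested object last
        content += store[obj_name].payload         # toy delta: append bytes
    return content


store = {
    "PreXYZ": StoredObject(base=None, payload=b"v1;"),
    "XYZ": StoredObject(base="PreXYZ", payload=b"v2;"),
    "Top": StoredObject(base="XYZ", payload=b"v3;"),
}
print(chain(store, "Top"))    # names from Top down to the delta base
print(resolve(store, "Top"))  # content rebuilt through the whole chain
```

The longer the chain, the more objects must be read and patched to rebuild one file, which is why holding delta bases in a cache matters for memory and speed.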

torek
  • Thank you! That's what I was afraid of... oh well, at least it's running faster this time. That was a fascinating documentation read. I definitely can't say I grok it; I'm getting the general idea, but not enough to want to tamper with those values. But excluding the large files does the trick anyway. Followup: is there an easy way to know which binary types are compressible / low entropy? – ojchase Jun 23 '18 at 04:50
  • Best answer (not by me; see [this one](https://stackoverflow.com/a/34924195/1256452) with the equations for details) is [here](https://stackoverflow.com/q/990477/1256452). If you don't want to write your own code, well, I'm not sure if this qualifies as "easy" but in general, if you feed a low-entropy file to a compressor it should shrink a lot, and if you feed a high-entropy file to one, it should not shrink much if at all (and it may even get larger). – torek Jun 23 '18 at 04:59
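For anyone who does want to write their own code, a byte-level Shannon entropy estimate is short. This is a rough sketch, not a definitive measure: the function name `shannon_entropy` is mine, and it only looks at byte frequencies, so it can miss longer-range repetition that a real compressor would exploit.

```python
import math
import os
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Entropy in bits per byte: 0.0 for constant data, up to 8.0 for uniform bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum(-(n / total) * math.log2(n / total) for n in counts.values())


print(shannon_entropy(b"a" * 1000))          # constant data
print(shannon_entropy(bytes(range(256))))    # every byte value exactly once
print(shannon_entropy(os.urandom(1_000_000)))  # random data, a proxy for compressed files
```

Values close to 8 bits per byte are typical of compressed or encrypted data; much lower values suggest there is something left for Git to squeeze.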
1

Other suggestions for optimal packing or memory management are welcome.

Git 2.20 (Q4 2018) will have one optimization: When there are too many packfiles in a repository (which is not recommended), looking up an object in these would require consulting many pack .idx files; a new mechanism to have a single file that consolidates all of these .idx files is introduced.

See commit 6a22d52, commit e9ab2ed, commit 454ea2e, commit 0bff526, commit 29e2016, commit fe86c3b, commit c39b02a, commit 2cf489a, commit 6d68e6a (20 Aug 2018), commit ceab693 (12 Jul 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 49f210f, 17 Sep 2018)

pack-objects: consider packs in multi-pack-index

When running 'git pack-objects --local', we want to avoid packing objects that are in an alternate.
Currently, we check for these objects using the packed_git_mru list, which excludes the pack-files covered by a multi-pack-index.

There is a new setting:

core.multiPackIndex::

Use the multi-pack-index file to track multiple packfiles using a single index.

And that multi-pack index is explained here and in Documentation/technical/multi-pack-index.txt:

Multi-Pack-Index (MIDX) Design Notes

The Git object directory contains a 'pack' directory containing:

  • packfiles (with suffix ".pack") and
  • pack-indexes (with suffix ".idx").

The pack-indexes provide a way to lookup objects and navigate to their offset within the pack, but these must come in pairs with the packfiles.
This pairing depends on the file names, as the pack-index differs only in suffix with its pack-file.

While the pack-indexes provide fast lookup per packfile, this performance degrades as the number of packfiles increases, because abbreviations need to inspect every packfile and we are more likely to have a miss on our most-recently-used packfile.

For some large repositories, repacking into a single packfile is not feasible due to storage space or excessive repack times.

The multi-pack-index (MIDX for short) stores a list of objects and their offsets into multiple packfiles.
It contains:

  • A list of packfile names.
  • A sorted list of object IDs.
  • A list of metadata for the ith object ID including:
    • A value j referring to the jth packfile.
    • An offset within the jth packfile for the object.
  • If large offsets are required, we use another list of large offsets similar to version 2 pack-indexes.

Thus, we can provide O(log N) lookup time for any number of packfiles.
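As a rough mental model of that layout (not the on-disk format), the lookup can be sketched with a binary search over the sorted object IDs. The class `ToyMidx` and its field names are hypothetical, chosen only to mirror the three lists described above.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ToyMidx:
    pack_names: List[str]           # list of packfile names
    oids: List[str]                 # sorted list of object IDs
    entries: List[Tuple[int, int]]  # for the i-th OID: (pack index j, offset in pack j)

    def lookup(self, oid: str) -> Optional[Tuple[str, int]]:
        """Binary-search the sorted OIDs, then return (pack name, offset)."""
        i = bisect_left(self.oids, oid)
        if i < len(self.oids) and self.oids[i] == oid:
            j, offset = self.entries[i]
            return self.pack_names[j], offset
        return None


midx = ToyMidx(
    pack_names=["pack-a.pack", "pack-b.pack"],
    oids=["1f3a", "7c20", "e9b1"],
    entries=[(1, 12), (0, 4096), (1, 300)],
)
print(midx.lookup("7c20"))
print(midx.lookup("0000"))
```

One search over one sorted list replaces a search in every pack's .idx file, which is where the O(log N) bound comes from regardless of how many packfiles exist.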


Git 2.23 (Q3 2019) adds two commands, with "git multi-pack-index" learning the expire and repack subcommands.

See commit 3612c23 (01 Jul 2019), and commit b526d8c, commit 10bfa3f, commit d274331, commit ce1e4a1, commit 2af890b, commit 19575c7, commit d01bf2e, commit dba6175, commit cff9711, commit 81efa16, commit 8434e85 (10 Jun 2019) by Derrick Stolee (derrickstolee).
Helped-by: Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit 4308d81, 19 Jul 2019)

multi-pack-index: prepare for/implement 'expire' subcommand

The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time of the pack-files to determine tie-breakers.
It is possible to have a pack-file with no referenced objects because all objects have a duplicate in a newer pack-file.

Introduce a new 'expire' subcommand to the multi-pack-index builtin.
This subcommand will delete these unused pack-files and rewrite the multi-pack-index to no longer refer to those files.

The 'git multi-pack-index expire' subcommand:

  • looks at the existing multi-pack-index,
  • counts the number of objects referenced in each pack-file,
  • deletes the pack-files with no referenced objects, and
  • rewrites the multi-pack-index to no longer reference those packs.

Documentation:

expire:

Delete the pack-files that are tracked by the MIDX file, but have no objects referenced by the MIDX. Rewrite the MIDX file afterward to remove all references to these pack-files.
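The steps of 'expire' can be modelled on plain data structures. This sketch assumes a simplified view in which the MIDX is just a mapping from object ID to the one pack chosen for it; the function name `expire` and its return shape are illustrative, and nothing here touches a real repository.

```python
from collections import Counter
from typing import Dict, List, Tuple


def expire(pack_names: List[str], oid_to_pack: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (packs to keep, packs to delete) for a toy MIDX.

    `oid_to_pack` holds, for each object, the single pack the MIDX refers to.
    """
    referenced = Counter(oid_to_pack.values())          # objects referenced per pack
    kept = [p for p in pack_names if referenced[p] > 0]
    deleted = [p for p in pack_names if referenced[p] == 0]
    return kept, deleted


packs = ["old.pack", "mid.pack", "new.pack"]
# Every object in old.pack also exists in new.pack, and the MIDX picked the newer copy.
mapping = {"aaa": "new.pack", "bbb": "new.pack", "ccc": "mid.pack"}
print(expire(packs, mapping))
```

After the deletion, the real command rewrites the MIDX so its pack list matches the `kept` list.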

And:

multi-pack-index: prepare/implement 'repack' subcommand

In an environment where the multi-pack-index is useful, it is due to many pack-files and an inability to repack the object store into a single pack-file. However, it is likely that many of these pack-files are rather small, and could be repacked into a slightly larger pack-file without too much effort.
It may also be important to ensure the object store is highly available and the repack operation does not interrupt concurrent git commands.

Introduce a 'repack' subcommand to 'git multi-pack-index' that takes a '--batch-size' option.

The subcommand will inspect the multi-pack-index for referenced pack-files whose size is smaller than the batch size, until collecting a list of pack-files whose sizes sum to larger than the batch size.
Then, a new pack-file will be created containing the objects from those pack-files that are referenced by the multi-pack-index.

The resulting pack is likely to actually be smaller than the batch size due to compression and the fact that there may be objects in the pack-files that have duplicate copies in other pack-files.

The 'git multi-pack-index repack' command can take a batch size of zero, which creates a new pack-file containing all objects in the multi-pack-index.

Using a batch size of zero is very similar to a standard 'git repack' command, except that we do not delete the old packs and instead rely on the new multi-pack-index to prevent new processes from reading the old packs.
This does not disrupt other Git processes that are currently reading the old packs based on the old multi-pack-index.

The first 'repack' command will create one new pack-file, and an 'expire' command after that will delete the old pack-files, as they no longer contain any referenced objects in the multi-pack-index.

Documentation:

repack:

Create a new pack-file containing objects in small pack-files referenced by the multi-pack-index.
If the size given by the --batch-size=<size> argument is zero, then create a pack containing all objects referenced by the multi-pack-index.

For a non-zero batch size, select the pack-files by examining packs from oldest-to-newest, computing the "expected size" by counting the number of objects in the pack referenced by the multi-pack-index, then dividing by the total number of objects in the pack and multiplying by the pack size.

We select packs with expected size below the batch size until the set of packs have total expected size at least the batch size.

  • If the total size does not reach the batch size, then do nothing.
  • If a new pack-file is created, rewrite the multi-pack-index to reference the new pack-file.
    A later run of 'git multi-pack-index expire' will delete the pack-files that were part of this batch.
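The batch selection described above (as documented for Git 2.23) can be sketched as follows. The `PackInfo` type and the function names are hypothetical, and the arithmetic is a plain reading of the documentation rather than a copy of Git's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PackInfo:
    name: str
    size_bytes: int
    total_objects: int
    referenced_objects: int  # objects the MIDX resolves to this pack
    mtime: float             # used only to order packs oldest-to-newest


def expected_size(pack: PackInfo) -> float:
    """referenced / total objects, scaled by the pack size."""
    if pack.total_objects == 0:
        return 0.0
    return pack.referenced_objects / pack.total_objects * pack.size_bytes


def select_batch(packs: List[PackInfo], batch_size: int) -> List[str]:
    """Pick packs to combine; an empty list means 'do nothing'."""
    if batch_size == 0:
        return [p.name for p in packs]  # repack everything in the MIDX
    selected, total = [], 0.0
    for pack in sorted(packs, key=lambda p: p.mtime):  # oldest first
        size = expected_size(pack)
        if size >= batch_size:
            continue                    # leave large packs alone
        selected.append(pack.name)
        total += size
        if total >= batch_size:
            return selected
    return []                           # batch size never reached


packs = [
    PackInfo("a", 100, 10, 10, mtime=1),
    PackInfo("big", 1000, 10, 10, mtime=2),
    PackInfo("c", 200, 10, 5, mtime=3),
    PackInfo("d", 100, 10, 10, mtime=4),
]
print(select_batch(packs, 250))
print(select_batch(packs, 1000))
```

With a batch size of 250 the three small packs are combined and the large one is skipped; with 1000 the small packs never add up to the batch size, so nothing happens.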

With Git 2.25 (Q1 2020), the code to generate multi-pack index learned to show (or not to show) progress indicators.

That can be useful for large binaries.

See commit 680cba2, commit 64d80e7, commit ad60096, commit 8dc18f8, commit 840cef0, commit efbc3ae (21 Oct 2019) by William Baker (wjbaker101).
(Merged by Junio C Hamano -- gitster -- in commit 8f1119b, 10 Nov 2019)

multi-pack-index: add [--[no-]progress] option.

Signed-off-by: William Baker

Add the --[no-]progress option to git multi-pack-index.
Pass the MIDX_PROGRESS flag to the subcommand functions when progress should be displayed by multi-pack-index.

The progress feature was added to 'verify' in 144d703 ("multi-pack-index: report progress during 'verify'", 2018-09-13, Git v2.20.0-rc0 -- merge listed in batch #3) but some subcommands were not updated to display progress, and the ability to opt-out was overlooked.


Don't forget to read Documentation/technical/pack-format.txt, which includes multi-pack-index (MIDX) file format description. With Git 2.25.1 (Feb. 2020), there is a documentation fix.

See commit eb31044 (07 Feb 2020) by Johannes Berg (berghallen).
(Merged by Junio C Hamano -- gitster -- in commit 0410c2b, 12 Feb 2020)

pack-format: correct multi-pack-index description

Signed-off-by: Johannes Berg
Acked-by: Derrick Stolee

The description of the multi-pack-index contains a small bug, if all offsets are < 2^32 then there will be no LOFF chunk, not only if they're all < 2^31 (since the highest bit is only needed as the "LOFF-escape" when that's actually needed.)

Correct this, and clarify that in that case only offsets up to 2^31-1 can be stored in the OOFF chunk.

The documentation for pack-format now includes:

2: The offset within the pack.

If all offsets are less than 2^32, then the large offset chunk will not exist and offsets are stored as in IDX v1.
If there is at least one offset value larger than 2^32-1, then the large offset chunk must exist, and offsets larger than 2^31-1 must be stored in it instead.
If the large offset chunk exists and the 31st bit is on, then removing that bit reveals the row in the large offsets containing the 8-byte offset of this object.
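A small sketch of how a reader might decode that 4-byte offset field, following the corrected description. The function name `resolve_offset` and the use of `None` for "no large offset chunk" are assumptions made for the example.

```python
from typing import List, Optional

LARGE_OFFSET_FLAG = 0x80000000  # most significant bit of the 4-byte offset field


def resolve_offset(raw: int, large_offsets: Optional[List[int]]) -> int:
    """Turn the stored 4-byte value into the real pack offset.

    `large_offsets` is None when the large offset chunk does not exist;
    otherwise it is the list of 8-byte offsets from that chunk.
    """
    if large_offsets is not None and raw & LARGE_OFFSET_FLAG:
        row = raw & ~LARGE_OFFSET_FLAG & 0xFFFFFFFF  # remove the flag bit to get the row
        return large_offsets[row]
    return raw  # stored directly


# No large offset chunk: all 32 bits are the offset itself.
print(hex(resolve_offset(0x90000000, None)))
# Chunk present and flag set: the remaining bits select a row of 8-byte offsets.
print(resolve_offset(0x80000001, [2**33, 2**34]) == 2**34)
```

This is why, once the chunk exists, only offsets up to 2^31-1 fit in the 4-byte field: the top bit is reserved as the escape.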


Before Git 2.27 (Q2 2020), when fed a midx (Multi-Pack-Index) that records no objects, some codepaths tried to loop from 0 through (num_objects-1). Because unsigned integer arithmetic wraps around, that loop became a nonsense operation with out-of-bounds array accesses.

The code has been corrected to reject such an midx file.

See commit 796d61c (28 Mar 2020) by Damien Robert (damiens-robert).
(Merged by Junio C Hamano -- gitster -- in commit 8777ec1, 22 Apr 2020)

midx.c: fix an integer underflow

Signed-off-by: Damien Robert

When verifying a midx index with 0 objects, the m->num_objects - 1 underflows and wraps around to 4294967295.

Fix this both by checking that the midx contains at least one oid, and also that we don't write any midx when there is no packfiles.

Update the tests to check that git multi-pack-index write does not write an midx when there is no objects, and another to check that git multi-pack-index verify warns when it verifies an midx with no objects.
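The wrap-around is easy to reproduce by masking to 32 bits, which is how unsigned arithmetic behaves in C. The helper names below are invented for illustration; the second one mirrors the spirit of the fix (reject an empty midx) rather than the actual code in midx.c.

```python
UINT32_MASK = 0xFFFFFFFF


def last_index_unsigned(num_objects: int) -> int:
    """Mimic `num_objects - 1` evaluated in 32-bit unsigned arithmetic."""
    return (num_objects - 1) & UINT32_MASK


def checked_last_index(num_objects: int) -> int:
    """Refuse an empty index instead of wrapping around."""
    if num_objects == 0:
        raise ValueError("midx contains no objects")
    return num_objects - 1


print(last_index_unsigned(3))  # 2
print(last_index_unsigned(0))  # 4294967295
```

A loop bounded by that wrapped value would walk far past the end of an empty array, which is the out-of-bounds access the fix prevents.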


With Git 2.27 (Q2 2020), "git multi-pack-index repack" has been taught to honor some repack.* configuration variables.

See commit 3ce4ca0 (10 May 2020) by Derrick Stolee (derrickstolee).
See commit e11d86d (10 May 2020) by Son Luong Ngoc (sluongng).
(Merged by Junio C Hamano -- gitster -- in commit 6baba94, 14 May 2020)

midx: teach "git multi-pack-index repack" honor "git repack" configurations

Signed-off-by: Son Luong Ngoc

When the "repack" subcommand of "git multi-pack-index" command creates new packfile(s), it does not call the "git repack" command but instead directly calls the "git pack-objects" command, and the configuration variables meant for the "git repack" command, like "repack.useDeltaBaseOffset", are ignored.

Check the configuration variables used by "git repack" ourselves in "git multi-pack-index" and pass the corresponding options to underlying "git pack-objects".

Note that repack.writeBitmaps configuration is ignored, as the pack bitmap facility is useful only with a single packfile.

And:

multi-pack-index: respect repack.packKeptObjects=false

Reported-by: Son Luong Ngoc
Signed-off-by: Derrick Stolee

When selecting a batch of pack-files to repack in the "git multi-pack-index repack" command, Git should respect the repack.packKeptObjects config option.
When false, this option says that the pack-files with an associated ".keep" file should not be repacked.
This config value is "false" by default.

There are two cases for selecting a batch of objects.
The first is the case where the input batch-size is zero, which specifies "repack everything".
The second is with a non-zero batch size, which selects pack-files using a greedy selection criteria.
Both of these cases are updated and tested.


With Git 2.29 (Q4 2020), the "--batch-size" option of the "git multi-pack-index repack" command is now used to specify that very small packfiles are collected into one until the total size roughly exceeds it.

See commit 1eb22c7 (11 Aug 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 9e8c754, 24 Aug 2020)

multi-pack-index: repack batches below --batch-size

Signed-off-by: Derrick Stolee
Reviewed-by: Taylor Blau

The --batch-size= option of 'git multi-pack-index repack' is intended to limit the amount of work done by the repack. In the case of a large repository, this command should repack a number of small pack-files but leave the large pack-files alone. Most often, the repository has one large pack-file from a 'git clone' operation and a number of smaller pack-files from incremental 'git fetch' operations.

The issue with '--batch-size' is that it also prevents the repack from happening if the expected size of the resulting pack-file is too small.

This was intended as a way to avoid frequent churn of small pack-files, but it has mostly caused confusion when a repository is of "medium" size.
That is, not enormous like the Windows OS repository, but also not so small that this incremental repack isn't valuable.

The solution presented here is to collect pack-files for repack if their expected size is smaller than the batch-size parameter until either the total expected size exceeds the batch-size or all pack-files are considered.
If there are at least two pack-files, then these are combined to a new pack-file whose size should not be too much larger than the batch-size.

This new strategy should succeed in keeping the number of pack-files small in these "medium" size repositories. The concern about churn is likely not interesting, as the real control over that is the frequency in which the repack command is run.

git multi-pack-index now includes in its man page:

We select packs with expected size below the batch size until the set of packs have total expected size at least the batch size, or all pack-files are considered.
If only one pack-file is selected, then do nothing.
If a new pack-file is created, rewrite the multi-pack-index to reference the new pack-file.

A later run of 'git multi-pack-index expire' will delete the pack-files that were part of this batch.
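The Git 2.29 rule differs from the earlier one only in when it gives up. A minimal sketch, assuming the packs are already ordered oldest-to-newest and their expected sizes already computed; the function name and the `(name, expected_size)` tuples are illustrative.

```python
from typing import List, Tuple


def select_batch_2_29(packs: List[Tuple[str, float]], batch_size: int) -> List[str]:
    """Pick packs to combine under the Git 2.29 rule; empty list means 'do nothing'.

    `packs` is a list of (name, expected_size), ordered oldest-to-newest.
    """
    selected: List[str] = []
    total = 0.0
    for name, size in packs:
        if size >= batch_size:
            continue          # leave large packs alone
        selected.append(name)
        total += size
        if total >= batch_size:
            break             # batch is full
    # Combining a single pack with itself is pointless, so require at least two.
    return selected if len(selected) >= 2 else []


# Two small packs that never reach the batch size are still combined now.
print(select_batch_2_29([("a", 100), ("b", 100)], 1000))
# A lone small pack is left alone.
print(select_batch_2_29([("a", 100)], 1000))
```

Under the older rule the first call would have done nothing, which is the "medium" repository confusion the commit message describes.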


When a packfile is removed by "git repack", multi-pack-index gets cleared; the code was taught to do so less aggressively with Git 2.29 (Q4 2020) by first checking if the midx actually refers to a pack that no longer exists.

See commit 59552fb (28 Aug 2020), and commit e08f7bb (25 Aug 2020) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit a31677d, 09 Sep 2020)

builtin/repack.c: invalidate MIDX only when necessary

Helped-by: Derrick Stolee
Signed-off-by: Taylor Blau

In 525e18c04b ("midx: clear midx on repack", 2018-07-12, Git v2.20.0-rc0 -- merge listed in batch #1), 'git repack' learned to remove a multi-pack-index file if it added or removed a pack from the object store.

This mechanism is a little over-eager, since it is only necessary to drop a MIDX if 'git repack' removes a pack that the MIDX references.
Adding a pack outside of the MIDX does not require invalidating the MIDX, and likewise for removing a pack the MIDX does not know about.

Teach 'git repack' to check for this by loading the MIDX, and checking whether the to-be-removed pack is known to the MIDX.

A new test is added to show that the MIDX is left alone when both packs known to it are marked as .keep, but two packs unknown to it are removed and combined into one new pack.


With Git 2.32 (Q2 2021), there is an on-disk reverse-index to map the in-pack location of an object back to its object name across multiple packfiles.

See commit 3007752 (30 Mar 2021) by Jeff King (peff).
See commit 38ff7ca, commit a587b5a, commit f894081, commit b25fd24, commit 62f2c1b, commit 9f19161, commit 7240cc4, commit 9218c6a, commit 86d174b, commit cd57bc4, commit 690eb05, commit 60ca947, commit b25b727, commit cf1f538, commit f7c4d63 (30 Mar 2021) by Taylor Blau (ttaylorr).
See commit 1187556 (24 Feb 2021) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit e6b971f, 08 Apr 2021)

Documentation/technical: describe multi-pack reverse indexes

Co-authored-by: Jeff King
Signed-off-by: Jeff King
Signed-off-by: Taylor Blau

As a prerequisite to implementing multi-pack bitmaps, motivate and describe the format and ordering of the multi-pack reverse index.

technical/pack-format now includes in its man page:

multi-pack-index reverse indexes

Similar to the pack-based reverse index, the multi-pack index can also be used to generate a reverse index.

Instead of mapping between offset, pack-, and index position, this reverse index maps between an object's position within the MIDX, and that object's position within a pseudo-pack that the MIDX describes (i.e., the ith entry of the multi-pack reverse index holds the MIDX position of ith object in pseudo-pack order).

To clarify the difference between these orderings, consider a multi-pack reachability bitmap (which does not yet exist, but is what we are building towards here). Each bit needs to correspond to an object in the MIDX, and so we need an efficient mapping from bit position to MIDX position.

One solution is to let bits occupy the same position in the oid-sorted index stored by the MIDX. But because oids are effectively random, their resulting reachability bitmaps would have no locality, and thus compress poorly. (This is the reason that single-pack bitmaps use the pack ordering, and not the .idx ordering, for the same purpose.)

So we'd like to define an ordering for the whole MIDX based around pack ordering, which has far better locality (and thus compresses more efficiently). We can think of a pseudo-pack created by the concatenation of all of the packs in the MIDX. E.g., if we had a MIDX with three packs (a, b, c), with 10, 15, and 20 objects respectively, we can imagine an ordering of the objects like:

|a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|

where the ordering of the packs is defined by the MIDX's pack list, and then the ordering of objects within each pack is the same as the order in the actual packfile. Objects from the MIDX are ordered as follows to string together the pseudo-pack. Let pack(o) return the pack from which o was selected by the MIDX, and define an ordering of packs based on their numeric ID (as stored by the MIDX). Let offset(o) return the object offset of o within pack(o). Then, compare o1 and o2 as follows:

  • If one of pack(o1) and pack(o2) is preferred and the other is not, then the preferred one sorts first.

(This is a detail that allows the MIDX bitmap to determine which pack should be used by the pack-reuse mechanism, since it can ask the MIDX for the pack containing the object at bit position 0).

  • If pack(o1) ≠ pack(o2), then sort the two objects in descending order based on the pack ID.

  • Otherwise, pack(o1) = pack(o2), and the objects are sorted in pack-order (i.e., o1 sorts ahead of o2 exactly when offset(o1) < offset(o2)).

In short, a MIDX's pseudo-pack is the de-duplicated concatenation of objects in packs stored by the MIDX, laid out in pack order, and the packs arranged in MIDX order (with the preferred pack coming first).
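Here is a toy sketch of how the multi-pack reverse index could be derived from that ordering. The `Entry` type and function name are hypothetical. The sketch follows the a, b, c illustration above (preferred pack first, then lower pack ID first, then pack offset); treat the direction of the pack-ID comparison as an assumption of this example.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    oid: str
    pack_id: int
    offset: int


def reverse_index(entries: List[Entry], preferred_pack_id: int) -> List[int]:
    """The i-th result is the MIDX position of the i-th object in pseudo-pack order."""
    by_oid = sorted(entries, key=lambda e: e.oid)  # MIDX position = rank by OID

    def pseudo_pack_key(pos: int):
        e = by_oid[pos]
        # False sorts before True, so the preferred pack comes first.
        return (e.pack_id != preferred_pack_id, e.pack_id, e.offset)

    return sorted(range(len(by_oid)), key=pseudo_pack_key)


entries = [
    Entry("aa", pack_id=1, offset=50),
    Entry("bb", pack_id=0, offset=10),
    Entry("cc", pack_id=1, offset=12),
    Entry("dd", pack_id=0, offset=99),
]
print(reverse_index(entries, preferred_pack_id=1))
print(reverse_index(entries, preferred_pack_id=0))
```

Bit i of a multi-pack bitmap would then refer to the object at MIDX position `reverse_index(...)[i]`, which keeps objects from the same pack next to each other.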

VonC
  • See also https://github.com/git/git/commit/88617d11f9d2ee1ea726cef4527d676a9a46fa63 with Git 2.33: `multi-pack-index: fix potential segfault without sub-command`. – VonC Aug 08 '21 at 02:05
  • Git 2.34 fixes a performance regression from Git 2.32: https://github.com/git/git/commit/1ea5e46cb96d17c3b3927b4eff9765183cf87f8d – VonC Sep 26 '21 at 22:52
0

In addition to my previous answer, Git 2.34 (Q4 2021) adds a new feature.

Before, the reachability bitmap file could be generated only for a single pack; Git 2.34 learned to generate bitmaps for history that spans multiple packfiles.

See commit 73cd7d9, commit bfbb60d (09 Sep 2021), and commit eb6e956, commit d3f17e1 (31 Aug 2021) by Jeff King (peff).
See commit 2d59597, commit 9387fbd, commit ff1e653, commit 4b58b6f, commit e255a5e, commit c51f5a6, commit b1b82d1, commit aeb4657, commit c528e17, commit 0f533c7, commit a5f9f24, commit 711260f, commit 6b4277e, commit ed18462, commit 9bb6c2e, commit 177c0d6, commit 5d3cd09, commit f5909d3, commit 426c00e, commit 73ff4ad (31 Aug 2021), commit f57a739 (01 Sep 2021), and commit 917a54c, commit 1d7f7f2, commit 3ba3d06, commit fa95666 (24 Aug 2021) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 0649303, 20 Sep 2021)

midx: infer preferred pack when not given one

Signed-off-by: Taylor Blau

In 9218c6a ("midx: allow marking a pack as preferred", 2021-03-30, Git v2.32.0-rc0 -- merge), the multi-pack index code learned how to select a pack which all duplicate objects are selected from.
That is, if an object appears in multiple packs, select the copy in the preferred pack before breaking ties according to the other rules like pack mtime and readdir() order.

Not specifying a preferred pack can cause serious problems with multi-pack reachability bitmaps, because these bitmaps rely on having at least one pack from which all duplicates are selected.
Not having such a pack causes problems with the code in pack-objects to reuse packs verbatim (e.g., that code assumes that a delta object in a chunk of pack sent verbatim will have its base object sent from the same pack).

So why does not marking a pack preferred cause problems here? The reason is roughly as follows:

  • Ties are broken (when handling duplicate objects) by sorting according to midx_oid_compare(), which sorts objects by OID, preferred-ness, pack mtime, and finally pack ID (more on that later).
  • The pseudo pack-order (described in Documentation/technical/pack-format.txt under the section "multi-pack-index reverse indexes") is computed by midx_pack_order(), and sorts by pack ID and pack offset, with preferred packs sorting first.
  • But! Pack IDs come from incrementing the pack count in add_pack_to_midx(), which is a callback to for_each_file_in_pack_dir(), meaning that pack IDs are assigned in readdir() order.

When specifying a preferred pack, all of that works fine, because duplicate objects are correctly resolved in favor of the copy in the preferred pack, and the preferred pack sorts first in the object order.

"Sorting first" is critical, because the bitmap code relies on finding out which pack holds the first object in the MIDX's pseudo pack-order to determine which pack is preferred.

But if we didn't specify a preferred pack, and the pack which comes first in readdir() order does not also have the lowest timestamp, then it's possible that that pack (the one that sorts first in pseudo-pack order, which the bitmap code will treat as the preferred one) did not have all duplicate objects resolved in its favor, resulting in breakage.

The fix is simple: pick a (semi-arbitrary, non-empty) preferred pack when none was specified.
This forces that pack to have duplicates resolved in its favor, and (critically) to sort first in pseudo-pack order.
Unfortunately, testing this behavior portably isn't possible, since it depends on readdir() order which isn't guaranteed by POSIX.

(Note that multi-pack reachability bitmaps have yet to be implemented; so in that sense this patch is fixing a bug which does not yet exist.
But by having this patch beforehand, we can prevent the bug from ever materializing.)


Note: See also with Git 2.38 (Q3 2022) the new setting git -c push.useBitmaps=false push, to disable the use of bitmaps for git push.


With Git 2.42 (Q3 2023), git repack is more robust when one of those MIDX is corrupt.

So you would see fewer "could not open pack" error messages in that scenario.

See commit 06f3867 (07 Jun 2023) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 1d15be3, 23 Jun 2023)

pack-bitmap.c: gracefully degrade on failure to load MIDX'd pack

Signed-off-by: Taylor Blau

When opening a MIDX bitmap, the pack-bitmap machinery eagerly calls prepare_midx_pack() on each of the packs contained in the MIDX.
This is done in order to populate the array of struct packed_git *s held by the MIDX, which we need later on in load_reverse_index(), since it calls load_pack_revindex() on each of the MIDX'd packs, and requires that the caller provide a pointer to a struct packed_git.

When opening one of these packs fails, the pack-bitmap code will die() indicating that it can't open one of the packs in the MIDX.
This indicates that the MIDX is somehow broken with respect to the current state of the repository.
When this is the case, we indeed cannot make use of the MIDX bitmap to speed up reachability traversals.

However, it does not mean that we can't perform reachability traversals at all.
In other failure modes, that same function calls warning() and then returns -1, indicating to its caller (open_bitmap()) that we should either look for a pack bitmap if one is available, or perform normal object traversal without using bitmaps at all.

There is no reason why this case should cause us to die.
If we instead continued (by jumping to cleanup as this patch does) and avoid using bitmaps altogether, we may again try and query the MIDX, which will also fail.
But when trying to call fill_midx_entry() fails, it also returns a signal of its failure, and prompts the caller to try and locate the object elsewhere.

In other words, the normal object traversal machinery works fine in the presence of a corrupt MIDX, so there is no reason that the MIDX bitmap machinery should abort in that case when we could easily continue.

Note that we could in theory try again to load a MIDX bitmap after calling reprepare_packed_git().
Even though the prepare_packed_git() code is careful to avoid adding a pack that we already have, prepare_midx_pack() is not.
So if we got part of the way through calling prepare_midx_pack() on a stale MIDX, and then tried again on a fresh MIDX that contains some of the same packs, we would end up with a loop through the ->next pointer.

For now, let's do the simplest thing possible and fall back to the non-bitmap code when we detect a stale MIDX so that the complete fix as above can be implemented carefully.

VonC
0

Regarding MIDX ("Multi-Pack-Index", presented here), make sure to use Git 2.36+:

A bug that made multi-pack bitmap and the object order out-of-sync, making the .midx data corrupt, has been fixed with Git 2.36 (Q2 2022).

See commit f8b60cf, commit 7f514b7, commit a80f0f9, commit 791170f, commit f0ed59a, commit 90a8ea4, commit 09a7799, commit 95e8383, commit 61fd31a (25 Jan 2022) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit f2cb46a, 16 Feb 2022)

midx: read RIDX chunk when present

Signed-off-by: Taylor Blau
Reviewed-by: Derrick Stolee
Reviewed-by: Jonathan Tan

When a MIDX contains the new RIDX chunk, ensure that the reverse index is read from it instead of the on-disk .rev file.
Since we need to encode the object order in the MIDX itself for correctness reasons, there is no point in storing the same data again outside of the MIDX.

So, this patch stops writing separate .rev files, and reads it out of the MIDX itself.
This is possible to do with relatively little new code, since the format of the RIDX chunk is identical to the data in the .rev file.
In other words, we can implement this by pointing the revindex_data field at the reverse index chunk of the MIDX instead of the .rev file without any other changes.

Note: the RIDX chunk is described in Documentation/technical/pack-format.txt:

[Optional] Bitmap pack order (ID: {'R', 'I', 'D', 'X'})

A list of MIDX positions (one per object in the MIDX, num_objects in total, each a 4-byte unsigned integer in network byte order), sorted according to their relative bitmap/pseudo-pack positions.


"git multi-pack-index repack/expire" used to repack unreachable cruft into a new pack, which has been corrected with Git 2.39 (Q4 2022).

See commit b62ad56, commit 0a8e561, commit cb6c48c, commit d9f7721, commit 757d457, commit 2a91b35, commit 2699542 (19 Sep 2022) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit a215853, 10 Oct 2022)

midx.c: avoid cruft packs with repack --batch-size=0

Signed-off-by: Taylor Blau

The repack sub-command of the git multi-pack-index builtin creates a new pack aggregating smaller packs contained in the MIDX up to some given --batch-size.

When --batch-size=0, this instructs the MIDX builtin to repack everything contained in the MIDX into a single pack.

In similar spirit as a previous commit, it is undesirable to repack the contents of a cruft pack in this step.
Teach repack to ignore any cruft pack(s) when --batch-size=0 for the same reason(s).

VonC