Other suggestions for optimal packing or memory management are welcome.
Git 2.20 (Q4 2018) introduces one optimization: when there are too many packfiles in a repository (which is not recommended), looking up an object requires consulting many pack .idx files; a new mechanism consolidates all of these .idx files into a single file.
See commit 6a22d52, commit e9ab2ed, commit 454ea2e, commit 0bff526, commit 29e2016, commit fe86c3b, commit c39b02a, commit 2cf489a, commit 6d68e6a (20 Aug 2018), commit ceab693 (12 Jul 2018) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 49f210f, 17 Sep 2018)
pack-objects: consider packs in multi-pack-index
When running 'git pack-objects --local', we want to avoid packing objects that are in an alternate.
Currently, we check for these objects using the packed_git_mru list, which excludes the pack-files covered by a multi-pack-index.
There is a new setting:
core.multiPackIndex::
Use the multi-pack-index file to track multiple packfiles using a single index.
That multi-pack-index is explained in Documentation/technical/multi-pack-index.txt:
Multi-Pack-Index (MIDX) Design Notes
The Git object directory contains a 'pack' directory containing:
- packfiles (with suffix ".pack") and
- pack-indexes (with suffix ".idx").
The pack-indexes provide a way to lookup objects and navigate to their offset within the pack, but these must come in pairs with the packfiles.
This pairing depends on the file names, as the pack-index differs only in suffix with its pack-file.
While the pack-indexes provide fast lookup per packfile, this performance degrades as the number of packfiles increases, because abbreviations need to inspect every packfile and we are more likely to have a miss on our most-recently-used packfile.
For some large repositories, repacking into a single packfile is not feasible due to storage space or excessive repack times.
The multi-pack-index (MIDX for short) stores a list of objects and their offsets into multiple packfiles.
It contains:
- A list of packfile names.
- A sorted list of object IDs.
- A list of metadata for the ith object ID including:
  - A value j referring to the jth packfile.
  - An offset within the jth packfile for the object.
- If large offsets are required, we use another list of large offsets similar to version 2 pack-indexes.
Thus, we can provide O(log N) lookup time for any number of packfiles.
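To illustrate why the lookup stays logarithmic, here is a minimal Python sketch of the structure described above. It is an in-memory stand-in, not the on-disk format: the names pack_names, oids, entries and lookup, and the shortened object IDs, are all hypothetical.

```python
from bisect import bisect_left

# Hypothetical in-memory stand-in for a MIDX; not the real file layout.
pack_names = ["pack-a.pack", "pack-b.pack"]
# Sorted object IDs (shortened hex strings for readability).
oids = ["1a2b", "3c4d", "9e8f"]
# entries[i] = (pack index j, offset inside pack j) for oids[i].
entries = [(0, 12), (1, 300), (0, 4096)]


def lookup(oid):
    """Return (pack name, offset) for oid, or None if it is not indexed."""
    i = bisect_left(oids, oid)  # one binary search, whatever the pack count
    if i < len(oids) and oids[i] == oid:
        j, offset = entries[i]
        return pack_names[j], offset
    return None
```

A single binary search over the sorted ID list replaces one search per .idx file, which is the point of the design.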
Git 2.23 (Q3 2019) adds two subcommands: "git multi-pack-index" learns expire and repack.
See commit 3612c23 (01 Jul 2019), and commit b526d8c, commit 10bfa3f, commit d274331, commit ce1e4a1, commit 2af890b, commit 19575c7, commit d01bf2e, commit dba6175, commit cff9711, commit 81efa16, commit 8434e85 (10 Jun 2019) by Derrick Stolee (derrickstolee).
Helped-by: Johannes Schindelin (dscho).
(Merged by Junio C Hamano -- gitster -- in commit 4308d81, 19 Jul 2019)
multi-pack-index: prepare for/implement 'expire' subcommand
The multi-pack-index tracks objects in a collection of pack-files.
Only one copy of each object is indexed, using the modified time of the pack-files to determine tie-breakers.
It is possible to have a pack-file with no referenced objects because all objects have a duplicate in a newer pack-file.
Introduce a new 'expire' subcommand to the multi-pack-index builtin.
This subcommand will delete these unused pack-files and rewrite the multi-pack-index to no longer refer to those files.
The 'git multi-pack-index expire' subcommand:
- looks at the existing multi-pack-index,
- counts the number of objects referenced in each pack-file,
- deletes the pack-files with no referenced objects, and
- rewrites the multi-pack-index to no longer reference those packs.
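The counting step in the list above can be sketched in Python as follows. This only models the decision of which packs are unreferenced; it deletes nothing and is not Git's implementation. The function name and the (pack index, offset) entry layout are assumptions made for the example.

```python
from collections import Counter


def packs_to_expire(pack_names, entries):
    """Return the tracked packs that no MIDX entry refers to.

    entries holds one hypothetical (pack index, offset) pair per indexed object.
    """
    referenced = Counter(j for j, _offset in entries)
    return [name for j, name in enumerate(pack_names) if referenced[j] == 0]
```

For example, if every object in an older pack also exists in a newer one and the MIDX picked the newer copies, the older pack has a count of zero and is returned as a candidate for deletion.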
Documentation:
expire:
Delete the pack-files that are tracked by the MIDX file, but have no objects referenced by the MIDX. Rewrite the MIDX file afterward to remove all references to these pack-files.
And:
multi-pack-index: prepare/implement 'repack' subcommand
In an environment where the multi-pack-index is useful, it is due to many pack-files and an inability to repack the object store into a single pack-file. However, it is likely that many of these pack-files are rather small, and could be repacked into a slightly larger pack-file without too much effort.
It may also be important to ensure the object store is highly available and the repack operation does not interrupt concurrent git commands.
Introduce a 'repack' subcommand to 'git multi-pack-index' that takes a '--batch-size' option.
The subcommand will inspect the multi-pack-index for referenced pack-files whose size is smaller than the batch size, until collecting a list of pack-files whose sizes sum to larger than the batch size.
Then, a new pack-file will be created containing the objects from those pack-files that are referenced by the multi-pack-index.
The resulting pack is likely to actually be smaller than the batch size due to
compression and the fact that there may be objects in the pack-files that have duplicate copies in other pack-files.
The 'git multi-pack-index repack' command can take a batch size of zero, which creates a new pack-file containing all objects in the multi-pack-index.
Using a batch size of zero is very similar to a standard 'git repack' command, except that we do not delete the old packs and instead rely on the new multi-pack-index to prevent new processes from reading the old packs.
This does not disrupt other Git processes that are currently reading the old packs based on the old multi-pack-index.
The first 'repack' command will create one new pack-file, and an 'expire' command after that will delete the old pack-files, as they no longer contain any referenced objects in the multi-pack-index.
Documentation:
repack:
Create a new pack-file containing objects in small pack-files referenced by the multi-pack-index.
If the size given by the --batch-size=<size> argument is zero, then create a pack containing all objects referenced by the multi-pack-index.
For a non-zero batch size, select the pack-files by examining packs from oldest-to-newest, computing the "expected size" by counting the number of objects in the pack referenced by the multi-pack-index, then divide by the total number of objects in the pack and multiply by the pack size.
We select packs with expected size below the batch size until the set of packs have total expected size at least the batch size.
- If the total size does not reach the batch size, then do nothing.
- If a new pack-file is created, rewrite the multi-pack-index to reference the new pack-file.
A later run of 'git multi-pack-index expire' will delete the pack-files that were part of this batch.
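The selection rule for a non-zero batch size, as documented for Git 2.23, can be sketched in Python. This is an illustration under stated assumptions, not Git's code: the tuple layout (name, pack size, total objects, referenced objects), the function names and the use of integer division are all choices made for the example.

```python
def expected_size(pack_size, total_objects, referenced_objects):
    """Share of the pack's size attributed to objects the MIDX references."""
    return pack_size * referenced_objects // total_objects


def select_batch(packs, batch_size):
    """Pick packs to combine, given packs ordered oldest to newest.

    Each pack is a hypothetical tuple:
    (name, pack_size, total_objects, referenced_objects).
    Returns the selected names, or [] when the batch size is not reached.
    """
    selected, total = [], 0
    for name, pack_size, total_objects, referenced in packs:
        if total >= batch_size:
            break  # the batch is already large enough
        size = expected_size(pack_size, total_objects, referenced)
        if size >= batch_size:
            continue  # large packs are left alone
        selected.append(name)
        total += size
    return selected if total >= batch_size else []
```

With packs of expected sizes 100, 1000, 30 and 200 and a batch size of 300, the sketch skips the 1000-unit pack and selects the other three; with a batch size of 500 the total of 330 falls short, so nothing is repacked.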
With Git 2.25 (Q1 2020), the code to generate multi-pack index learned to show (or not to show) progress indicators.
That can be useful for large binaries.
See commit 680cba2, commit 64d80e7, commit ad60096, commit 8dc18f8, commit 840cef0, commit efbc3ae (21 Oct 2019) by William Baker (wjbaker101).
(Merged by Junio C Hamano -- gitster -- in commit 8f1119b, 10 Nov 2019)
Signed-off-by: William Baker
Add the --[no-]progress option to git multi-pack-index.
Pass the MIDX_PROGRESS flag to the subcommand functions when progress should be displayed by multi-pack-index.
The progress feature was added to 'verify' in 144d703 ("multi-pack-index: report progress during 'verify'", 2018-09-13, Git v2.20.0-rc0 -- merge listed in batch #3) but some subcommands were not updated to display progress, and the ability to opt-out was overlooked.
Don't forget to read Documentation/technical/pack-format.txt, which includes the multi-pack-index (MIDX) file format description.
With Git 2.25.1 (Feb. 2020), there is a documentation fix.
See commit eb31044 (07 Feb 2020) by Johannes Berg (berghallen).
(Merged by Junio C Hamano -- gitster -- in commit 0410c2b, 12 Feb 2020)
pack-format: correct multi-pack-index description
Signed-off-by: Johannes Berg
Acked-by: Derrick Stolee
The description of the multi-pack-index contains a small bug: if all offsets are < 2^32 then there will be no LOFF chunk, not only if they're all < 2^31 (since the highest bit is only needed as the "LOFF-escape" when that's actually needed).
Correct this, and clarify that in that case only offsets up to 2^31-1 can be stored in the OOFF chunk.
The documentation for pack-format now includes:
2: The offset within the pack.
If all offsets are less than 2^32, then the large offset chunk will not exist and offsets are stored as in IDX v1.
If there is at least one offset value larger than 2^32-1, then the large offset chunk must exist, and offsets larger than 2^31-1 must be stored in it instead.
If the large offset chunk exists and the 31st bit is on, then removing that bit reveals the row in the large offsets containing the 8-byte offset of this object.
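The decoding rule quoted above can be sketched in Python. This is a reading of the documentation, not Git's parser: the function name is hypothetical, the large offset chunk is modelled as a plain list (None when the chunk is absent), and "the 31st bit" is taken to mean the most significant bit of the 4-byte value.

```python
LARGE_OFFSET_FLAG = 0x80000000  # most significant bit of the 4-byte value
ROW_MASK = 0x7FFFFFFF           # remaining 31 bits


def resolve_offset(raw, large_offsets):
    """Turn a 4-byte offset field into the real offset within the pack.

    large_offsets is a list of 8-byte offsets, or None when the chunk is absent.
    """
    if large_offsets is None or not raw & LARGE_OFFSET_FLAG:
        # No escape in play: the field is the offset itself.
        return raw
    # Flag set and chunk present: the low 31 bits select a row in the chunk.
    return large_offsets[raw & ROW_MASK]
```

Without a large offset chunk the full 32 bits are usable, which is why every offset below 2^32 fits; once the chunk exists, the top bit is reserved and only offsets up to 2^31-1 can be stored inline.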
Before Git 2.27 (Q2 2020), when fed a midx (Multi-Pack-Index) that records no objects, some codepaths tried to loop from 0 through (num_objects-1), which, due to integer arithmetic wrapping around, turned into a nonsensical operation with out-of-bounds array accesses.
The code has been corrected to reject such an midx file.
See commit 796d61c (28 Mar 2020) by Damien Robert (damiens-robert).
(Merged by Junio C Hamano -- gitster -- in commit 8777ec1, 22 Apr 2020)
midx.c: fix an integer underflow
Signed-off-by: Damien Robert
When verifying a midx index with 0 objects, the m->num_objects - 1 underflows and wraps around to 4294967295.
Fix this both by checking that the midx contains at least one oid, and also that we don't write any midx when there are no packfiles.
Update the tests to check that git multi-pack-index write does not write an midx when there are no objects, and another to check that git multi-pack-index verify warns when it verifies an midx with no objects.
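The wrap-around is easy to reproduce. Python integers do not overflow, so this sketch emulates unsigned 32-bit arithmetic with a mask; the helper names are hypothetical and the guard only mirrors the idea of the fix (reject an empty midx before looping), not the actual C code.

```python
UINT32_MASK = 0xFFFFFFFF


def last_index_u32(num_objects):
    """Compute num_objects - 1 the way unsigned 32-bit arithmetic would."""
    return (num_objects - 1) & UINT32_MASK


def has_objects_to_verify(num_objects):
    """Guard in the spirit of the fix: refuse to loop over an empty midx."""
    return num_objects > 0
```

Calling last_index_u32(0) yields 4294967295, the value mentioned in the commit message, so a loop bounded by it would run far past the end of an empty table.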
With Git 2.27 (Q2 2020), "git multi-pack-index repack" has been taught to honor some repack.* configuration variables.
See commit 3ce4ca0 (10 May 2020) by Derrick Stolee (derrickstolee).
See commit e11d86d (10 May 2020) by Son Luong Ngoc (sluongng).
(Merged by Junio C Hamano -- gitster -- in commit 6baba94, 14 May 2020)
midx: teach "git multi-pack-index repack" honor "git repack" configurations
Signed-off-by: Son Luong Ngoc
When the "repack" subcommand of "git multi-pack-index" command creates new packfile(s), it does not call the "git repack" command but instead directly calls the "git pack-objects" command, and the configuration variables meant for the "git repack" command, like "repack.usedeltabaseoffset", are ignored.
Check the configuration variables used by "git repack" ourselves in "git multi-pack-index" and pass the corresponding options to underlying "git pack-objects".
Note that repack.writeBitmaps configuration is ignored, as the pack bitmap facility is useful only with a single packfile.
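The idea of translating repack.* settings into pack-objects options can be sketched in Python. This is a simplified model, not the real command line Git builds: the function name and the dict-based config are hypothetical, only two settings are shown, and the defaults used here (delta base offsets on, delta islands off) are my understanding of Git's documented defaults.

```python
def pack_objects_args(config):
    """Build an illustrative 'git pack-objects' argument list from repack.* config.

    config is a hypothetical dict of lower-cased config keys to booleans.
    """
    args = ["git", "pack-objects"]
    if config.get("repack.usedeltabaseoffset", True):
        args.append("--delta-base-offset")
    if config.get("repack.usedeltaislands", False):
        args.append("--delta-islands")
    # repack.writeBitmaps is deliberately not consulted, as noted above.
    return args
```

The real invocation passes further arguments (such as the pack name prefix), which are omitted here.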
And:
multi-pack-index: respect repack.packKeptObjects=false
Reported-by: Son Luong Ngoc
Signed-off-by: Derrick Stolee
When selecting a batch of pack-files to repack in the "git multi-pack-index repack" command, Git should respect the repack.packKeptObjects config option.
When false, this option says that the pack-files with an associated ".keep" file should not be repacked.
This config value is "false" by default.
There are two cases for selecting a batch of objects.
The first is the case where the input batch-size is zero, which specifies "repack everything".
The second is with a non-zero batch size, which selects pack-files using a greedy selection criteria.
Both of these cases are updated and tested.
With Git 2.29 (Q4 2020), the "--batch-size" option of the "git multi-pack-index repack" command is now used to specify that very small packfiles are collected into one until the total size roughly exceeds it.
See commit 1eb22c7 (11 Aug 2020) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 9e8c754, 24 Aug 2020)
multi-pack-index: repack batches below --batch-size
Signed-off-by: Derrick Stolee
Reviewed-by: Taylor Blau
The --batch-size= option of 'git multi-pack-index repack' is intended to limit the amount of work done by the repack. In the case of a large repository, this command should repack a number of small pack-files but leave the large pack-files alone. Most often, the repository has one large pack-file from a 'git clone' operation and a number of smaller pack-files from incremental 'git fetch' operations.
The issue with '--batch-size' is that it also prevents the repack from happening if the expected size of the resulting pack-file is too small.
This was intended as a way to avoid frequent churn of small pack-files, but it has mostly caused confusion when a repository is of "medium" size.
That is, not enormous like the Windows OS repository, but also not so small that this incremental repack isn't valuable.
The solution presented here is to collect pack-files for repack if their expected size is smaller than the batch-size parameter until either the total expected size exceeds the batch-size or all pack-files are considered.
If there are at least two pack-files, then these are combined to a new pack-file whose size should not be too much larger than the batch-size.
This new strategy should succeed in keeping the number of pack-files small in these "medium" size repositories. The concern about churn is likely not interesting, as the real control over that is the frequency in which the repack command is run.
git multi-pack-index now includes in its man page:
We select packs with expected size below the batch size until the set of packs have total expected size at least the batch size, or all pack-files are considered.
If only one pack-file is selected, then do nothing.
If a new pack-file is created, rewrite the multi-pack-index to reference the new pack-file.
A later run of 'git multi-pack-index expire' will delete the pack-files that were part of this batch.
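The revised rule from Git 2.29 can be sketched in Python. As before, this is an illustration rather than Git's code: the tuple layout (name, pack size, total objects, referenced objects), the function name and the integer division are assumptions made for the example.

```python
def select_batch_2_29(packs, batch_size):
    """Pick packs to combine under the Git 2.29 rule.

    packs are ordered oldest to newest; each is a hypothetical tuple
    (name, pack_size, total_objects, referenced_objects).
    Returns the selected names, or [] when fewer than two packs qualify.
    """
    selected, total = [], 0
    for name, pack_size, total_objects, referenced in packs:
        if total >= batch_size:
            break  # the batch is already large enough
        size = pack_size * referenced // total_objects  # expected size
        if size >= batch_size:
            continue  # large packs are left alone
        selected.append(name)
        total += size
    # Falling short of the batch size no longer cancels the repack;
    # only a selection of fewer than two packs does.
    return selected if len(selected) >= 2 else []
```

With small packs of expected sizes 100, 30 and 200 and a batch size of 500, the earlier rule would do nothing, while this one combines all three, which is the behaviour aimed at "medium" size repositories.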
When a packfile is removed by "git repack", the multi-pack-index gets cleared; the code was taught to do so less aggressively with Git 2.29 (Q4 2020) by first checking if the midx actually refers to a pack that no longer exists.
See commit 59552fb (28 Aug 2020), and commit e08f7bb (25 Aug 2020) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit a31677d, 09 Sep 2020)
builtin/repack.c: invalidate MIDX only when necessary
Helped-by: Derrick Stolee
Signed-off-by: Taylor Blau
In 525e18c04b ("midx: clear midx on repack", 2018-07-12, Git v2.20.0-rc0 -- merge listed in batch #1), 'git repack' learned to remove a multi-pack-index file if it added or removed a pack from the object store.
This mechanism is a little over-eager, since it is only necessary to drop a MIDX if 'git repack' removes a pack that the MIDX references.
Adding a pack outside of the MIDX does not require invalidating the MIDX, and likewise for removing a pack the MIDX does not know about.
Teach 'git repack' to check for this by loading the MIDX, and checking whether the to-be-removed pack is known to the MIDX.
A new test is added to show that the MIDX is left alone when both packs known to it are marked as .keep, but two packs unknown to it are removed and combined into one new pack.
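The check described in this commit message boils down to a set membership test. A minimal Python sketch, with a hypothetical function name and plain lists of pack names standing in for the loaded MIDX:

```python
def must_drop_midx(midx_pack_names, packs_to_remove):
    """Return True only if a pack about to be removed is known to the MIDX."""
    return not set(midx_pack_names).isdisjoint(packs_to_remove)
```

In the scenario of the new test, the MIDX knows two kept packs and the repack removes two other packs, so the function returns False and the MIDX can stay.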
With Git 2.32 (Q2 2021), there is an on-disk reverse-index to map the in-pack location of an object back to its object name across multiple packfiles.
See commit 3007752 (30 Mar 2021) by Jeff King (peff).
See commit 38ff7ca, commit a587b5a, commit f894081, commit b25fd24, commit 62f2c1b, commit 9f19161, commit 7240cc4, commit 9218c6a, commit 86d174b, commit cd57bc4, commit 690eb05, commit 60ca947, commit b25b727, commit cf1f538, commit f7c4d63 (30 Mar 2021) by Taylor Blau (ttaylorr).
See commit 1187556 (24 Feb 2021) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit e6b971f, 08 Apr 2021)
Co-authored-by: Jeff King
Signed-off-by: Jeff King
Signed-off-by: Taylor Blau
As a prerequisite to implementing multi-pack bitmaps, motivate and describe the format and ordering of the multi-pack reverse index.
technical/pack-format now includes in its man page:
multi-pack-index reverse indexes
Similar to the pack-based reverse index, the multi-pack index can also
be used to generate a reverse index.
Instead of mapping between offset, pack-, and index position, this
reverse index maps between an object's position within the MIDX, and
that object's position within a pseudo-pack that the MIDX describes
(i.e., the ith entry of the multi-pack reverse index holds the MIDX
position of ith object in pseudo-pack order).
To clarify the difference between these orderings, consider a multi-pack
reachability bitmap (which does not yet exist, but is what we are
building towards here). Each bit needs to correspond to an object in the
MIDX, and so we need an efficient mapping from bit position to MIDX
position.
One solution is to let bits occupy the same position in the oid-sorted
index stored by the MIDX. But because oids are effectively random, their
resulting reachability bitmaps would have no locality, and thus compress
poorly. (This is the reason that single-pack bitmaps use the pack
ordering, and not the .idx ordering, for the same purpose.)
So we'd like to define an ordering for the whole MIDX based around
pack ordering, which has far better locality (and thus compresses more
efficiently). We can think of a pseudo-pack created by the concatenation
of all of the packs in the MIDX. E.g., if we had a MIDX with three packs
(a, b, c), with 10, 15, and 20 objects respectively, we can imagine an
ordering of the objects like:
|a,0|a,1|...|a,9|b,0|b,1|...|b,14|c,0|c,1|...|c,19|
where the ordering of the packs is defined by the MIDX's pack list,
and then the ordering of objects within each pack is the same as the
order in the actual packfile.
Objects from the MIDX are ordered as follows to string together the
pseudo-pack. Let pack(o) return the pack from which o was selected
by the MIDX, and define an ordering of packs based on their numeric ID
(as stored by the MIDX). Let offset(o) return the object offset of o
within pack(o). Then, compare o1 and o2 as follows:
- If one of pack(o1) and pack(o2) is preferred and the other
  is not, then the preferred one sorts first.
  (This is a detail that allows the MIDX bitmap to determine which
  pack should be used by the pack-reuse mechanism, since it can ask
  the MIDX for the pack containing the object at bit position 0).
- If pack(o1) ≠ pack(o2), then sort the two objects in descending
  order based on the pack ID.
- Otherwise, pack(o1) = pack(o2), and the objects are sorted in
  pack-order (i.e., o1 sorts ahead of o2 exactly when
  offset(o1) < offset(o2)).
In short, a MIDX's pseudo-pack is the de-duplicated concatenation of
objects in packs stored by the MIDX, laid out in pack order, and the
packs arranged in MIDX order (with the preferred pack coming first).
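A small Python sketch can make the pseudo-pack ordering concrete. It is not Git's implementation, and one direction is an assumption: the quoted rule says "descending order based on the pack ID", while the a, b, c diagram and the summary ("packs arranged in MIDX order") list packs in increasing order, so this sketch follows the diagram and sorts by ascending pack ID. The function name and the (MIDX position, pack ID, offset) tuples are hypothetical.

```python
def multi_pack_reverse_index(objects, preferred_pack):
    """Return MIDX positions listed in pseudo-pack order.

    objects holds hypothetical (midx_position, pack_id, offset) tuples.
    Assumption: packs follow ascending pack ID after the preferred pack.
    """
    def sort_key(obj):
        _midx_position, pack_id, offset = obj
        # False sorts before True, so the preferred pack comes first,
        # then packs by ID, then objects by offset within their pack.
        return (pack_id != preferred_pack, pack_id, offset)

    return [midx_position for midx_position, _pack, _off in sorted(objects, key=sort_key)]
```

The ith element of the result is the MIDX position of the ith object in pseudo-pack order, which is the mapping a multi-pack bitmap needs to go from a bit position back to an object.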