You might want to retry the same git repack
, this timewith Git 2.32 (Q2 2021, almost 8 years later), and its new --geometric=<n>
option:
"git repack
"(man) so far has been only capable of repacking everything under the sun into a single pack (or split by size).
A cleverer strategy to reduce the cost of repacking a repository has been introduced.
See commit 14e7b83 (19 Mar 2021), commit 2a15964, commit 13d746a, commit dab3247, commit f25e33c (05 Mar 2021), and commit 0fabafd, commit 339bce2, commit c9fff00, commit f62312e (22 Feb 2021) by Taylor Blau (ttaylorr
).
See commit ccae01c (05 Mar 2021) by Junio C Hamano (gitster
).
See commit 20b031f, commit 6325da1, commit fbf20ae, commit 60bb5f2 (22 Feb 2021) by Jeff King (peff
).
(Merged by Junio C Hamano -- gitster
-- in commit 2744383, 24 Mar 2021)
First: find_kept_pack_entry()
packfile
: introduce 'find_kept_pack_entry()
'
Co-authored-by: Jeff King
Signed-off-by: Jeff King
Signed-off-by: Taylor Blau
Reviewed-by: Jeff King
Future callers will want a function to fill a 'struct pack_entry
' for a given object id but only from its position in any kept pack(s).
In particular, an new 'git repack
'(man) mode which ensures the resulting packs form a geometric progress by object count will mark packs that it does not want to repack as "kept in-core", and it will want to halt a reachability traversal as soon as it visits an object in any of the kept packs.
But, it does not want to halt the traversal at non-kept, or .keep packs.
The obvious alternative is 'find_pack_entry()
', but this doesn't quite suffice since it only returns the first pack it finds, which may or may not be kept (and the mru cache makes it unpredictable which one you'll get if there are options).
Short of that, you could walk over all packs looking for the object in each one, but it scales with the number of packs, which may be prohibitive.
Introduce 'find_kept_pack_entry()
', a function which is like 'find_pack_entry()
', but only fills in objects in the kept packs.
Then: git pack-objects --stdin-packs
Suggested-by: Jeff King
Signed-off-by: Taylor Blau
Reviewed-by: Jeff King
In an upcoming commit, 'git repack
'(man) will want to create a pack comprised of all of the objects in some packs (the included packs) excluding any objects in some other packs (the excluded packs).
This caller could iterate those packs themselves and feed the objects it finds to 'git pack-objects
'(man) directly over stdin, but this approach has a few downsides:
- It requires every caller that wants to drive '
git pack-objects
' in this way to implement pack iteration themselves.
This forces the caller to think about details like what order objects are fed to pack-objects, which callers would likely rather not do.
- If the set of objects in included packs is large, it requires sending a lot of data over a pipe, which is inefficient.
- The caller is forced to keep track of the excluded objects, too, and make sure that it doesn't send any objects that appear in both included and excluded packs.
But the biggest downside is the lack of a reachability traversal.
Because the caller passes in a list of objects directly, those objects don't get a namehash assigned to them, which can have a negative impact on the delta selection process, causing 'git pack-objects
' to fail to find good deltas even when they exist.
The caller could formulate a reachability traversal themselves, but the only way to drive 'git pack-objects
' in this way is to do a full traversal, and then remove objects in the excluded packs after the traversal is complete.
This can be detrimental to callers who care about performance, especially in repositories with many objects.
Introduce 'git pack-objects --stdin-packs
'(man) which remedies these four concerns.
'git pack-objects --stdin-packs
' expects a list of pack names on stdin, where 'pack-xyz.pack
' denotes that pack as included, and '^pack-xyz.pack
' denotes it as excluded.
The resulting pack includes all objects that are present in at least one included pack, and aren't present in any excluded pack.
git pack-objects
now includes in its man page:
--stdin-packs
Read the basenames of packfiles (e.g., pack-1234abcd.pack
)
from the standard input, instead of object names or revision
arguments.
The resulting pack contains all objects listed in the
included packs (those not beginning with ^
), excluding any
objects listed in the excluded packs (beginning with ^
).
Incompatible with --revs
, or options that imply --revs
(such as
--all
), with the exception of --unpacked
, which is compatible.
Finally: git repack --geometric=<n>
Suggested-by: Derrick Stolee
Signed-off-by: Taylor Blau
Reviewed-by: Jeff King
Often it is useful to both:
- have relatively few packfiles in a repository, and
- avoid having so few packfiles in a repository that we repack its entire contents regularly
This patch implements a '--geometric=' option in 'git repack
'(man).
This allows the caller to specify that they would like each pack to be at least a factor times as large as the previous largest pack (by object count).
Concretely, say that a repository has 'n' packfiles, labeled P1, P2, ..., up to Pn.
Each packfile has an object count equal to 'objects(Pn)
'.
With a geometric factor of 'r
', it should be that:
objects(Pi) > r*objects(P(i-1))
for all i in [1, n]
, where the packs are sorted by
objects(P1) <= objects(P2) <= ... <= objects(Pn).
Since finding a true optimal repacking is NP-hard, we approximate it along two directions:
We assume that there is a cutoff of packs before starting the repack where everything to the right of that cut-off already forms a geometric progression (or no cutoff exists and everything must be repacked).
We assume that everything smaller than the cutoff count must be repacked.
This forms our base assumption, but it can also cause even the "heavy" packs to get repacked, for e.g., if we have 6 packs containing the following number of objects:
1, 1, 1, 2, 4, 32
then we would place the cutoff between '1, 1' and '1, 2, 4, 32', rolling up the first two packs into a pack with 2 objects.
That breaks our progression and leaves us:
2, 1, 2, 4, 32
^
(where the '^
' indicates the position of our split).
To restore a progression, we move the split forward (towards larger packs)
joining each pack into our new pack until a geometric progression
is restored.
Here, that looks like:
2, 1, 2, 4, 32 ~> 3, 2, 4, 32 ~> 5, 4, 32 ~> ... ~> 9, 32
^ ^ ^ ^
This has the advantage of not repacking the heavy-side of packs too often while also only creating one new pack at a time.
Another wrinkle is that we assume that loose, indexed, and reflog'd objects are insignificant, and lump them into any new pack that we create.
This can lead to non-idempotent results.
git repack
now includes in its man page:
-g=<factor>
--geometric=<factor>
Arrange resulting pack structure so that each successive pack
contains at least <factor>
times the number of objects as the
next-largest pack.
git repack
ensures this by determining a "cut" of packfiles that need
to be repacked into one in order to ensure a geometric progression. It
picks the smallest set of packfiles such that as many of the larger
packfiles (by count of objects contained in that pack) may be left
intact.
Unlike other repack modes, the set of objects to pack is determined
uniquely by the set of packs being "rolled-up"; in other words, the
packs determined to need to be combined in order to restore a geometric
progression.
When --unpacked
is specified, loose objects are implicitly included in
this "roll-up", without respect to their reachability. This is subject
to change in the future. This option (implying a drastically different
repack mode) is not guaranteed to work with all other combinations of
option to git repack
).
Results:
Test HEAD^ HEAD
-----------------------------------------------------------------------------------------------
5303.5: repack (1) 57.34(54.66+10.88) 56.98(54.36+10.98) -0.6%
5303.6: repack with kept (1) 57.38(54.83+10.49) 57.17(54.97+10.26) -0.4%
5303.11: repack (50) 71.70(88.99+4.74) 71.62(88.48+5.08) -0.1%
5303.12: repack with kept (50) 72.58(89.61+4.78) 71.56(88.80+4.59) -1.4%
5303.17: repack (1000) 217.19(491.72+14.25) 217.31(490.82+14.53) +0.1%
5303.18: repack with kept (1000) 246.12(520.07+14.93) 217.08(490.37+15.10) -11.8%
and the --stdin-packs
case, which scales a little bit better (although
not by that much even at 1,000 packs):
5303.7: repack with --stdin-packs (1) 0.00(0.00+0.00) 0.00(0.00+0.00) =
5303.13: repack with --stdin-packs (50) 3.43(11.75+0.24) 3.43(11.69+0.30) +0.0%
5303.19: repack with --stdin-packs (1000) 130.50(307.15+7.66) 125.13(301.36+8.04) -4.1%
See "Git 2.33: Geometric repacking" from Taylor Blau:
Historically, git repack
did one of two things:
- it either repacked all loose objects into a new pack (optionally deleting the loose copies of each of those objects),
- or it repacked all packs together into a single, new pack (optionally deleting the redundant packs).
Generally speaking, Git has better performance when there are fewer packs, since many operations scale with the number of packs in a repository.
So it’s often a good idea to pack everything together into one single pack. >
But historically speaking, busy repositories often require that all of their contents be packed together into a single, enormous pack.
That’s because reachability bitmaps, which are a critical optimization for server-side Git performance, can only describe the objects in a single pack.
So if you want to use bitmaps to effectively cover many objects in your repository, those objects have to be stored together in the same pack.
We’re working toward removing that limitation (you can read more about how we’ve done that), but one important step along the way is to implement a new repacking scheme that trades off between having a relatively small number of packs, and packing together recently added objects (in other words, approximating the new objects added since the previous repack).
To do that, Git learned a new “geometric” repacking strategy.
The idea is to determine a (small-ish) set of packs which could be combined together so that the remaining packs form a geometric progression based on object size.
In other words, if the smallest pack has N objects, then the next-largest pack would have at least 2N objects, and so on, doubling (or growing by an arbitrary constant) at each step along the way.

Input validation of "git pack-objects --stdin-packs
"(man) has been corrected with Git 2.34 (Q4 2021).
See commit 561fa03 (09 Jul 2021), and commit fb20d4b (21 Jun 2021) by Ævar Arnfjörð Bjarmason (avar
).
(Merged by Junio C Hamano -- gitster
-- in commit 5c933f0, 24 Aug 2021)
pack-objects
: fix segfault in --stdin-packs
option
Signed-off-by: Ævar Arnfjörð Bjarmason
Fix a segfault in the --stdin-packs
option added in 339bce2 ("builtin/pack-objects.c
: add '--stdin-packs' option", 2021-02-22, Git v2.32.0-rc0 -- merge listed in batch #4).
The read_packs_list_from_stdin()
function didn't check that the lines it was reading were valid packs, and thus when doing the QSORT() with pack_mtime_cmp()
we'd have a NULL
"util" field.
The "util" field is used to associate the names of included/excluded packs with the packed_git
structs they correspond to.
The logic error was in assuming that we could iterate all packs and annotate the excluded and included packs we got, as opposed to checking the lines we got on stdin.
There was a check for excluded packs, but included packs were simply assumed to be valid.
As noted in the test we'll not report the first bad line, but whatever line sorted first according to the string-list.c
API.
In this case I think that's fine.
There was further discussion of alternate approaches.
Even though we're being lazy let's assert the line we do expect to get in the test, since whoever changes this code in the future might miss this case, and would want to update the test and comments.
Another way to speed up a repack: --quiet
, which actually works with Git 2.35 (Q1 2022).
See commit 47ca93d, commit e4d0c11 (20 Dec 2021) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit 88a516a, 05 Jan 2022)
repack
: make '--quiet' disable progress
Helped-by: Jeff King
Signed-off-by: Derrick Stolee
While testing some ideas in 'git repack
'(man), I ran it with '--quiet
' and discovered that some progress output was still shown.
Specifically, the output for writing the multi-pack-index showed the progress.
The 'show_progress
' variable in cmd_repack()
is initialized with isatty(2)
and is not modified at all by the '--quiet
' flag.
The '--quiet' flag modifies the po_args
.quiet option which is translated into a '--quiet' flag for the 'git pack-objects
'(man) child process.
However, 'show_progress
' is used to directly send progress information to the multi-pack-index writing logic which does not use a child process.
The fix here is to modify 'show_progress
' to be false if po_opts.quiet
is true, and isatty(2)
otherwise.
This new expectation simplifies a later condition that checks both.
Update the documentation to make it clear that '-q' will disable all progress in addition to ensuring the 'git pack-objects
' child process will receive the flag.
git repack
now includes in its man page:
--quiet
Show no progress over the standard error stream and pass the -q
option to 'git pack-objects'. See git pack-objects
.
With Git 2.37 (Q3 2022), teach "git repack --geometric
"(man) to work better with --keep-pack
and avoid corrupting the repository when packsize
limit is used.
See commit 66731ff, commit aab7bea (20 May 2022) by Taylor Blau (ttaylorr
).
See commit 4b5a808 (20 May 2022) by Victoria Dye (vdye
).
(Merged by Junio C Hamano -- gitster
-- in commit 16a0e92, 03 Jun 2022)
repack
: respect --keep-pack with geometric repack
Co-authored-by: Taylor Blau
Signed-off-by: Victoria Dye
Signed-off-by: Taylor Blau
Update 'repack
' to ignore packs named on the command line with the '--keep-pack
' option.
Specifically, modify 'init_pack_geometry()
' to treat command line-kept packs the same way it treats packs with an on-disk '.keep
' file (that is, skip the pack and do not include it in the 'geometry' structure).
Without this handling, a '--keep-pack
' pack would be included in the 'geometry' structure.
If the pack is before the geometry split line (with at least one other pack and/or loose objects present), 'repack
' assumes the pack's contents are "rolled up" into another pack via 'pack-objects
'.
However, because the internally-invoked 'pack-objects' properly excludes '--keep-pack
' objects, any new pack it creates will not contain the kept objects.
Finally, 'repack' deletes the '--keep-pack
' as "redundant" (since it assumes 'pack-objects' created a new pack with its contents), resulting in possible object loss and repository corruption.
Before Git 2.39 (Q4 2022), when geometric repacking feature is in use together with the --pack-kept-objects
option, we lost packs marked with .keep
files.
See commit 197443e (17 Oct 2022) by Taylor Blau (ttaylorr
).
(Merged by Junio C Hamano -- gitster
-- in commit f62c546, 27 Oct 2022)
repack
: don't remove .keep
packs with --pack-kept-objects
Co-authored-by: Victoria Dye
Signed-off-by: Taylor Blau
git repack
(man) supports a --pack-kept-objects
flag which more or less translates to whether or not we pass --honor-pack-keep
down to git pack-objects
(man) when assembling a new pack.
This behavior has existed since ee34a2b ("repack
: add repack.packKeptObjects
config var", 2014-03-03, Git v2.0.0-rc0 -- merge).
In that commit, the documentation was extended to say:
[...] Note that we still do not delete `.keep` packs after
`pack-objects` finishes.
Unfortunately, this is not the case when --pack-kept-objects
is combined with a --geometric
repack.
When doing a geometric repack, we include .keep
packs when enumerating available packs only when pack_kept_objects
is set.
So this all works fine when --no-pack-kept-objects
(or similar) is given.
Kept packs are excluded from the geometric roll-up, so when we go to delete redundant packs (with -d
), no .keep
packs appear "below the split" in our geometric progression.
But when --pack-kept-objects
is given, things can go awry.
Namely, when a kept pack is included in the list of packs tracked by the pack_geometry
struct and part of the pack roll-up, we will delete the .keep
pack when we shouldn't.
Note that this doesn't result in object corruption, since the .keep
pack's objects are still present in the new pack.
But the .keep
pack itself is removed, which violates our promise from back in ee34a2b.
But there's more.
Because repack
computes the geometric roll-up independently from selecting which packs belong in a MIDX (with --write-midx
), this can lead to odd behavior.
Consider when a .keep
pack appears below the geometric split (ie., its objects will be part of the new pack we generate).
We'll write a MIDX containing the new pack along with the existing .keep
pack.
But because the .keep
pack appears below the geometric split line, we'll (incorrectly) try to remove it.
While this doesn't corrupt the repository, it does cause us to remove the MIDX we just wrote, since removing that pack would invalidate the new MIDX.
Funny enough, this behavior became far less noticeable after e4d0c11 ("repack
: respect kept objects with '--write-midx -b'", 2021-12-20, Git v2.35.0-rc0 -- merge listed in batch #7), which made pack_kept_objects
be enabled by default only when we were writing a non-MIDX bitmap.
But e4d0c11 didn't resolve this bug, it just made it harder to notice unless callers explicitly passed --pack-kept-objects
.
The solution is to avoid trying to remove .keep
packs during --geometric
repacks, even when they appear below the geometric split line, which is the approach this patch implements.
With Git 2.39 (Q4 2022), when creating a multi-pack bitmap, remove per-pack bitmap files unconditionally as they will never be consulted.
See commit 55d902c (17 Oct 2022) by Taylor Blau (ttaylorr
).
(Merged by Junio C Hamano -- gitster
-- in commit c5dd777, 28 Oct 2022)
builtin/repack.c
: remove redundant pack-based bitmaps
Signed-off-by: Taylor Blau
When we write a MIDX bitmap after repacking, it is possible that the repository would be left in a state with both pack- and multi-pack reachability bitmaps.
This can occur, for instance, if a pack that was kept (either by having a .keep
file, or during a geometric repack in which it is not rolled up) has a bitmap file, and the repack wrote a multi-pack index and bitmap.
When loading a reachability bitmap for the repository, the multi-pack one is always preferred, so the pack-based one is redundant.
Let's remove it unconditionally, even if '-d
' isn't passed, since there is no practical reason to keep both around.
The patch below does just that.