10

A while ago I posted a question asking for feedback on a plan to fix a repo that was slow because of a lot of big, binary files. That question (not a must-read for this one): Fixing up a git repo that is slowed because of big binary files

I followed through with my plan, and experienced unexpected side results.

A fresh clone of our repo originally took 2-3 hours. Figured out that the server started swapping, and after doing git config pack.windowMemory 100m && git config pack.packSizeLimit 200m, the clone time sank to ~15 minutes.

I figured I'd still do the rest of my plan, so I disabled delta compresson for the binary types we have, and ran git repack -a -d -F on the repo.

After this, a fresh clone of the repo takes ~20 minutes, so it actually got worse. But the real problem is that every time someone that had already cloned the repo tries to push commits they get "Auto packing the repository for optimum performance.".

Any ideas on what might have gone wrong, and how it can/should be fixed?

Community
  • 1
  • 1
anr78
  • 1,308
  • 2
  • 17
  • 31
  • I only read this now, but: *disabled delta compresson for the binary types we have, and ran git repack -a -d -F on the repo*, don't you mean `-f` rather than `-F`? – clacke Aug 18 '17 at 02:20

3 Answers3

5

Probably the size of your repo and your low value for pack.packSizeLimit makes the number of packs always be above gc.autopacklimit. So increase either of them to make sure the gc doesn't run each commit.

I'm not sure in what ways packSizeLimit would affect memory, but I don't believe it would have any significant effect. Please correct me if your experiments show otherwise. The parameters that directly affect memory use are pack.windowMemory and pack.deltaCacheSize.

clacke
  • 7,688
  • 6
  • 46
  • 48
  • Figured this out some time ago but forgot this post. Your answer is correct. It repacked as best it could, and always ended up with too many packs. Will try to remove the packSizeLimit and check memory usage. I suspect you're right :) – anr78 Nov 16 '13 at 08:13
2

You might want to retry the same git repack, this timewith Git 2.32 (Q2 2021, almost 8 years later), and its new --geometric=<n> option:

"git repack"(man) so far has been only capable of repacking everything under the sun into a single pack (or split by size).

A cleverer strategy to reduce the cost of repacking a repository has been introduced.

See commit 14e7b83 (19 Mar 2021), commit 2a15964, commit 13d746a, commit dab3247, commit f25e33c (05 Mar 2021), and commit 0fabafd, commit 339bce2, commit c9fff00, commit f62312e (22 Feb 2021) by Taylor Blau (ttaylorr).
See commit ccae01c (05 Mar 2021) by Junio C Hamano (gitster).
See commit 20b031f, commit 6325da1, commit fbf20ae, commit 60bb5f2 (22 Feb 2021) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 2744383, 24 Mar 2021)


First: find_kept_pack_entry()

packfile: introduce 'find_kept_pack_entry()'

Co-authored-by: Jeff King
Signed-off-by: Jeff King
Signed-off-by: Taylor Blau
Reviewed-by: Jeff King

Future callers will want a function to fill a 'struct pack_entry' for a given object id but only from its position in any kept pack(s).

In particular, an new 'git repack'(man) mode which ensures the resulting packs form a geometric progress by object count will mark packs that it does not want to repack as "kept in-core", and it will want to halt a reachability traversal as soon as it visits an object in any of the kept packs.
But, it does not want to halt the traversal at non-kept, or .keep packs.

The obvious alternative is 'find_pack_entry()', but this doesn't quite suffice since it only returns the first pack it finds, which may or may not be kept (and the mru cache makes it unpredictable which one you'll get if there are options).

Short of that, you could walk over all packs looking for the object in each one, but it scales with the number of packs, which may be prohibitive.

Introduce 'find_kept_pack_entry()', a function which is like 'find_pack_entry()', but only fills in objects in the kept packs.


Then: git pack-objects --stdin-packs

builtin/pack-objects.c: add '--stdin-packs' option

Suggested-by: Jeff King
Signed-off-by: Taylor Blau
Reviewed-by: Jeff King

In an upcoming commit, 'git repack'(man) will want to create a pack comprised of all of the objects in some packs (the included packs) excluding any objects in some other packs (the excluded packs).

This caller could iterate those packs themselves and feed the objects it finds to 'git pack-objects'(man) directly over stdin, but this approach has a few downsides:

  • It requires every caller that wants to drive 'git pack-objects' in this way to implement pack iteration themselves.
    This forces the caller to think about details like what order objects are fed to pack-objects, which callers would likely rather not do.
  • If the set of objects in included packs is large, it requires sending a lot of data over a pipe, which is inefficient.
  • The caller is forced to keep track of the excluded objects, too, and make sure that it doesn't send any objects that appear in both included and excluded packs.

But the biggest downside is the lack of a reachability traversal.
Because the caller passes in a list of objects directly, those objects don't get a namehash assigned to them, which can have a negative impact on the delta selection process, causing 'git pack-objects' to fail to find good deltas even when they exist.

The caller could formulate a reachability traversal themselves, but the only way to drive 'git pack-objects' in this way is to do a full traversal, and then remove objects in the excluded packs after the traversal is complete.
This can be detrimental to callers who care about performance, especially in repositories with many objects.

Introduce 'git pack-objects --stdin-packs'(man) which remedies these four concerns.

'git pack-objects --stdin-packs' expects a list of pack names on stdin, where 'pack-xyz.pack' denotes that pack as included, and '^pack-xyz.pack' denotes it as excluded.
The resulting pack includes all objects that are present in at least one included pack, and aren't present in any excluded pack.

git pack-objects now includes in its man page:

--stdin-packs

Read the basenames of packfiles (e.g., pack-1234abcd.pack) from the standard input, instead of object names or revision arguments.

The resulting pack contains all objects listed in the included packs (those not beginning with ^), excluding any objects listed in the excluded packs (beginning with ^).

Incompatible with --revs, or options that imply --revs (such as --all), with the exception of --unpacked, which is compatible.


Finally: git repack --geometric=<n>

builtin/repack.c: add '--geometric' option

Suggested-by: Derrick Stolee
Signed-off-by: Taylor Blau
Reviewed-by: Jeff King

Often it is useful to both:

  • have relatively few packfiles in a repository, and
  • avoid having so few packfiles in a repository that we repack its entire contents regularly

This patch implements a '--geometric=' option in 'git repack'(man).
This allows the caller to specify that they would like each pack to be at least a factor times as large as the previous largest pack (by object count).

Concretely, say that a repository has 'n' packfiles, labeled P1, P2, ..., up to Pn.
Each packfile has an object count equal to 'objects(Pn)'.
With a geometric factor of 'r', it should be that:

objects(Pi) > r*objects(P(i-1))

for all i in [1, n], where the packs are sorted by

objects(P1) <= objects(P2) <= ... <= objects(Pn).

Since finding a true optimal repacking is NP-hard, we approximate it along two directions:

  1. We assume that there is a cutoff of packs before starting the repack where everything to the right of that cut-off already forms a geometric progression (or no cutoff exists and everything must be repacked).

  2. We assume that everything smaller than the cutoff count must be repacked.
    This forms our base assumption, but it can also cause even the "heavy" packs to get repacked, for e.g., if we have 6 packs containing the following number of objects:

    1, 1, 1, 2, 4, 32

then we would place the cutoff between '1, 1' and '1, 2, 4, 32', rolling up the first two packs into a pack with 2 objects.
That breaks our progression and leaves us:

2, 1, 2, 4, 32
  ^

(where the '^' indicates the position of our split).
To restore a progression, we move the split forward (towards larger packs) joining each pack into our new pack until a geometric progression is restored.
Here, that looks like:

2, 1, 2, 4, 32  ~>  3, 2, 4, 32  ~>  5, 4, 32  ~> ... ~> 9, 32
  ^                   ^                ^                   ^

This has the advantage of not repacking the heavy-side of packs too often while also only creating one new pack at a time.
Another wrinkle is that we assume that loose, indexed, and reflog'd objects are insignificant, and lump them into any new pack that we create.
This can lead to non-idempotent results.

git repack now includes in its man page:

-g=<factor>

--geometric=<factor>

Arrange resulting pack structure so that each successive pack contains at least <factor> times the number of objects as the next-largest pack.

git repack ensures this by determining a "cut" of packfiles that need to be repacked into one in order to ensure a geometric progression. It picks the smallest set of packfiles such that as many of the larger packfiles (by count of objects contained in that pack) may be left intact.

Unlike other repack modes, the set of objects to pack is determined uniquely by the set of packs being "rolled-up"; in other words, the packs determined to need to be combined in order to restore a geometric progression.

When --unpacked is specified, loose objects are implicitly included in this "roll-up", without respect to their reachability. This is subject to change in the future. This option (implying a drastically different repack mode) is not guaranteed to work with all other combinations of option to git repack).


Results:

  Test                                        HEAD^                   HEAD
  -----------------------------------------------------------------------------------------------
  5303.5: repack (1)                          57.34(54.66+10.88)      56.98(54.36+10.98) -0.6%
  5303.6: repack with kept (1)                57.38(54.83+10.49)      57.17(54.97+10.26) -0.4%
  5303.11: repack (50)                        71.70(88.99+4.74)       71.62(88.48+5.08) -0.1%
  5303.12: repack with kept (50)              72.58(89.61+4.78)       71.56(88.80+4.59) -1.4%
  5303.17: repack (1000)                      217.19(491.72+14.25)    217.31(490.82+14.53) +0.1%
  5303.18: repack with kept (1000)            246.12(520.07+14.93)    217.08(490.37+15.10) -11.8%

and the --stdin-packs case, which scales a little bit better (although not by that much even at 1,000 packs):

  5303.7: repack with --stdin-packs (1)       0.00(0.00+0.00)         0.00(0.00+0.00) =
  5303.13: repack with --stdin-packs (50)     3.43(11.75+0.24)        3.43(11.69+0.30) +0.0%
  5303.19: repack with --stdin-packs (1000)   130.50(307.15+7.66)     125.13(301.36+8.04) -4.1%

See "Git 2.33: Geometric repacking" from Taylor Blau:

Historically, git repack did one of two things:

  • it either repacked all loose objects into a new pack (optionally deleting the loose copies of each of those objects),
  • or it repacked all packs together into a single, new pack (optionally deleting the redundant packs).

Generally speaking, Git has better performance when there are fewer packs, since many operations scale with the number of packs in a repository.
So it’s often a good idea to pack everything together into one single pack. > But historically speaking, busy repositories often require that all of their contents be packed together into a single, enormous pack.
That’s because reachability bitmaps, which are a critical optimization for server-side Git performance, can only describe the objects in a single pack.
So if you want to use bitmaps to effectively cover many objects in your repository, those objects have to be stored together in the same pack.

We’re working toward removing that limitation (you can read more about how we’ve done that), but one important step along the way is to implement a new repacking scheme that trades off between having a relatively small number of packs, and packing together recently added objects (in other words, approximating the new objects added since the previous repack).

To do that, Git learned a new “geometric” repacking strategy.

The idea is to determine a (small-ish) set of packs which could be combined together so that the remaining packs form a geometric progression based on object size.
In other words, if the smallest pack has N objects, then the next-largest pack would have at least 2N objects, and so on, doubling (or growing by an arbitrary constant) at each step along the way.

https://github.blog/wp-content/uploads/2021/04/Scaling-monorepo-maintenance-fig-12.png?w=1958


Input validation of "git pack-objects --stdin-packs"(man) has been corrected with Git 2.34 (Q4 2021).

See commit 561fa03 (09 Jul 2021), and commit fb20d4b (21 Jun 2021) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit 5c933f0, 24 Aug 2021)

pack-objects: fix segfault in --stdin-packs option

Signed-off-by: Ævar Arnfjörð Bjarmason

Fix a segfault in the --stdin-packs option added in 339bce2 ("builtin/pack-objects.c: add '--stdin-packs' option", 2021-02-22, Git v2.32.0-rc0 -- merge listed in batch #4).

The read_packs_list_from_stdin() function didn't check that the lines it was reading were valid packs, and thus when doing the QSORT() with pack_mtime_cmp() we'd have a NULL "util" field.
The "util" field is used to associate the names of included/excluded packs with the packed_git structs they correspond to.

The logic error was in assuming that we could iterate all packs and annotate the excluded and included packs we got, as opposed to checking the lines we got on stdin.
There was a check for excluded packs, but included packs were simply assumed to be valid.

As noted in the test we'll not report the first bad line, but whatever line sorted first according to the string-list.c API.
In this case I think that's fine.
There was further discussion of alternate approaches.

Even though we're being lazy let's assert the line we do expect to get in the test, since whoever changes this code in the future might miss this case, and would want to update the test and comments.


Another way to speed up a repack: --quiet, which actually works with Git 2.35 (Q1 2022).

See commit 47ca93d, commit e4d0c11 (20 Dec 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 88a516a, 05 Jan 2022)

repack: make '--quiet' disable progress

Helped-by: Jeff King
Signed-off-by: Derrick Stolee

While testing some ideas in 'git repack'(man), I ran it with '--quiet' and discovered that some progress output was still shown.
Specifically, the output for writing the multi-pack-index showed the progress.

The 'show_progress' variable in cmd_repack() is initialized with isatty(2) and is not modified at all by the '--quiet' flag.
The '--quiet' flag modifies the po_args.quiet option which is translated into a '--quiet' flag for the 'git pack-objects'(man) child process.
However, 'show_progress' is used to directly send progress information to the multi-pack-index writing logic which does not use a child process.

The fix here is to modify 'show_progress' to be false if po_opts.quiet is true, and isatty(2) otherwise.
This new expectation simplifies a later condition that checks both.

Update the documentation to make it clear that '-q' will disable all progress in addition to ensuring the 'git pack-objects' child process will receive the flag.

git repack now includes in its man page:

--quiet

Show no progress over the standard error stream and pass the -q option to 'git pack-objects'. See git pack-objects.


With Git 2.37 (Q3 2022), teach "git repack --geometric"(man) to work better with --keep-pack and avoid corrupting the repository when packsize limit is used.

See commit 66731ff, commit aab7bea (20 May 2022) by Taylor Blau (ttaylorr).
See commit 4b5a808 (20 May 2022) by Victoria Dye (vdye).
(Merged by Junio C Hamano -- gitster -- in commit 16a0e92, 03 Jun 2022)

repack: respect --keep-pack with geometric repack

Co-authored-by: Taylor Blau
Signed-off-by: Victoria Dye
Signed-off-by: Taylor Blau

Update 'repack' to ignore packs named on the command line with the '--keep-pack' option.
Specifically, modify 'init_pack_geometry()' to treat command line-kept packs the same way it treats packs with an on-disk '.keep' file (that is, skip the pack and do not include it in the 'geometry' structure).

Without this handling, a '--keep-pack' pack would be included in the 'geometry' structure.
If the pack is before the geometry split line (with at least one other pack and/or loose objects present), 'repack' assumes the pack's contents are "rolled up" into another pack via 'pack-objects'.
However, because the internally-invoked 'pack-objects' properly excludes '--keep-pack' objects, any new pack it creates will not contain the kept objects.
Finally, 'repack' deletes the '--keep-pack' as "redundant" (since it assumes 'pack-objects' created a new pack with its contents), resulting in possible object loss and repository corruption.


Before Git 2.39 (Q4 2022), when geometric repacking feature is in use together with the --pack-kept-objects option, we lost packs marked with .keep files.

See commit 197443e (17 Oct 2022) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit f62c546, 27 Oct 2022)

repack: don't remove .keep packs with --pack-kept-objects

Co-authored-by: Victoria Dye
Signed-off-by: Taylor Blau

git repack(man) supports a --pack-kept-objects flag which more or less translates to whether or not we pass --honor-pack-keep down to git pack-objects(man) when assembling a new pack.

This behavior has existed since ee34a2b ("repack: add repack.packKeptObjects config var", 2014-03-03, Git v2.0.0-rc0 -- merge).
In that commit, the documentation was extended to say:

[...] Note that we still do not delete `.keep` packs after
`pack-objects` finishes.

Unfortunately, this is not the case when --pack-kept-objects is combined with a --geometric repack.
When doing a geometric repack, we include .keep packs when enumerating available packs only when pack_kept_objects is set.

So this all works fine when --no-pack-kept-objects (or similar) is given.
Kept packs are excluded from the geometric roll-up, so when we go to delete redundant packs (with -d), no .keep packs appear "below the split" in our geometric progression.

But when --pack-kept-objects is given, things can go awry.
Namely, when a kept pack is included in the list of packs tracked by the pack_geometry struct and part of the pack roll-up, we will delete the .keep pack when we shouldn't.

Note that this doesn't result in object corruption, since the .keep pack's objects are still present in the new pack.
But the .keep pack itself is removed, which violates our promise from back in ee34a2b.

But there's more.
Because repack computes the geometric roll-up independently from selecting which packs belong in a MIDX (with --write-midx), this can lead to odd behavior.
Consider when a .keep pack appears below the geometric split (ie., its objects will be part of the new pack we generate).

We'll write a MIDX containing the new pack along with the existing .keep pack.
But because the .keep pack appears below the geometric split line, we'll (incorrectly) try to remove it.
While this doesn't corrupt the repository, it does cause us to remove the MIDX we just wrote, since removing that pack would invalidate the new MIDX.

Funny enough, this behavior became far less noticeable after e4d0c11 ("repack: respect kept objects with '--write-midx -b'", 2021-12-20, Git v2.35.0-rc0 -- merge listed in batch #7), which made pack_kept_objects be enabled by default only when we were writing a non-MIDX bitmap.

But e4d0c11 didn't resolve this bug, it just made it harder to notice unless callers explicitly passed --pack-kept-objects.

The solution is to avoid trying to remove .keep packs during --geometric repacks, even when they appear below the geometric split line, which is the approach this patch implements.


With Git 2.39 (Q4 2022), when creating a multi-pack bitmap, remove per-pack bitmap files unconditionally as they will never be consulted.

See commit 55d902c (17 Oct 2022) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit c5dd777, 28 Oct 2022)

builtin/repack.c: remove redundant pack-based bitmaps

Signed-off-by: Taylor Blau

When we write a MIDX bitmap after repacking, it is possible that the repository would be left in a state with both pack- and multi-pack reachability bitmaps.

This can occur, for instance, if a pack that was kept (either by having a .keep file, or during a geometric repack in which it is not rolled up) has a bitmap file, and the repack wrote a multi-pack index and bitmap.

When loading a reachability bitmap for the repository, the multi-pack one is always preferred, so the pack-based one is redundant.
Let's remove it unconditionally, even if '-d' isn't passed, since there is no practical reason to keep both around.
The patch below does just that.

VonC
  • 1,262,500
  • 529
  • 4,410
  • 5,250
0

What is being repacked? The local clone, the remote? Remember that local and remote are separate, it probably means that you are getting unpacked stuff that then is being packed locally... need to replicate the remote's cofiguration locally (and try again, i might be completely off base).

vonbrand
  • 11,412
  • 8
  • 32
  • 52
  • If you get *Auto packing the repository for optimum performance* when you push, that's the remote repo. – clacke Aug 18 '17 at 02:31