
Is there any difference between git gc and git repack -ad; git prune?
If yes, what additional steps will be done by git gc (or vice versa)?
Which one is better to use in regard to space optimization or safety?

Enrico Campidoglio

3 Answers


Is there any difference between git gc and git repack -ad; git prune?

The difference is that by default git gc is very conservative about what housekeeping tasks are needed. For example, it won't run git repack unless the number of loose objects in the repository is above a certain threshold (configurable via the gc.auto variable). Also, git gc is going to run more tasks than just git repack and git prune.

If yes, what additional steps will be done by git gc (or vice versa)?

According to the documentation, git gc runs:

  • git-prune
  • git-reflog
  • git-repack
  • git-rerere

More specifically, by looking at the source code of gc.c (lines 338-343) [1], we can see that it invokes at most the following commands:

  • pack-refs --all --prune
  • reflog expire --all
  • repack -d -l
  • prune --expire
  • worktree prune --expire
  • rerere gc
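
As a rough sketch, the sequence above could be run by hand as follows. The expiry values are assumptions taken from the documented defaults of gc.pruneExpire (2 weeks) and gc.worktreePruneExpire (3 months). git gc decides for itself which steps to run and with which arguments, so this illustrates the list rather than reproducing git gc exactly. The throwaway repository is only there so the commands can be tried safely.

```shell
#!/bin/sh
set -eu

# Throwaway repository so the commands cannot touch real data.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "initial commit"

git pack-refs --all --prune               # move loose refs into .git/packed-refs
git reflog expire --all                   # expire old reflog entries
git repack -d -l -q                       # pack loose objects, delete redundant packs
git prune --expire 2.weeks.ago            # assumed value: default of gc.pruneExpire
git worktree prune --expire 3.months.ago  # assumed value: default of gc.worktreePruneExpire
git rerere gc                             # expire old rerere records

# Sanity check that the repository is still consistent.
git fsck --no-dangling >/dev/null
```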

Depending on the number of packs (lines 121-126), it may run repack with the -A option instead (lines 203-212):

    /*
     * If there are too many loose objects, but not too many
     * packs, we run "repack -d -l".  If there are too many packs,
     * we run "repack -A -d -l".  Otherwise we tell the caller
     * there is no need.
     */
    if (too_many_packs())
        add_repack_all_option();
    else if (!too_many_loose_objects())
        return 0;

Notice on lines 211-212 of the need_to_gc function that if there are neither too many packs nor too many loose objects, gc is not run at all.
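
The decision quoted above can be sketched as a small shell function. gc_decision is a hypothetical helper, not part of git, and it simplifies things: git estimates the loose-object count rather than counting exactly, and the special meaning of gc.auto = 0 is not modelled here. The thresholds 6700 and 50 are the documented defaults of gc.auto and gc.autoPackLimit.

```shell
#!/bin/sh
# Hypothetical helper mirroring the quoted decision; not part of git.
# Arguments: loose-object count, pack count, gc.auto, gc.autoPackLimit.
gc_decision() {
    loose=$1
    packs=$2
    auto=$3
    pack_limit=$4
    if [ "$packs" -gt "$pack_limit" ]; then
        # Too many packs: consolidate everything.
        echo "repack -A -d -l"
    elif [ "$loose" -gt "$auto" ]; then
        # Too many loose objects only: pack them incrementally.
        echo "repack -d -l"
    else
        echo "nothing to do"
    fi
}

gc_decision 100 3 6700 50     # few loose objects, few packs
gc_decision 7000 3 6700 50    # many loose objects, few packs
gc_decision 7000 60 6700 50   # many packs
```

The three calls print "nothing to do", "repack -d -l" and "repack -A -d -l" respectively.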

This is further clarified in the documentation:

Housekeeping is required if there are too many loose objects or too many packs in the repository. If the number of loose objects exceeds the value of the gc.auto configuration variable, then all loose objects are combined into a single pack using git repack -d -l. Setting the value of gc.auto to 0 disables automatic packing of loose objects.

If the number of packs exceeds the value of gc.autoPackLimit, then existing packs (except those marked with a .keep file) are consolidated into a single pack by using the -A option of git repack.
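
For illustration, both thresholds can be set per repository and compared with the current object counts. The values 1000 and 20 below are arbitrary examples, not recommendations, and the scratch repository is only there to keep the example self-contained.

```shell
#!/bin/sh
set -eu

# Scratch repository for the example.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Example values only: lower both thresholds for this repository.
git config gc.auto 1000
git config gc.autoPackLimit 20

# Read the settings back.
git config --get gc.auto
git config --get gc.autoPackLimit

# Show the current state: "count" is the number of loose objects,
# "packs" the number of pack files.
git count-objects -v
```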

As you can see, git gc strives to do the right thing based on the state of the repository.

Which one is better to use in regard to space optimization or safety?

In general it's better to run git gc --auto simply because it will do the least amount of work necessary to keep the repository in good shape – safely and without wasting too many resources.

However, keep in mind that garbage collection may already be triggered automatically after certain commands, unless this behavior is disabled by setting the gc.auto configuration variable to 0.

From the documentation:

--auto
With this option, git gc checks whether any housekeeping is required; if not, it exits without performing any work. Some git commands run git gc --auto after performing operations that could create many loose objects.
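
A small experiment shows the "exits without performing any work" behavior. This is a sketch under a few assumptions: the thresholds are pinned to their documented defaults so that user-level configuration does not interfere, and gc.autoDetach is turned off so that git gc --auto stays in the foreground.

```shell
#!/bin/sh
set -eu

# Scratch repository with a single commit, far below any threshold.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git config gc.auto 6700
git config gc.autoPackLimit 50
git -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "initial commit"

# Count loose objects before and after running git gc --auto.
before=$(git count-objects -v | awk '/^count:/ {print $2}')
git -c gc.autoDetach=false gc --auto
after=$(git count-objects -v | awk '/^count:/ {print $2}')

echo "loose objects before: $before, after: $after"
```

Both numbers should be the same, because nothing was packed.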

So for most repositories you shouldn't need to explicitly run git gc all that often, since it will already be taken care of for you.


1. As of commit a0a1831 made on 2016-08-08.

Enrico Campidoglio

git help gc contains a few hints...

The optional configuration variable gc.rerereresolved indicates how long records of conflicted merge you resolved earlier are kept.

The optional configuration variable gc.rerereunresolved indicates how long records of conflicted merge you have not resolved are kept.

I believe those rerere records are not expired if you only run git repack -ad; git prune.
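
As a sketch, the two variables can be set and the cleanup triggered by hand with git rerere gc, which is the step git gc runs for this. The values 90 and 30 (days) are arbitrary examples.

```shell
#!/bin/sh
set -eu

# Scratch repository for the example.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Example values only, expressed in days.
git config gc.rerereResolved 90
git config gc.rerereUnresolved 30

# Expire rerere records older than the configured limits.
git rerere gc
```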

AnoE

Note that, while git prune is run by git gc, the former has evolved with Git 2.22 (Q2 2019):

"git prune" has been taught to take advantage of reachability bitmap when able.

See commit cc80c95, commit c2bf473, commit fde67d6, commit d55a30b (14 Feb 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit f7213a3, 07 Mar 2019)

prune: use bitmaps for reachability traversal

Pruning generally has to traverse the whole commit graph in order to see which objects are reachable.
This is the exact problem that reachability bitmaps were meant to solve, so let's use them (if they're available, of course).


Here are timings on git.git:

Test                            HEAD^             HEAD
------------------------------------------------------------------------
5304.6: prune with bitmaps      3.65(3.56+0.09)   1.01(0.92+0.08) -72.3%

And on linux.git:

Test                            HEAD^               HEAD
--------------------------------------------------------------------------
5304.6: prune with bitmaps      35.05(34.79+0.23)   3.00(2.78+0.21) -91.4%

The tests show a pretty optimal case, as we'll have just repacked and should have pretty good coverage of all refs with our bitmaps.
But that's actually pretty realistic: normally prune is run via "gc" right after repacking.

Notes on the implementation: the change is actually in reachable.c, so it would improve reachability traversals by "reflog expire --stale-fix", as well.
Those aren't performed regularly, though (a normal "git gc" doesn't use --stale-fix), so they're not really worth measuring. There's a low chance of regressing that caller, since the use of bitmaps is totally transparent from the caller's perspective.
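
To try this locally, a bitmap index has to exist first. The sketch below assumes a non-bare scratch repository, where bitmaps are not written by default, so git repack is asked for one explicitly with -b. Whether git prune then actually takes the bitmap path is internal to git and not visible in its output.

```shell
#!/bin/sh
set -eu

# Scratch repository with one real commit.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
echo "hello" > file.txt
git add file.txt
git -c user.name=demo -c user.email=demo@example.invalid \
    commit -q -m "add file"

# -a: put everything into one pack, -d: delete redundant packs,
# -b: also write a reachability bitmap index for that pack.
git repack -a -d -b -q

# The bitmap is stored next to the pack file.
ls .git/objects/pack/*.bitmap

# prune can now use the bitmap for its reachability check.
git prune
```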

And:

See commit fe6f2b0 (18 Apr 2019) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit d1311be, 08 May 2019)

prune: lazily perform reachability traversal

The general strategy of "git prune" is to do a full reachability walk, then for each loose object see if we found it in our walk.
But if we don't have any loose objects, we don't need to do the expensive walk in the first place.

This patch postpones that walk until the first time we need to see its results.

Note that this is really a specific case of a more general optimization, which is that we could traverse only far enough to find the object under consideration (i.e., stop the traversal when we find it, then pick up again when asked about the next object, etc).
That could save us in some instances from having to do a full walk. But it's actually a bit tricky to do with our traversal code, and you'd need to do a full walk anyway if you have even a single unreachable object (which you generally do, if any objects are actually left after running git-repack).

So in practice this lazy-load of the full walk catches one easy but common case (i.e., you've just repacked via git-gc, and there's nothing unreachable).

The perf script is fairly contrived, but it does show off the improvement:

 Test                            HEAD^             HEAD
 -------------------------------------------------------------------------
 5304.4: prune with no objects   3.66(3.60+0.05)   0.00(0.00+0.00) -100.0%

and would let us know if we accidentally regress this optimization.

Note also that we need to take special care with prune_shallow(), which relies on us having performed the traversal.
So this optimization can only kick in for a non-shallow repository. Since this is easy to get wrong and is not covered by existing tests, let's add an extra test to t5304 that covers this case explicitly.

Further notes on the implementation of the bitmap traversal patch (fde67d6, "prune: use bitmaps for reachability traversal"):

  • The bitmap case could actually get away without creating a "struct object", and instead the caller could just look up each object id in the bitmap result. However, this would be a marginal improvement in runtime, and it would make the callers much more complicated.
    They'd have to handle both the bitmap and non-bitmap cases separately, and in the case of git-prune, we'd also have to tweak prune_shallow(), which relies on our SEEN flags.

  • Because we do create real object structs, we go through a few contortions to create ones of the right type.
    This isn't strictly necessary (lookup_unknown_object() would suffice), but it's more memory efficient to use the correct types, since we already know them.


When the reachability bitmap is in effect (since Git 2.22, 2019), the "do not lose recently created objects and those that are reachable from them" safety that protects us from races was disabled by mistake. That has been corrected with Git 2.32 (Q2 2021).

See commit 2ba582b, commit 1e951c6 (28 Apr 2021) by Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit 6e08cbd, 07 May 2021)

prune: save reachable-from-recent objects with bitmaps

Reported-by: David Emett
Signed-off-by: Jeff King

We pass our prune expiration to mark_reachable_objects(), which will traverse not only the reachable objects, but consider any recent ones as tips for reachability; see d3038d2 ("prune: keep objects reachable from recent objects", 2014-10-15, Git v2.2.0-rc0 -- merge) for details.

However, this interacts badly with the bitmap code path added in fde67d6 ("prune: use bitmaps for reachability traversal", 2019-02-13, Git v2.22.0-rc0 -- merge listed in batch #2).
If we hit the bitmap-optimized path, we return immediately to avoid the regular traversal, accidentally skipping the "also traverse recent" code.

Instead, we should do an if-else for the bitmap versus regular traversal, and then follow up with the "recent" traversal in either case.
This reuses the "rev_info" for a bitmap and then a regular traversal, but that should work OK (the bitmap code clears the pending array in the usual way, just like a regular traversal would).
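
The grace period for recent objects can be observed with a deliberately unreachable blob. This sketch only shows the basic --expire behavior for a recent loose object, not the reachable-from-recent traversal fixed by the commit above. The 2.weeks.ago value is an assumption matching the documented default of gc.pruneExpire.

```shell
#!/bin/sh
set -eu

# Scratch repository with one commit.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"
git -c user.name=demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "initial commit"

# Write a loose blob that no ref or commit points to.
oid=$(echo "orphaned content" | git hash-object -w --stdin)

# With a grace period, the freshly written object is kept.
git prune --expire=2.weeks.ago
if git cat-file -e "$oid" 2>/dev/null; then kept_when_recent=yes; else kept_when_recent=no; fi

# Without a grace period, the unreachable object is removed.
git prune --expire=now
if git cat-file -e "$oid" 2>/dev/null; then kept_after_now=yes; else kept_after_now=no; fi

echo "kept with 2-week grace period: $kept_when_recent"
echo "kept with --expire=now:        $kept_after_now"
```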

VonC