I'm experimenting with fairly aggressive auto gc in Git, mainly for packing purposes. In my repos, git config --list shows that I have set up

...
gc.auto=250
gc.autopacklimit=30
...

If I do git count-objects -v I get

count: 376
size: 1251
in-pack: 2776
packs: 1
size-pack: 2697
prune-packable: 0
garbage: 0

But git gc --auto doesn't change these figures; nothing is being packed! Shouldn't the loose objects get packed, since I'm 126 objects over the gc.auto limit?

user826840
  • Perhaps a significant portion of those loose objects are in fact dangling? Try `git gc --auto --prune=now` and/or `git fsck --full`... – twalberg May 02 '13 at 14:52
  • `git fsck --dangling` gives just 3 dangling commits. I haven't done any rebasing or anything fancy since my last full GC. I tried `--auto --prune=now`, no change – user826840 May 02 '13 at 14:56
  • I wonder how many trees and blobs are unique to those 3 commits (i.e. not referenced by any other non-dangling commit). Can't think of an easy way to figure that out, though, other than a lot of `git cat-file ...` and `git ls-tree ...` shenanigans... – twalberg May 02 '13 at 15:06
  • `fsck --dangling` gives 13 dangling blobs, 3 commits and 1 tag. My usage recently has been very linear, I can't believe that's the problem. I now have 402 loose objects – user826840 May 02 '13 at 15:59

3 Answers

One of the main points of gc --auto is that it should be very quick, so other commands can frequently call it “just in case”. To achieve that, the object count is only guessed. As git help config says under gc.auto:

When there are approximately more than this many loose objects in the repository […]

Looking at the code (too_many_loose_objects() in builtin/gc.c), here’s what happens:

  1. The value of gc.auto is divided by 256 and rounded up
  2. The folder that contains all the objects that start with 17 is opened
  3. It is checked if the folder contains more objects than the result of step 1
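The three steps above can be sketched in Python. This is only an illustration of the estimate as described here, not Git's actual C code; the function name is borrowed from the source file, while `git_dir`, the 38-character file name check (which assumes a SHA-1 repository) and the handling of a missing directory are assumptions of this sketch.

```python
import os
import re

# Inside a fan-out directory such as objects/17, a loose object's file name
# is the remaining 38 hex digits of its 40-digit SHA-1 (assumption: SHA-1 repo).
_LOOSE_NAME = re.compile(r"^[0-9a-f]{38}$")


def too_many_loose_objects(git_dir, gc_auto=6700):
    """Estimate whether the loose object count exceeds gc.auto.

    Only objects/17 is inspected; the total is never counted.
    """
    if gc_auto <= 0:
        return False  # gc.auto = 0 disables the check

    # Step 1: divide gc.auto by 256, rounding up.
    threshold = (gc_auto + 255) // 256

    # Step 2: open the directory holding the objects whose hash starts with 17.
    sample_dir = os.path.join(git_dir, "objects", "17")
    try:
        names = os.listdir(sample_dir)
    except FileNotFoundError:
        return False  # no loose object starting with 17 exists

    # Step 3: compare the number of loose objects found with the threshold.
    count = sum(1 for name in names if _LOOSE_NAME.match(name))
    return count > threshold
```

Calling `too_many_loose_objects(".git", gc_auto=250)` from the top of a working tree would then report whether at least 2 loose objects currently start with 17, regardless of how many loose objects exist overall.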

This works fine, since SHA-1 is evenly distributed, so “all the objects that start with X” is representative of the whole set. But of course this only works for a large number of objects. Too lazy to do the maths, I would guess at least 3000. With 6700 (the default value of gc.auto), this should already work quite reliably.

The core question for me is why you need such a low setting and whether it is important that this really runs at 250 objects. With a setting of 250, gc will run as soon as you have 2 loose objects that start with 17. The chance that this happens is > 80% for 600 objects and > 90% for 800 objects.
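Percentages like these can be checked with a few lines of Python. The sketch below assumes that each loose object lands in the 17 directory independently with probability 1/256, so the number of objects there is binomially distributed; the function name is made up for this example.

```python
def prob_at_least_two_in_17(loose_objects):
    """Probability that at least 2 of `loose_objects` objects start with 17.

    Assumes every object falls into objects/17 independently with
    probability 1/256 (uniformly distributed first hash byte).
    """
    p = 1 / 256
    q = 1 - p
    p_zero = q ** loose_objects  # no object starts with 17
    p_one = loose_objects * p * q ** (loose_objects - 1)  # exactly one does
    return 1 - p_zero - p_one


# Print the trigger probability for a few loose object counts,
# to compare against the percentages quoted above.
for n in (250, 400, 600, 800, 1000):
    print(n, round(prob_at_least_two_in_17(n), 3))
```

The case "at least 2" is specific to gc.auto=250, where the rounded-up threshold is 1; other settings need the full binomial sum.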

Update: Couldn’t help it – had to do the math :). I was wondering how well that estimation system would work. Here’s a plot of the results. For any given gc.auto, how high is the probability that gc will start when there are gc.auto (red) / gc.auto * 1.1 (green) / gc.auto * 1.2 (orange) / gc.auto * 1.5 (blue) / gc.auto * 2 (purple) loose objects in the repo?

Plot of the results
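The data behind such a plot can be approximated as follows. This is not the original calculation, only a sketch under the same assumption of a uniformly distributed first hash byte; `prob_gc_starts` is a hypothetical helper name, and the direct binomial sum is only meant for moderate values of gc.auto like the ones listed.

```python
from math import comb


def prob_gc_starts(gc_auto, loose_objects):
    """Probability that the objects/17 estimate triggers gc.

    gc starts when more than ceil(gc_auto / 256) of the loose objects
    start with 17, each doing so independently with probability 1/256.
    """
    threshold = (gc_auto + 255) // 256
    p = 1 / 256
    # P(X <= threshold) for X ~ Binomial(loose_objects, p)
    at_most = sum(
        comb(loose_objects, k) * p**k * (1 - p) ** (loose_objects - k)
        for k in range(min(threshold, loose_objects) + 1)
    )
    return max(0.0, 1 - at_most)  # guard against tiny negative rounding errors


# One row per gc.auto value, one column per multiple of gc.auto.
factors = (1.0, 1.1, 1.2, 1.5, 2.0)
for gc_auto in (250, 1000, 3000, 6700):
    row = [prob_gc_starts(gc_auto, int(gc_auto * f)) for f in factors]
    print(gc_auto, " ".join(f"{x:.2f}" for x in row))
```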

Chronial
  • That's great, thanks! The value 250 wasn't important, I was just trying to understand why it was not running. Could you point me to the file and line in the source where you found this? Thanks – user826840 May 02 '13 at 16:27
  • Found it, `builtin/gc.c` in the function `too_many_loose_objects()`. It does `(gc_auto_threshold + 255) / 256`, which for me would yield 1.97, rounding down to 1 if I remember how C integer division works. And it gc's when the number of loose objects in /17/ exceeds this value. With the default setting the count would need to exceed 27. Thanks again for your help. – user826840 May 02 '13 at 16:46
  • Could you tell me how you calculated this: *The chance that this happens is > 80% for 600 objects and > 90% for 800 objects.*? – codevolution Mar 21 '16 at 13:23
  • @codevolution If you assume that the first byte of sha-1 is uniformly distributed, you have a 1/256 chance that any given sha is in the folder in question. This then becomes a binomial distribution and you can use any tool that can give you that CDF. I can only get WA to plot this at the moment: http://www.wolframalpha.com/input/?i=plot+CDF+Binomial%28+n%3D600,+p%3D1%2F256%29+with+x+in+0..4 You can see in the plot that P(X≤1) = 0.1 ⇒ P(X≥2) = 0.9. – Chronial Mar 21 '16 at 20:53
  • @Chronial This type of info is enlightening. How did you collect the data for your chart? How did you create the chart? I'm not familiar with wolframalpha, other than knowing that it exists. – GaTechThomas Jun 23 '17 at 15:46
  • @GaTechThomas I used Matlab, which is a tool to do mathematical calculations. There I set up the formula for the probability and just ran that in a loop to get the value for each `gc.auto` value. I think I also created the chart in Matlab, but it's been a while ^^. – Chronial Jun 27 '17 at 13:01

Note that gc --auto is more robust as of Git 2.12.2 (released March 2017).

See commit a831c06 (10 Feb 2017) by David Turner (csusbdt).
Helped-by: Jeff King (peff).
(Merged by Junio C Hamano -- gitster -- in commit d30ec1b, 21 Mar 2017)

gc: ignore old gc.log files

A server can end up in a state where there are lots of unreferenced loose objects (say, because many users are doing a bunch of rebasing and pushing their rebased branches).
Running "git gc --auto" in this state would cause a gc.log file to be created, preventing future auto gcs, causing pack files to pile up.
Since many git operations are O(n) in the number of pack files, this would lead to poor performance.

Git should never get itself into a state where it refuses to do any maintenance, just because at some point some piece of the maintenance didn't make progress.

Teach Git to ignore gc.log files which are older than (by default) one day old, which can be tweaked via the gc.logExpiry configuration variable.
That way, these pack files will get cleaned up, if necessary, at least once per day. And operators who find a need for more-frequent gcs can adjust gc.logExpiry to meet their needs.
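The expiry rule described in this commit message can be illustrated with a small Python sketch. It only models the idea "a gc.log older than the expiry no longer blocks auto gc"; the function name, the `now` parameter and the expiry given in seconds are assumptions of this example, and how Git parses gc.logExpiry or treats the log's contents is not modeled.

```python
import os
import time

ONE_DAY = 24 * 60 * 60  # the default expiry mentioned above, in seconds


def gc_log_blocks_auto_gc(git_dir, log_expiry_seconds=ONE_DAY, now=None):
    """Return True if a recent gc.log should still prevent auto gc.

    A missing gc.log, or one older than the expiry, does not block.
    """
    log_path = os.path.join(git_dir, "gc.log")
    try:
        mtime = os.stat(log_path).st_mtime
    except FileNotFoundError:
        return False  # no log from a previous failed run
    if now is None:
        now = time.time()
    return (now - mtime) < log_expiry_seconds
```

With the default of one day, a gc.log left behind by an earlier run stops blocking after at most 24 hours, which matches the "at least once per day" behaviour described above.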


Note: since Git 2.17 (Q2 2018), git gc --auto will run on each git commit too.
See "List of all commands that cause git gc --auto".

There is also a pre-auto-gc hook associated with that command.

VonC

This helped me:

git config --global gc.auto 0

https://git-scm.com/docs/git-gc/2.6.7

Rafael Shepard