In general, "git gc" may delete objects that another concurrent process is using but hasn't created a reference to.
Git 2.12 (Q1 2017) has more on this.
See commit f1350d0 (15 Nov 2016) by Matt McCutchen (mattmccutchen).
(Merged by Junio C Hamano -- gitster -- in commit 979b82f, 10 Jan 2017)
And see Jeff King's comment:
Modern versions of git do two things to help with this:

- any object which is referenced by a "recent" object (within the default 2-week prune window) is also considered recent. So if you create a new commit object that points to a tree, even before you reference the commit, that tree is protected;
- when an object write is optimized out because we already have the object, git will update the mtime on the file (loose object or packfile) to freshen it.
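The second point can be observed directly. The sketch below uses a throwaway repository (the `touch -d` backdating assumes GNU coreutils): it backdates an unreferenced blob past the 2-week window, re-writes the same content, and shows that a prune with the default gc expiry now keeps the object because its mtime was freshened.

```shell
#!/bin/sh
set -e
repo=$(mktemp -d) && cd "$repo" && git init -q

# Write an unreferenced blob and locate its loose-object file.
blob=$(echo demo | git hash-object -w --stdin)
path=.git/objects/$(printf %s "$blob" | cut -c1-2)/$(printf %s "$blob" | cut -c3-)

# Backdate it past the default 2-week prune window (GNU touch syntax).
touch -d '3 weeks ago' "$path"

# Re-writing identical content is optimized out as a storage operation,
# but git freshens the existing file's mtime instead.
echo demo | git hash-object -w --stdin >/dev/null

# A prune with gc's default expiry now keeps the object.
git prune --expire=2.weeks.ago
git cat-file -e "$blob" && echo "object kept"
```

Without the second `hash-object -w` call, the backdated blob would have been deleted by the same `git prune` invocation.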
This isn't perfect, though. You can decide to reference an existing
object just as it is being deleted. And the pruning process itself is
not atomic (and it's tricky to make it so, just because of what we're
promised by the filesystem).
If you have long-running data (like, a temporary index file that might
literally sit around for days or weeks) I think that is a potential
problem. And the solution is probably to use refs in some way to point
to your objects.
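The "use refs" suggestion can be sketched like this: park a throwaway ref on the object so that reachability, rather than mtime, protects it, and drop the ref when the long-running operation finishes (the ref name here is just an example):

```shell
#!/bin/sh
set -e
repo=$(mktemp -d) && cd "$repo" && git init -q

# An object with nothing referencing it:
blob=$(echo payload | git hash-object -w --stdin)

# Park a temporary ref on it; refs outside refs/heads may point at any object.
git update-ref refs/tmp/long-running-op "$blob"

# Even an aggressive prune keeps it, because it is now reachable.
git prune --expire=now
git cat-file -e "$blob" && echo "still there"

# Drop the guard ref once the operation is done.
git update-ref -d refs/tmp/long-running-op
```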
If you're worried about a short-term operation where somebody happens to run git-gc concurrently, I agree it's a possible problem, but I suspect it is something you can ignore in practice.
For a busy multi-user server, I recommend turning off auto-gc entirely, and repacking manually with "-k" to be on the safe side.
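On a server, those two recommendations might look like the following sketch (run inside the repository in question; the demo repo setup here is only for illustration):

```shell
#!/bin/sh
set -e
repo=$(mktemp -d) && cd "$repo" && git init -q
git config user.email you@example.com
git config user.name you
echo hello > file && git add file && git commit -qm init

# Disable automatic gc so it never races with user-facing commands...
git config gc.auto 0

# ...and repack manually; -k (--keep-unreachable) appends unreachable
# objects to the new pack instead of dropping them.
git repack -a -d -k
ls .git/objects/pack/*.pack
```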
This is why the git gc man page now includes:

On the other hand, when 'git gc' runs concurrently with another process,
there is a risk of it deleting an object that the other process is using
but hasn't created a reference to. This may just cause the other process
to fail or may corrupt the repository if the other process later adds a
reference to the deleted object.
Git has two features that significantly mitigate this problem:
1. Any object with modification time newer than the --prune date is kept, along with everything reachable from it.
2. Most operations that add an object to the database update the modification time of the object if it is already present so that #1 applies.
However, these features fall short of a complete solution, so users who
run commands concurrently have to live with some risk of corruption (which
seems to be low in practice) unless they turn off automatic garbage
collection with 'git config gc.auto 0'.
Note on that last sentence including "unless they turn off automatic garbage": Git 2.22 (Q2 2019) amends the gc documentation.
See commit 0044f77, commit daecbf2, commit 7384504, commit 22d4e3b, commit 080a448, commit 54d56f5, commit d257e0f, commit b6a8d09 (07 Apr 2019), and commit fc559fb, commit cf9cd77, commit b11e856 (22 Mar 2019) by Ævar Arnfjörð Bjarmason (avar).
(Merged by Junio C Hamano -- gitster -- in commit ac70c53, 25 Apr 2019)

gc docs: remove incorrect reference to gc.auto=0
The chance of a repository being corrupted due to a "gc" has nothing to do with whether or not that "gc" was invoked via "gc --auto", but with whether other concurrent operations are happening.
This is already noted earlier in the paragraph, so there's no reason to suggest this here. The user can infer from the rest of the documentation that "gc" will run automatically unless gc.auto=0 is set, and we shouldn't confuse the issue by implying that "gc --auto" is somehow more prone to produce corruption than a normal "gc".
Well, it is in the sense that a blocking "gc" would stop you from doing anything else in that particular terminal window, but users are likely to have another window, or to be worried about how a concurrent "gc" on a server might cause corruption.