
As I understand it, Git garbage-collects any object that is no longer reachable from a ref. What is the best way to prevent collection of objects that we want to persist in the database?

A use case: when someone changes a pull request (for example after a code review), the previous commits become detached. They will never be merged into the repository, but they should always remain available so that the changes in the pull request can be traced.

Example:

  • CommitA fixes a bug
  • Create a pull-request for it
  • Somebody reviews and suggests a change, linking to a specific line in the code
  • Change code, amend CommitA and re-commit as CommitA2

Now CommitA2 is what will be in the change history, but the pull-request will still have a link pointing to the old CommitA. Years from now, we want to be able to see what the pull-request was about and what its comments were referring to.

How does one prevent the commit from being collected by GC?

Giving it a tag is the first solution that comes to mind.

Kamafeather
  • I'm pretty sure you want to keep the commit reachable. Give it, or any of its descendants, a ref. I litter my local repos with non-shared branch names to keep commits around. – evolutionxbox Oct 08 '18 at 14:16
  • Yes, using tags is the best solution: if the goal is to archive commits, you don't want a branch ref that could be updated by mistake. – Philippe Oct 08 '18 at 14:33

2 Answers

With Git 2.42 (Q3 2023), "git pack-objects" learned to invoke a new hook program that enumerates extra objects to be used as anchoring points to keep otherwise unreachable objects in cruft packs.

In other words, you can list the objects you do not want to lose after a git gc:

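# Record the commit the pull-request ref points at, then delete the ref
pr137=$(git rev-parse refs/pull/137/head)
git update-ref -d refs/pull/137/head

# The hook is a program: it must print one object ID per line
printf '#!/bin/sh\necho %s\n' "$pr137" > ./precious-objects
chmod +x ./precious-objects

git config gc.recentObjectsHook ./precious-objects
git prune --expire=now
git show -p "$pr137"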

See commit 4dc16e2, commit 01e9ca4 (07 Jun 2023) by Taylor Blau (ttaylorr).
(Merged by Junio C Hamano -- gitster -- in commit 58ecb2e, 23 Jun 2023)

gc: introduce gc.recentObjectsHook

Helped-by: Jeff King
Signed-off-by: Taylor Blau

This patch introduces a new multi-valued configuration option, gc.recentObjectsHook as a means to mark certain objects as recent (and thus exempt from garbage collection), regardless of their age.

When performing a garbage collection operation on a repository with unreachable objects, Git makes its decision on what to do with those object(s) based on how recent the objects are or not.
Generally speaking, unreachable-but-recent objects stay in the repository, and older objects are discarded.

However, we have no convenient way to keep certain precious, unreachable objects around in the repository, even if they have aged out and would be pruned.
Our options today consist of:

  • Point references at the reachability tips of any objects you consider precious, which may be undesirable or infeasible if there are many such objects.

  • Track them via the reflog, which may be undesirable since the reflog's lifetime is limited to that of the reference it's tracking (and callers may want to keep those unreachable objects around for longer).

  • Extend the grace period, which may keep around other objects that the caller does want to discard.

  • Manually modify the mtimes of objects you want to keep.
    If those objects are already loose, this is easy enough to do (you can just enumerate and touch -m each one).

    But if they are packed, you will either end up modifying the mtimes of all objects in that pack, or be forced to write out a loose copy of that object, both of which may be undesirable. Even worse, if they are in a cruft pack, that requires modifying its *.mtimes file by hand, since there is no exposed plumbing for this.

  • Force the caller to construct the pack of objects they want to keep themselves, and then mark the pack as kept by adding a ".keep" file.
    This works, but is burdensome for the caller, and having extra packs is awkward as you roll forward your cruft pack.

This patch introduces a new option to the above list via the gc.recentObjectsHook configuration, which allows the caller to specify a program (or set of programs) whose output is treated as a set of objects to treat as recent, regardless of their true age.

The implementation is straightforward.
Git enumerates recent objects via add_unseen_recent_objects_to_traversal(), which enumerates loose and packed objects, and eventually calls add_recent_object() on any objects for which want_recent_object()'s conditions are met.

This patch modifies the recency condition from simply "is the mtime of this object more recent than the cutoff?" to "[...] or, is this object mentioned by at least one gc.recentObjectsHook?".

Depending on whether or not we are generating a cruft pack, this allows the caller to do one of two things:

  • If generating a cruft pack, the caller is able to retain additional objects via the cruft pack, even if they would have otherwise been pruned due to their age.
  • If not generating a cruft pack, the caller is likewise able to retain additional objects as loose.

A potential alternative here is to introduce a new mode to alter the contents of the reachable pack instead of the cruft one.
One could imagine a new option to pack-objects, say --extra-reachable-tips that does the same thing as above, adding the visited set of objects along the traversal to the pack.

But this has the unfortunate side-effect of altering the reachability closure of that pack.
If parts of the unreachable object graph mentioned by one or more of the "extra reachable tips" programs is not closed, then the resulting pack won't be either.
This makes it impossible in the general case to write out reachability bitmaps for that pack, since closure is a requirement there.

Instead, keep these unreachable objects in the cruft pack (or set of unreachable, loose objects) instead, to ensure that we can continue to have a pack containing just reachable objects, which is always safe to write a bitmap over.

git config now includes in its man page:

gc.recentObjectsHook

When considering whether or not to remove an object (either when generating a cruft pack or storing unreachable objects as loose), use the shell to execute the specified command(s).

Interpret their output as object IDs which Git will consider as "recent", regardless of their age. By treating their mtimes as "now", any objects (and their descendants) mentioned in the output will be kept regardless of their true age.

Output must contain exactly one hex object ID per line, and nothing else. Objects which cannot be found in the repository are ignored. Multiple hooks are supported, but all must exit successfully, else the operation (either generating a cruft pack or unpacking unreachable objects) will be halted.
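As a rough illustration of the output format described above, a hook could print the IDs stored in a plain-text list. This is only a sketch: the file name `.git/precious-objects.txt`, the `PRECIOUS_LIST` variable and the `filter_object_ids` helper are assumptions made for the example, not part of Git.

```shell
#!/bin/sh
# Hypothetical hook for gc.recentObjectsHook: prints the object IDs listed in
# a plain-text file, one per line. The file name below is an assumption.

# Keep only lines that are a full SHA-1 (40) or SHA-256 (64) hex object ID.
# Comments, blank lines and anything else are dropped, and "|| true" keeps the
# exit status at 0 even when nothing matches, since every hook must succeed.
filter_object_ids() {
    grep -E '^([0-9a-f]{40}|[0-9a-f]{64})$' || true
}

# PRECIOUS_LIST can be overridden from the environment; the default is illustrative.
list=${PRECIOUS_LIST:-.git/precious-objects.txt}

if [ -f "$list" ]; then
    filter_object_ids < "$list"
fi
```

Saved as an executable script, it could be registered with something like `git config --add gc.recentObjectsHook /path/to/hook`. Filtering before printing is a design choice here: it lets the list file carry comments without violating the "one hex object ID per line, and nothing else" rule.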

VonC
  • See also [commit 4c237d2](https://github.com/git/git/commit/4c237d2ca2b5aa1f8b79e9ac5e1afa907436fbfe) and its [merged commit 25d5952](https://github.com/git/git/commit/25d59524bbc79b4f4560fca1b9b22c8c36780636). – VonC Jul 01 '23 at 12:02

Refs don't have to be branches or tags; you can keep local refs to anything you want.

Here's a simple "make me another snapshot ref for pull 137":

# Count the distinct commits already snapshotted and use the next number
next=$(( $(git rev-list --no-walk --count --glob='refs/snap/pull/137/head-v*') + 1 ))
git update-ref "refs/snap/pull/137/head-v$next" refs/pull/137/head
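A possible variant, sketched here under assumptions, numbers snapshots from the highest existing suffix instead of a count, so a deleted snapshot never causes a number to be reused. The helper names `next_snapshot_number` and `snapshot_pull` are made up for this example. The `refs/snap/pull/<n>/head-v<k>` layout is the one used above.

```shell
# Print the next free version number given snapshot ref names on stdin,
# e.g. refs/snap/pull/137/head-v3. Uses the highest existing suffix rather
# than a count, so gaps left by deleted snapshots are never reused.
next_snapshot_number() {
    sed -n 's/.*-v\([0-9][0-9]*\)$/\1/p' | sort -n | tail -n 1 | {
        read -r last || last=0   # no snapshots yet: start from 0
        echo $((last + 1))
    }
}

# Hypothetical wrapper: snapshot the current head of pull request $1.
snapshot_pull() {
    pr=$1
    next=$(git for-each-ref --format='%(refname)' "refs/snap/pull/$pr" |
        next_snapshot_number)
    git update-ref "refs/snap/pull/$pr/head-v$next" "refs/pull/$pr/head"
}
```

Calling `snapshot_pull 137` would then create the next `refs/snap/pull/137/head-vN` ref, and `git for-each-ref refs/snap/pull/137` lists the snapshots taken so far.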
jthill