I'm working on a project to catalog large binary files in a handful of large repos. I'm trying to understand under what scenario you might have Blob -> Tree -> nothing: essentially, a blob or tree that isn't attached to any commit.

I'm running something like this:

  • Get all blobs using: git rev-list --objects --all | git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)'
  • Iterate over blobs (current_blob):
      • Get all commits using git log --pretty=tformat:'%T|%h|%s|%aN|%aE'
      • Iterate through commits (current_commit):
          • Get all objects referenced by the commit using git -C $RepoFolder ls-tree -r <current_commit.id>
          • If any of the objects referenced by the commit match current_blob, then we've found the commit for this blob
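
The inner loop above can be sketched as a small shell function. This is only an illustration: the function name commits_containing_blob is hypothetical, and it starts the walk from every ref (git rev-list --all) rather than from HEAD alone, which is an assumption that differs from the plain git log in the steps above.

```shell
# Hypothetical helper: print every commit, reachable from any ref, whose
# tree contains the given blob ID. Slow on large repos (one ls-tree per commit).
commits_containing_blob() {
    repo=$1
    blob=$2
    git -C "$repo" rev-list --all |
    while read -r commit; do
        # ls-tree -r prints "<mode> <type> <id><TAB><path>" for each file
        if git -C "$repo" ls-tree -r "$commit" | grep -q " blob $blob"; then
            echo "$commit"
        fi
    done
}
```

A call such as commits_containing_blob "$RepoFolder" <blob-id> prints nothing when no commit on any ref contains the blob.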

What I'm finding is that some blobs belong to trees that are not referenced by any commit.

Under what scenario does that happen?

akanieski

2 Answers

That phenomenon is called an unreachable object. You are most likely familiar with one kind of unreachable object, the dangling commit, which most commonly occurs when you hard-reset a branch, dropping (hopefully) unwanted commits in the process.

The same happens with many other Git operations, notably every invocation of git add (as git-gc's manpage points out): if you never commit that added state of the file (but perhaps a later state, after a second git add), the blob written by the first git add is left unreachable.
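
A minimal sketch of that git add case, run in a throwaway repository. The file name, contents, and the demo identity passed via -c are all made up for illustration.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

echo "first draft" > notes.txt
git add notes.txt                           # writes a blob for "first draft"
staged_first=$(git rev-parse :notes.txt)    # blob ID currently in the index

echo "second draft" > notes.txt
git add notes.txt                           # index entry now points to a new blob
git -c user.name=demo -c user.email=demo@example.com commit -q -m "notes"

# The first blob is still in the object database, but nothing refers to it.
unreachable=$(git fsck --unreachable --no-reflogs 2>/dev/null)
echo "$unreachable"
```

The fsck report should include a line of the form "unreachable blob <id>", where <id> is the value saved in staged_first.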

ojdo

Let me put this up front as it may be the most relevant part: blobs referenced by unreferenced trees typically come from using git write-tree. Some Git scripts use this command as a quick way to abort if the index contains unmerged entries.
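
Here is a rough sketch of how git write-tree can leave behind exactly the Blob -> Tree -> nothing shape from the question. It uses a throwaway repository; the file name, contents, and demo identity are assumptions for illustration only.

```shell
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

echo "one" > f.txt
git add f.txt
orphan_tree=$(git write-tree)        # tree object written; no commit uses it
orphan_blob=$(git rev-parse :f.txt)  # blob that this tree refers to

echo "two" > f.txt
git add f.txt                        # the index moves on to a different blob
git -c user.name=demo -c user.email=demo@example.com commit -q -m "two"

report=$(git fsck --unreachable --no-reflogs 2>/dev/null)
echo "$report"
git ls-tree "$orphan_tree"           # the orphan tree still lists the orphan blob
```

After the commit, neither the tree from write-tree nor the blob it lists is part of any commit, so fsck reports both as unreachable.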

In general, unreferenced items are normal enough; they're eventually collected and discarded by git gc, usually as a result of a background automatic git gc --auto.

Besides ojdo's answer, consider this:

  • Get all commits using git log --pretty=tformat:'%T|%h|%s|%aN|%aE'

The git log command does a revision (commit-graph) walk starting from the specified revisions, or from HEAD if no starting revision is provided. Some commits may be reachable only from some specific refs.

Even if you add --branches here, this only starts from all branches; some commits might be reachable only from some specific tag, or from a remote-tracking name. Using --all augments this to start from all refs ... but this still omits non-ref references, such as ORIG_HEAD and reflog entries.
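
To see how much each starting point misses, something like the following may help. The two function names are hypothetical; the rev-list options (--all, --reflog, --not) are standard ones.

```shell
# Commits reachable from some ref (branch, tag, remote-tracking name, ...)
# but not from HEAD: a plain `git log` never visits these.
commits_not_on_head() {
    git -C "$1" rev-list --all --not HEAD
}

# Commits that only reflog entries still point to: even --all misses these.
commits_only_in_reflogs() {
    git -C "$1" rev-list --reflog --not --all
}
```

If either function prints anything for your repository, the commit list you iterate over is smaller than the set of commits that keep objects alive.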

Both git fsck and git gc need a fancier method by which they can find all references, including hidden ones. Getting this right is actually pretty hard: we must not only consult all refs and reflogs, we must also look at all per-work-tree refs (including each one's HEAD) and each work-tree's index. This was broken between Git 2.5, where git worktree add was first introduced, and Git 2.15, where the bugs were fixed. Git 2.5 through 2.14 failed to check the per-work-tree refs and would thus incorrectly garbage-collect expired (via prune-time) loose objects that were in use in added work-trees.
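
As a rough approximation of that search (not what git fsck or git gc actually run internally), you can subtract everything reachable from refs, reflogs, and the index from the full object list. The function name is hypothetical, and this sketch makes no attempt to cover linked work-trees; git fsck --unreachable remains the more reliable tool.

```shell
# Hypothetical helper: print "<id> <type>" for objects that no ref, reflog
# entry, or index entry reaches. Linked work-trees are not handled here.
unreferenced_objects() {
    repo=$1
    all=$(mktemp)
    live=$(mktemp)
    # Every object in the repository, loose or packed
    git -C "$repo" cat-file --batch-all-objects \
        --batch-check='%(objectname) %(objecttype)' | sort > "$all"
    # Everything reachable from refs, reflogs, and the index
    git -C "$repo" rev-list --objects --all --reflog --indexed-objects |
        cut -d' ' -f1 | sort > "$live"
    # Keep only the lines of $all whose ID does not appear in $live
    join -v 1 "$all" "$live"
    rm -f "$all" "$live"
}
```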

Git's index never contains any tree object ID in the primary section (the one listed by git ls-files --stage). Only blob objects, including both regular files and symbolic links, and gitlinks appear in this section of the index. Gitlinks hold commit hash IDs from other repositories and must be ignored. However, there are extension records in the index. As far as I know these extension records don't count for liveness, so a tree extension would perhaps become invalid. This might not be the case (perhaps a TREE record does count as keeping a tree object live), but given that they're supposed to be ignorable, I suspect they're not. See the technical documentation file on the index for more.
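
To check the first claim on a given repository, a sketch like this prints the object type behind each primary index entry. The function name is hypothetical; gitlinks are labeled without being looked up, since their commits live in another repository.

```shell
# Print the object type behind every primary entry in the index of repo $1.
index_entry_types() {
    git -C "$1" ls-files --stage |
    while read -r mode id stage path; do
        if [ "$mode" = 160000 ]; then
            echo "gitlink $path"      # commit ID from another repository
        else
            echo "$(git -C "$1" cat-file -t "$id") $path"
        fi
    done
}
```

Files in subdirectories appear as individual blob entries with their full paths; no line should ever start with "tree".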

torek