You can try again that same command, adding the new --filter=tree:0
option. That should speed up the making of the list of commits wanted.
That is because with Git 2.20 (Q4 2018), the "rev-list --filter
" feature learned to exclude all trees via "tree:0
" filter.
See commit 8b10a20 (18 Oct 2018), commit d9e6d09 (12 Oct 2018), commit bc5975d, commit cc0b05a, commit 696aa73, commit 99c9aa9, commit 7c0fe33 (05 Oct 2018), commit f1d02da (15 Aug 2018), and commit 9202489, commit f447a49 (13 Aug 2018) by Matthew DeVore (matvore
).
(Merged by Junio C Hamano -- gitster
-- in commit 77d5037, 30 Oct 2018)
list-objects
: support for skipping tree traversal
The tree:0
filter does not need to traverse the trees that it has
filtered out, so optimize list-objects and list-objects-filter to skip
traversing the trees entirely.
list-objects-filter: implement filter tree:0
Teach list-objects the "tree:0
" filter which allows for filtering
out all tree and blob objects (unless other objects are explicitly
specified by the user). The purpose of this patch is to allow smaller
partial clones.
The name of this filter - tree:0
- does not explicitly specify that
it also filters out all blobs, but this should not cause much confusion
because blobs are not at all useful without the trees that refer to
them.
I also considered only:commits
as a name, but this is inaccurate because
it suggests that annotated tags are omitted, but actually they are
included.
The name "tree:0
" allows later filtering based on depth, i.e. "tree:1
"
would filter out all but the root tree and blobs.
In order to avoid confusion between 0 and capital O, the documentation was worded in a somewhat round-about way that also hints at this future improvement to the feature.
Note that Git 2.22 (Q2 2019) will fix a regression, where "is this object available to us?
" check for
well-known objects like an empty tree (which should yield "yes",
even when there is no on-disk object for an empty tree), has
been corrected.
See commit f06ab02 (04 Mar 2019) by Jeff King (peff
).
(Merged by Junio C Hamano -- gitster
-- in commit 83b13e2, 20 Mar 2019)
rev-list
: allow cached objects in existence check
This fixes a regression in 7c0fe33 (rev-list: handle missing tree
objects properly, 2018-10-05) where rev-list
will now complain about the
empty tree when it doesn't physically exist on disk.
Before that commit, we relied on the traversal code in list-objects.c
to
walk through the trees. Since it uses parse_tree()
, we'd do a normal
object lookup that includes looking in the set of "cached" objects
(which is where our magic internal empty-tree kicks in).
After that commit, we instead tell list-objects.c
not to die on any
missing trees, and we check them ourselves using has_object_file()
. But
that function uses OBJECT_INFO_SKIP_CACHED
, which means we won't use our
internal empty tree.
This normally wouldn't come up. For most operations, Git will try to
write out the empty tree object as it would any other object. And
pack-objects
in a push
or fetch
will send the empty tree (even if it's
virtual on the sending side).
However, there are cases where this can
matter. One I found in the wild:
The root tree of a commit became empty by deleting all files, without using an index. In this case it was done using libgit2's tree builder API, but as the included test shows, it can easily be done with regular git using hash-object.
The resulting repo works OK, as we'd avoid walking over our own reachable commits for a connectivity check.
Cloning with --reference
pointing to the repository from (1) can trigger the problem, because we tell the other side we already have that commit (and hence the empty tree), but then walk over it during the connectivity check (where we complain about it missing).
Arguably the workflow in step (1) should be more careful about writing
the empty tree object if we're referencing it. But this workflow did
work prior to 7c0fe33, so let's restore it.
Another regression fixed in Git 2.22: A "wrong" object appearing where an object
of a different type is expected, instead of blindly assuming that
the connection between objects are correctly made.
See commit b49e74e, commit 23c2044, commit 0616617 (10 Apr 2019), and commit 5c07647 (05 Apr 2019) by Taylor Blau (ttaylorr
).
See commit 97dd512, commit ee4dfee, commit 8348766 (10 Apr 2019) by Jeff King (peff
).
(Merged by Junio C Hamano -- gitster
-- in commit ea2dab1, 08 May 2019)
rev-list: let traversal die when --missing is not in use
Commit 7c0fe33 (rev-list
: handle missing tree objects properly,
2018-10-05, Git v2.20.0-rc0) taught the traversal machinery used by git-rev-list to ignore missing trees, so that rev-list could handle them itself.
However, it does so only by checking via oid_object_info_extended()
that
the object exists at all.
This can miss several classes of errors that were previously detected by rev-list
:
type mismatches (e.g., we expected a tree but got a blob)
failure to read the object data (e.g., due to bitrot on disk)
This is especially important because we use "rev-list --objects
" as our
connectivity check to admit new objects to the repository, and it will
now miss these cases (though the bitrot one is less important here,
because we'd typically have just hashed and stored the object).
Git 2.23 (Q3 2019) fixes another bug: The filter_data
used in the list-objects-filter
(which manages a lazily sparse clone repository) did not use the dynamic array API correctly---'nr
' is supposed to point at one past the last element
of the array in use.
See commit 7140600 (31 May 2019) by Matthew DeVore (matvore
).
(Merged by Junio C Hamano -- gitster
-- in commit 34032c4, 21 Jun 2019)
list-objects-filter
: correct usage of ALLOC_GROW
In the sparse filter data, array_frame
array is used in a way such that
nr
is the index of the last element.
Fix this so that nr
is actually the number of elements in the array.
The filter_sparse_free
function also has an unaddressed TODO
to free the
memory associated with the sparse filter data.
Address that TODO
and fix the memory leak.
With Git 2.24 (Q4 2019), The name of the blob object that stores the filter specification for sparse cloning/fetching was interpreted in a wrong place in the
code, causing Git to abort: this has been fixed.
See commit a4cafc7, commit 4c96a77 (15 Sep 2019) by Jeff King (peff
).
See commit cf34337, commit 72de589 (15 Sep 2019) by Jon Simons (simonsj
).
(Merged by Junio C Hamano -- gitster
-- in commit ad8f036, 07 Oct 2019)
Git 2.24 (Q4 2019) will make the object traversal faster!
It has been optimized not to load tree objects when we are only interested in commit history.
See commit 72ed80c (12 Sep 2019) by Jeff King (peff
).
(Merged by Junio C Hamano -- gitster
-- in commit bbfe5f2, 07 Oct 2019)
list-objects
: don't queue root trees unless revs->tree_objects
is set
When traverse_commit_list()
processes each commit, it queues the commit's root tree in the pending array.
Then, after all commits are processed, it calls traverse_trees_and_blobs()
to walk over the pending list, calling process_tree()
on each.
But if revs->tree_objects
is not set, process_tree()
just exists immediately!
We can save ourselves some work by not even bothering to queue these trees in the first place.
There are a few subtle points to make:
we also detect commits with a NULL tree pointer here.
But this isn't an interesting check for broken commits, since the lookup_tree()
we'd have done during commit parsing doesn't actually check that we have the tree on disk.
So we're not losing any robustness.
besides queueing, we also set the NOT_USER_GIVEN
flag on the tree object.
This is used by the traverse_commit_list_filtered()
variant.
But if we're not exploring trees, then we won't actually care about this flag, which is used only inside process_tree()
code-paths.
queueing trees eventually leads to us queueing blobs, too.
But we don't need to check revs->blob_objects
here.
Even in the current code, we still wouldn't find those blobs, because we'd never open up the tree objects to list their contents.
the user-visible impact to the caller is minimal.
The pending trees are all cleared by the time the function returns anyway, by traverse_trees_and_blobs()
.
We do call a show_commit()
callback, which technically could be looking at revs->pending
during the callback.
But it seems like a rather unlikely thing to do (if you want the tree of the current commit, then accessing the tree struct member is a lot simpler).
So this should be safe to do.
Let's look at the benefits:
[before]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ): 7.651 s ± 0.021 s [User: 7.399 s, System: 0.252 s]
Range (min … max): 7.607 s … 7.683 s 10 runs
[after]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ): 7.593 s ± 0.023 s [User: 7.329 s, System: 0.264 s]
Range (min … max): 7.565 s … 7.634 s 10 runs
Not too impressive, but then we're really just avoiding sticking a pointer into a growable array.
But still, I'll take a free 0.75% speedup.
Let's try it after running "git commit-graph write
":
[before]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ): 1.458 s ± 0.011 s [User: 1.199 s, System: 0.259 s]
Range (min … max): 1.447 s … 1.481 s 10 runs
[after]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ): 1.126 s ± 0.023 s [User: 896.5 ms, System: 229.0 ms]
Range (min … max): 1.106 s … 1.181 s 10 runs
Now that's more like it. We saved over 22% of the total time.
Part of that is because the runtime is shorter overall, but the absolute
improvement is also much larger.
What's going on?
When we fill in a commit struct using the commit graph, we don't bother to set the tree pointer, and instead lazy-load it when somebody calls get_commit_tree()
.
So we're not only skipping the pointer write to the pending queue, but we're skipping the lazy-load of the tree entirely.
Note that with Git 2.34 (Q4 2021), the method traverse_trees_and_blobs
has been renamed as traverse_non_commits
.
See commit b3e36df (12 Aug 2021) by Teng Long (dyrone
).
(Merged by Junio C Hamano -- gitster
-- in commit 0d4f46b, 30 Aug 2021)
list-objects.c
: rename "traverse_trees_and_blobs"
to "traverse_non_commits"
Signed-off-by: Teng Long
Function traverse_trees_and_blobs
not only works on trees and blobs, but also on tags, the function name is somewhat misleading.
This commit rename it to traverse_non_commits
.
Git 2.39 (Q4 2022) include parse_object()
hardening when checking for the existence of a suspected blob object.
See commit 8db2dad, commit 04fb962 (17 Nov 2022) by Jeff King (peff
).
See commit 40286ca (21 Nov 2022) by Ævar Arnfjörð Bjarmason (avar
).
(Merged by Junio C Hamano -- gitster
-- in commit ba88f8c, 28 Nov 2022)
parse_object()
: check on-disk type of suspected blob
Signed-off-by: Jeff King
Signed-off-by: Taylor Blau
In parse_object()
, we try to handle blobs by streaming rather than loading them entirely into memory.
The most common case here will be that we haven't seen the object yet and check oid_object_info()
, which tells us we have a blob.
But we trigger this code on one other case: when we have an in-memory object struct with type OBJ_BLOB
(and without its "parsed" flag set, since otherwise we'd return early from the function).
This indicates that some other part of the code suspected we have a blob (e.g., it was mentioned by a tree or tag) but we haven't yet looked at the on-disk copy.
I reworked the conditional a bit so that instead of:
if ((suspected_blob && oid_object_info() == OBJ_BLOB)
(no_clue && oid_object_info() == OBJ_BLOB)
we have the simpler:
if ((suspected_blob || no_clue) && oid_object_info() == OBJ_BLOB)
This is shorter, but also reflects what we really want say, which is "have we ruled out this being a blob; if not, check it on-disk".
So this fixes one of the lingering expect_failure
cases from 0616617 ("t
: introduce tests for unexpected object types", 2019-04-09, Git v2.22.0-rc0 -- merge listed in batch #8).
Prior to this commit, we'd quietly check the sha1 and mark the blob as "parsed".
Now we correctly complain about the mismatch ("hash mismatch
").