Slow git rev-list when iterating git commit history over time intervals

Question

I wrote a tool that uses git rev-list -n1 --before=X to iterate a Git commit history using fixed time intervals, so that I see the last revision for every year, month, etc.

The problem is that rev-list kicks off a new revision walk on every call, and it takes longer the father back I go. Here are some samples from the Linux kernel source.

$ time git rev-list -n1 --before="Jan 1 2016" HEAD
a889331d759453fa7f424330f75ae4e2b9e02db4

real    0m1.395s
user    0m1.367s
sys 0m0.024s

$ time git rev-list -n1 --before="Jan 1 2015" HEAD
5b5e76218fbdbb71a01d5480f289ead624232876

real    0m2.349s
user    0m2.306s
sys 0m0.036s

$ time git rev-list -n1 --before="Jan 1 2005" HEAD

real    0m5.556s
user    0m5.435s
sys 0m0.105s

If I want to call rev-list in a loop over N decreasing dates, that loop runs N walks that take longer on every iteration. The docs talk about bit-maps and object traversal strategies to speed up the history, but I am having trouble understanding them. I tried git repack -ab followed by git rev-list --use-bitmap-index, but that didn't improve results.

My only requirement is that given any position for HEAD, I can accurately pinpoint the first revision that appears before the date given to --before, following paths to ancestors if needed.

What is the best way to make rev-list faster for this use case?

score 2 · Answer 1 · answered Jun 07 '17 at 04:58

In the general case, there is no alternative to these two options:

pay the full cost of the walk, or
store much more information than git rev-list emits when you just ask for one node

because the graph walk effectively linearizes (via a priority queue, with the dates as the priorities) an arbitrarily bushy tree (this "tree" is formed by cutting off rejoins, by simply not re-visiting any already-visited node in the DAG). For instance, suppose we have a commit DAG that looks much like this. Assume all the directional arcs point left-ish (directly left, or up-and-left, or down-and-left).

       A-....--o
      /         \
...--o----...----o   <-- HEAD
      \         /
       B--...--o

where there are any number of nodes anywhere there are two or three dots. One of your selections might pick node A as the first one visited that meets the --before criteria. Node B might come earlier than A and hence might be one selected by an earlier --before that starts from HEAD (since node B is reachable from HEAD). But node B is not reachable from node A, so by simply starting from node A, you will never find node B.

If you somehow had git rev-list dump out its current priority queue contents (plus information about all the nodes it had visited thus far—though this is not, in the end, required to get the same results; that would just be an option for potentially speeding up some node-trimming) you could restart its graph-walk operation from those points in order to search for B, and thereby avoid revisiting many of the nodes in the large mesh-y area above.

If you strongly limit the graph walk, though, e.g., using --first-parent, you can simply restart the walk from the most-recently-found commit, since you know that the queue depth is always 1 and therefore the next node to visit is the first parent of the node you found earlier (which you can leave to git rev-list).

jthill · Accepted Answer · 2017-06-07T05:36:03.210

2

Repeatedly scanning a list to select successive elements is O(N^2). Doesn't much matter how efficient the scan is, that N^2 is going to bite.

Generate one list with commit id and date, then strip what you don't want and generate your real log messages from the selected sha's. That's three passes total, not N.

git log --first-parent --pretty=%H\ %cd --date=short \
| awk '$2$3 != last { last=$2$3; print $1}' 'FS=[- ]' \
| git log --no-walk --stdin

That took fifteen seconds cold-cache on the linux repo, with a spinny-things hdd, listing 147 commits. A rerun took less than a second.

edit: subbing in --date-order for --first-parent to consider all paths took 25.1 seconds cold-cache, 7.9 seconds hot, to list 782 commits.

edited Jun 07 '17 at 05:36

answered Jun 07 '17 at 05:24

jthill

55,082
5
77
137

I'm loving this approach, but I was hoping the last commit reported matched the chronologically first revision from `git rev-list --max-parents=0 HEAD`, namely `1da177e4c3f4`. This is contained in Linux's `master` so why am I not seeing it in your command? – Sage Gerard Jun 07 '17 at 06:03
$2$3 is year and month, it shows the most-recent commit for each month. 1da177etc's 2005-04-16¸ it's showing a more recent commit from that April. – jthill Jun 07 '17 at 06:26

VonC · Answer 3 · 2022-11-28T20:19:35.650

You can try again that same command, adding the new --filter=tree:0 option. That should speed up the making of the list of commits wanted.

That is because with Git 2.20 (Q4 2018), the "rev-list --filter" feature learned to exclude all trees via "tree:0" filter.

See commit 8b10a20 (18 Oct 2018), commit d9e6d09 (12 Oct 2018), commit bc5975d, commit cc0b05a, commit 696aa73, commit 99c9aa9, commit 7c0fe33 (05 Oct 2018), commit f1d02da (15 Aug 2018), and commit 9202489, commit f447a49 (13 Aug 2018) by Matthew DeVore (matvore).
^{(Merged by Junio C Hamano -- gitster -- in commit 77d5037, 30 Oct 2018)}

list-objects: support for skipping tree traversal

The tree:0 filter does not need to traverse the trees that it has filtered out, so optimize list-objects and list-objects-filter to skip traversing the trees entirely.

list-objects-filter: implement filter tree:0

Teach list-objects the "tree:0" filter which allows for filtering out all tree and blob objects (unless other objects are explicitly specified by the user). The purpose of this patch is to allow smaller partial clones.

The name of this filter - tree:0 - does not explicitly specify that it also filters out all blobs, but this should not cause much confusion because blobs are not at all useful without the trees that refer to them.

I also considered only:commits as a name, but this is inaccurate because it suggests that annotated tags are omitted, but actually they are included.

The name "tree:0" allows later filtering based on depth, i.e. "tree:1" would filter out all but the root tree and blobs.
In order to avoid confusion between 0 and capital O, the documentation was worded in a somewhat round-about way that also hints at this future improvement to the feature.

Note that Git 2.22 (Q2 2019) will fix a regression, where "is this object available to us?" check for well-known objects like an empty tree (which should yield "yes", even when there is no on-disk object for an empty tree), has been corrected.

See commit f06ab02 (04 Mar 2019) by Jeff King (peff).
^{(Merged by Junio C Hamano -- gitster -- in commit 83b13e2, 20 Mar 2019)}

rev-list: allow cached objects in existence check

This fixes a regression in 7c0fe33 (rev-list: handle missing tree objects properly, 2018-10-05) where rev-list will now complain about the empty tree when it doesn't physically exist on disk.

Before that commit, we relied on the traversal code in list-objects.c to walk through the trees. Since it uses parse_tree(), we'd do a normal object lookup that includes looking in the set of "cached" objects (which is where our magic internal empty-tree kicks in).

After that commit, we instead tell list-objects.c not to die on any missing trees, and we check them ourselves using has_object_file(). But that function uses OBJECT_INFO_SKIP_CACHED, which means we won't use our internal empty tree.

This normally wouldn't come up. For most operations, Git will try to write out the empty tree object as it would any other object. And pack-objects in a push or fetch will send the empty tree (even if it's virtual on the sending side).
However, there are cases where this can matter. One I found in the wild:

The root tree of a commit became empty by deleting all files, without using an index. In this case it was done using libgit2's tree builder API, but as the included test shows, it can easily be done with regular git using hash-object.
The resulting repo works OK, as we'd avoid walking over our own reachable commits for a connectivity check.

Cloning with --reference pointing to the repository from (1) can trigger the problem, because we tell the other side we already have that commit (and hence the empty tree), but then walk over it during the connectivity check (where we complain about it missing).

Arguably the workflow in step (1) should be more careful about writing the empty tree object if we're referencing it. But this workflow did work prior to 7c0fe33, so let's restore it.

Another regression fixed in Git 2.22: A "wrong" object appearing where an object of a different type is expected, instead of blindly assuming that the connection between objects are correctly made.

See commit b49e74e, commit 23c2044, commit 0616617 (10 Apr 2019), and commit 5c07647 (05 Apr 2019) by Taylor Blau (ttaylorr).
See commit 97dd512, commit ee4dfee, commit 8348766 (10 Apr 2019) by Jeff King (peff).
^{(Merged by Junio C Hamano -- gitster -- in commit ea2dab1, 08 May 2019)}

rev-list: let traversal die when --missing is not in use

Commit 7c0fe33 (rev-list: handle missing tree objects properly, 2018-10-05, Git v2.20.0-rc0) taught the traversal machinery used by git-rev-list to ignore missing trees, so that rev-list could handle them itself.

However, it does so only by checking via oid_object_info_extended() that the object exists at all.
This can miss several classes of errors that were previously detected by rev-list:

type mismatches (e.g., we expected a tree but got a blob)

failure to read the object data (e.g., due to bitrot on disk)

This is especially important because we use "rev-list --objects" as our connectivity check to admit new objects to the repository, and it will now miss these cases (though the bitrot one is less important here, because we'd typically have just hashed and stored the object).

Git 2.23 (Q3 2019) fixes another bug: The filter_data used in the list-objects-filter (which manages a lazily sparse clone repository) did not use the dynamic array API correctly---'nr' is supposed to point at one past the last element of the array in use.

See commit 7140600 (31 May 2019) by Matthew DeVore (matvore).
^{(Merged by Junio C Hamano -- gitster -- in commit 34032c4, 21 Jun 2019)}

list-objects-filter: correct usage of ALLOC_GROW

In the sparse filter data, array_frame array is used in a way such that nr is the index of the last element.
Fix this so that nr is actually the number of elements in the array.

The filter_sparse_free function also has an unaddressed TODO to free the memory associated with the sparse filter data.
Address that TODO and fix the memory leak.

With Git 2.24 (Q4 2019), The name of the blob object that stores the filter specification for sparse cloning/fetching was interpreted in a wrong place in the code, causing Git to abort: this has been fixed.

See commit a4cafc7, commit 4c96a77 (15 Sep 2019) by Jeff King (peff).
See commit cf34337, commit 72de589 (15 Sep 2019) by Jon Simons (simonsj).
^{(Merged by Junio C Hamano -- gitster -- in commit ad8f036, 07 Oct 2019)}

Git 2.24 (Q4 2019) will make the object traversal faster!
It has been optimized not to load tree objects when we are only interested in commit history.

See commit 72ed80c (12 Sep 2019) by Jeff King (peff).
^{(Merged by Junio C Hamano -- gitster -- in commit bbfe5f2, 07 Oct 2019)}

list-objects: don't queue root trees unless revs->tree_objects is set

When traverse_commit_list() processes each commit, it queues the commit's root tree in the pending array.
Then, after all commits are processed, it calls traverse_trees_and_blobs() to walk over the pending list, calling process_tree() on each.
But if revs->tree_objects is not set, process_tree() just exists immediately!

We can save ourselves some work by not even bothering to queue these trees in the first place.
There are a few subtle points to make:

we also detect commits with a NULL tree pointer here.
But this isn't an interesting check for broken commits, since the lookup_tree() we'd have done during commit parsing doesn't actually check that we have the tree on disk.
So we're not losing any robustness.

besides queueing, we also set the NOT_USER_GIVEN flag on the tree object.
This is used by the traverse_commit_list_filtered() variant.
But if we're not exploring trees, then we won't actually care about this flag, which is used only inside process_tree() code-paths.

queueing trees eventually leads to us queueing blobs, too.
But we don't need to check revs->blob_objects here.
Even in the current code, we still wouldn't find those blobs, because we'd never open up the tree objects to list their contents.

the user-visible impact to the caller is minimal.
The pending trees are all cleared by the time the function returns anyway, by traverse_trees_and_blobs().
We do call a show_commit() callback, which technically could be looking at revs->pending during the callback.
But it seems like a rather unlikely thing to do (if you want the tree of the current commit, then accessing the tree struct member is a lot simpler).

So this should be safe to do.

Let's look at the benefits:
[before]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ):      7.651 s ±  0.021 s    [User: 7.399 s, System: 0.252 s]
Range (min … max):    7.607 s …  7.683 s    10 runs

[after]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ):      7.593 s ±  0.023 s    [User: 7.329 s, System: 0.264 s]
Range (min … max):    7.565 s …  7.634 s    10 runs
Not too impressive, but then we're really just avoiding sticking a pointer into a growable array.
But still, I'll take a free 0.75% speedup.

Let's try it after running "git commit-graph write":
[before]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ):      1.458 s ±  0.011 s    [User: 1.199 s, System: 0.259 s]
Range (min … max):    1.447 s …  1.481 s    10 runs

[after]
Benchmark #1: git -C linux rev-list HEAD >/dev/null
Time (mean ± σ):      1.126 s ±  0.023 s    [User: 896.5 ms, System: 229.0 ms]
Range (min … max):    1.106 s …  1.181 s    10 runs
Now that's more like it. We saved over 22% of the total time.
Part of that is because the runtime is shorter overall, but the absolute improvement is also much larger.
What's going on?

When we fill in a commit struct using the commit graph, we don't bother to set the tree pointer, and instead lazy-load it when somebody calls get_commit_tree().
So we're not only skipping the pointer write to the pending queue, but we're skipping the lazy-load of the tree entirely.

Note that with Git 2.34 (Q4 2021), the method traverse_trees_and_blobs has been renamed as traverse_non_commits.

See commit b3e36df (12 Aug 2021) by Teng Long (dyrone).
^{(Merged by Junio C Hamano -- gitster -- in commit 0d4f46b, 30 Aug 2021)}

list-objects.c: rename "traverse_trees_and_blobs" to "traverse_non_commits"

^{Signed-off-by: Teng Long}

Function traverse_trees_and_blobs not only works on trees and blobs, but also on tags, the function name is somewhat misleading.
This commit rename it to traverse_non_commits.

Git 2.39 (Q4 2022) include parse_object() hardening when checking for the existence of a suspected blob object.

See commit 8db2dad, commit 04fb962 (17 Nov 2022) by Jeff King (peff).
See commit 40286ca (21 Nov 2022) by Ævar Arnfjörð Bjarmason (avar).
^{(Merged by Junio C Hamano -- gitster -- in commit ba88f8c, 28 Nov 2022)}

parse_object(): check on-disk type of suspected blob

^{Signed-off-by: Jeff King}
^{Signed-off-by: Taylor Blau}

In parse_object(), we try to handle blobs by streaming rather than loading them entirely into memory.
The most common case here will be that we haven't seen the object yet and check oid_object_info(), which tells us we have a blob.

But we trigger this code on one other case: when we have an in-memory object struct with type OBJ_BLOB (and without its "parsed" flag set, since otherwise we'd return early from the function).
This indicates that some other part of the code suspected we have a blob (e.g., it was mentioned by a tree or tag) but we haven't yet looked at the on-disk copy.

I reworked the conditional a bit so that instead of:
if ((suspected_blob && oid_object_info() == OBJ_BLOB)
    (no_clue && oid_object_info() == OBJ_BLOB)
we have the simpler:
if ((suspected_blob || no_clue) && oid_object_info() == OBJ_BLOB)
This is shorter, but also reflects what we really want say, which is "have we ruled out this being a blob; if not, check it on-disk".

So this fixes one of the lingering expect_failure cases from 0616617 ("t: introduce tests for unexpected object types", 2019-04-09, Git v2.22.0-rc0 -- merge listed in batch #8).

Prior to this commit, we'd quietly check the sha1 and mark the blob as "parsed".
Now we correctly complain about the mismatch ("hash mismatch").

Slow git rev-list when iterating git commit history over time intervals

3 Answers3

`list-objects`: support for skipping tree traversal

list-objects-filter: implement filter tree:0

`rev-list`: allow cached objects in existence check

rev-list: let traversal die when --missing is not in use

`list-objects-filter`: correct usage of `ALLOC_GROW`

`list-objects`: don't queue root trees unless `revs->tree_objects` is set

`list-objects.c`: rename `"traverse_trees_and_blobs"` to `"traverse_non_commits"`

`parse_object()`: check on-disk type of suspected blob

Linked

Slow git rev-list when iterating git commit history over time intervals

3 Answers3

list-objects: support for skipping tree traversal

list-objects-filter: implement filter tree:0

rev-list: allow cached objects in existence check

rev-list: let traversal die when --missing is not in use

list-objects-filter: correct usage of ALLOC_GROW

list-objects: don't queue root trees unless revs->tree_objects is set

list-objects.c: rename "traverse_trees_and_blobs" to "traverse_non_commits"

parse_object(): check on-disk type of suspected blob

Linked

`list-objects`: support for skipping tree traversal

`rev-list`: allow cached objects in existence check

`list-objects-filter`: correct usage of `ALLOC_GROW`

`list-objects`: don't queue root trees unless `revs->tree_objects` is set

`list-objects.c`: rename `"traverse_trees_and_blobs"` to `"traverse_non_commits"`

`parse_object()`: check on-disk type of suspected blob