git merge-base (and even git rev-list
) will be even faster with:
That is because both are base on the concept of "commits being reachable".
And recent update broke the reachability algorithm when refs (e.g. tags) that point at objects that are not commit were involved, which has been fixed.
See commit 4067a64, commit b67f6b2 (21 Sep 2018), commit 6621c83 (28 Aug 2018), and commit 6cc01743 (20 Jul 2018) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit 0f7ac90, 24 Sep 2018)
commit-reach
: use can_all_from_reach
The is_descendant_of
method previously used in_merge_bases()
to check if the commit can reach any of the commits in the provided list.
This had two performance problems:
The performance is quadratic in worst-case.
A single in_merge_bases()
call requires walking beyond the target commit in order to find the full set of boundary commits that may be merge-bases.
The can_all_from_reach
method avoids this quadratic behavior and can limit the search beyond the target commits using generation numbers.
It requires a small prototype adjustment to stop using commit-date
as a cutoff, as that optimization is no longer appropriate here.
Since in_merge_bases()
uses paint_down_to_common()
, is_descendant_of()
naturally found cutoffs to avoid walking the entire commit graph.
Since we want to always return the correct result, we cannot use the min_commit_date
cutoff in can_all_from_reach
. We then rely on generation numbers to provide the cutoff.
Since not all repos will have a commit-graph file, nor will we always have generation numbers computed for a commit-graph file, create a new method, generation_numbers_enabled()
, that checks for a commit-graph file and sees if the first commit in the file has a non-zero generation number.
In the case that we do not have generation numbers, use the old logic for is_descendant_of()
.
Performance was measured on a copy of the Linux repository using the 'test-tool reach is_descendant_of
' command using this input:
A:v4.9
X:v4.10
X:v4.11
X:v4.12
X:v4.13
X:v4.14
X:v4.15
X:v4.16
X:v4.17
X.v3.0
Note that this input is tailored to demonstrate the quadratic nature of the previous method, as it will compute merge-bases for v4.9 versus all of the later versions before checking against v4.1.
Before: 0.26 s
After: 0.21 s
Since we previously used the is_descendant_of method in the ref_newer
method, we also measured performance there using 'test-tool reach ref_newer
' with this input:
A:v4.9
B:v3.19
Before: 0.10 s
After: 0.08 s
By adding a new commit with parent v3.19, we test the non-reachable case
of ref_newer:
Before: 0.09 s
After: 0.08 s
And before Git 2.20 (Q4 2018), the generation of (experimental) commit-graph files have so far been fairly silent, even though it takes noticeable amount of time in a meaningfully large repository.
The users will now see progress output.
See commit 6b89a34 (19 Sep 2018), and commit 1f7f557, commit 7b0f229 (17 Sep 2018) by Ævar Arnfjörð Bjarmason (avar
).
Helped-by: Martin Ågren martin.agren@gmail.com.
(Merged by Junio C Hamano -- gitster
-- in commit 36d767d, 16 Oct 2018)
With Git 2.21 (Q1 2019), that progress will be more accurate:
See commit 01ca387 (19 Nov 2018) by Ævar Arnfjörð Bjarmason (avar
).
Helped-by: Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit d4c9027, 14 Jan 2019)
commit-graph
: split up close_reachable()
progress output
Amend the progress output added in 7b0f229 ("commit-graph write
:
add progress output", 2018-09-17) so that the total numbers it reports
aren't higher than the total number of commits anymore.
See this thread for a bug report pointing that out.
When I added this I wasn't intending to provide an accurate count, but
just have some progress output to show the user the command wasn't
hanging. But since we are showing numbers, let's make them
accurate. The progress descriptions were suggested by Derrick Stolee.
As noted in the original thread, we are unlikely to show anything except the "Expanding reachable..." message even on fairly large repositories such as
linux.git
.
On a test repository I have with north of 7 million commits
all of these are displayed. Two of them don't show up for long, but as
noted in future-proofing this fo
And (still Git 2.21, Q1 2019), the codepath to show progress meter while writing out commit-graph file has been improved.
See commit 49bbc57, commit 890226c, commit e59c615, commit 7c7b8a7, commit d9b1b30, commit 2894473, commit 53035c4 (19 Jan 2019) by Ævar Arnfjörð Bjarmason (avar
).
See commit 857ba92 (23 Jan 2019), and commit 5af7417 (19 Jan 2019) by SZEDER Gábor (szeder
).
(Merged by Junio C Hamano -- gitster
-- in commit e5eac57, 05 Feb 2019)
commit-graph
write: add intermediate progress
Add progress output to sections of code between "Annotating[...]
" and
"Computing[...]generation numbers
".
This can collectively take 5-10 seconds on a large enough repository.
On a test repository with I have, with ~7 million commits and ~50 million objects, we'll now emit:
$ ~/g/git/git --exec-path=$HOME/g/git commit-graph write
Finding commits for commit graph among packed objects: 100% (124763727/124763727), done.
Loading known commits in commit graph: 100% (18989461/18989461), done.
Expanding reachable commits in commit graph: 100% (18989507/18989461), done.
Clearing commit marks in commit graph: 100% (18989507/18989507), done.
Counting distinct commits in commit graph: 100% (18989507/18989507), done.
Finding extra edges in commit graph: 100% (18989507/18989507), done.
Computing commit graph generation numbers: 100% (7250302/7250302), done.
Writing out commit graph in 4 passes: 100% (29001208/29001208), done.
Whereas on a medium-sized repository such as linux.git
, these new progress bars won't have time to kick in and as before and we'll still emit output like:
$ ~/g/git/git --exec-path=$HOME/g/git commit-graph write
Finding commits for commit graph among packed objects: 100% (6529159/6529159), done.
Expanding reachable commits in commit graph: 815990, done.
Computing commit graph generation numbers: 100% (815983/815983), done.
Writing out commit graph in 4 passes: 100% (3263932/3263932), done.
Git 2.24 (Q4 2019) adds a few fixes to make it faster.
See commit dd2e50a (07 Sep 2019) by Jeff King (peff
).
See commit 67fa6aa (07 Sep 2019) by SZEDER Gábor (szeder
).
(Merged by Junio C Hamano -- gitster
-- in commit cda8faa, 07 Oct 2019)
commit-graph
: turn off save_commit_buffer
The commit-graph tool may read a lot of commits, but it only cares about
parsing their metadata (parents, trees, etc) and doesn't ever show the
messages to the user.
And so it should not need save_commit_buffer
, which is meant for holding onto the object data of parsed commits so that we can show them later. In fact, it's quite harmful to do so.
According to massif, the max heap of "git commit-graph write --reachable
" in linux.git
before/after this patch (removing the commit graph file in between) goes from ~1.1GB
to ~270MB
.
Which isn't surprising, since the difference is about the sum of the
uncompressed sizes of all commits in the repository, and this was
equivalent to leaking them.
This obviously helps if you're under memory pressure, but even without
it, things go faster.
My before/after times for that command (without massif) went from 12.521s
to 11.874s
, a speedup of ~5%
.
And:
commit-graph
: don't show progress percentages while expanding reachable commits
Commit 49bbc57 (commit-graph write: emit a percentage for all
progress, 2019-01-19, Git v2.21.0-rc0) was a bit overeager when it added progress
percentages to the "Expanding reachable commits in commit graph" phase
as well, because most of the time the number of commits that phase has
to iterate over is not known in advance and grows significantly, and,
consequently, we end up with nonsensical numbers:
$ git commit-graph write --reachable
Expanding reachable commits in commit graph: 138606% (824706/595), done.
[...]
$ git rev-parse v5.0 | git commit-graph write --stdin-commits
Expanding reachable commits in commit graph: 81264400% (812644/1), done.
[...]
Even worse, because the percentage grows so quickly, the progress code
outputs much more often than it should (because it ticks every second,
or every 1%), slowing the whole process down.
My time for "git commit-graph write --reachable
" on linux.git
went from 13.463s
to 12.521s
with this patch, ~7%
savings.
Therefore, don't show progress percentages in the "Expanding reachable
commits in commit graph" phase.
Git 2.24 (Q4 2019) adds another optimization.
See commit 7371612 (26 Aug 2019) by Garima Singh (singhgarima
).
(Merged by Junio C Hamano -- gitster
-- in commit caf150c, 07 Oct 2019)
commit-graph
: add --[no-]progress
to write and verify
Add --[no-]progress
to git commit-graph write
and verify
.
The progress feature was introduced in 7b0f229
("commit-graph write
: add progress output", 2018-09-17, Git v2.20.0-rc0) but
the ability to opt-out was overlooked.
With Git 2.28 (Q3 2020), "git merge-base --is-ancestor
" is taught to take advantage of the commit graph.
See commit 80b8ada, commit d91d6fb (17 Jun 2020) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit dc4b3cf, 25 Jun 2020)
commit-reach
: use fast logic in repo_in_merge_base
Reported-by: Ævar Arnfjörð Bjarmason
Reported-by: SZEDER Gábor
Signed-off-by: Derrick Stolee
The repo_is_descendant_of()
method is aware of the existence of the commit-graph file. It checks for generation_numbers_enabled()
before deciding on using can_all_from_reach()
or repo_in_merge_bases()
depending on the situation. The reason here is that can_all_from_reach()
uses a depth-first search that is limited by the minimum generation number of the target commits, and that algorithm can be very slow when generation numbers are not present. The alternative uses paint_down_to_common()
which will walk the entire merge-base boundary, which is typically slower.
This method is used by commands like "git tag --contains
" and "git branch --contains
" for very fast results when a commit-graph file exists.
Unfortunately, it is not used in commands like "git merge-base --is-ancestor
" which is doing an even simpler request.
This issue was raised recently with respect to a change to how generation numbers are stored, but was also reported much earlier before commit-reach.c
existed to simplify these reachability queries.
The root cause is that builtin/merge-base.c
has a method handle_is_ancestor()
that calls in_merge_bases()
, an older version of repo_in_merge_bases()
.
It would be better if we have every caller to in_merge_bases()
use the logic in can_all_from_reach()
when possible.
This is where things get a little tricky: repo_is_descendant_of()
calls repo_in_merge_bases()
in the non-generation numbers enabled case! If we simply update repo_in_merge_bases()
to call repo_is_descendant_of()
instead of repo_in_merge_bases_many()
, then we will get a recursive call loop. Thankfully, this is caught by the test suite in the default mode (i.e. GIT_TEST_COMMIT_GRAPH=0
).
The trick, then, is to make the non-generation number case for repo_is_descendant_of()
call repo_in_merge_bases_many()
directly, skipping the non-_many
version. This allows us to take advantage of this faster code path, when possible.
The easiest way to measure the performance impact is to test the following command on the Linux kernel repository:
git merge-base --is-ancestor <A> <B>
| A | B | Time Before | Time After |
|------|------|-------------|------------|
| v3.0 | v5.7 | 0.459s | 0.028s |
| v4.0 | v5.7 | 0.267s | 0.021s |
| v5.0 | v5.7 | 0.074s | 0.013s |
Note that each of these samples return success. The old code performed the same operation when <A>
and <B>
are swapped.
However, can_all_from_reach()
will return immediately if the generation numbers show that <A>
has larger generation number than <B>
.
Thus, the time for the swapped case is universally 0.004s in each case.
With Git 2.28 (Q3 2020), is_descendant_of()
is no longer used:
See commit c1ea625 (23 Jun 2020) by Carlo Marcelo Arenas Belón (carenas
).
(Merged by Junio C Hamano -- gitster
-- in commit 0258ed1, 06 Jul 2020)
commit-reach
: avoid is_descendant_of()
shim
Helped-by: Derrick Stolee
Signed-off-by: Carlo Marcelo Arenas Belón
Reviewed-by: Derrick Stolee
d91d6fbf26 ("commit-reach
: create repo_is_descendant_of()
", 2020-06-17, Git v2.28.0-rc0 -- merge listed in batch #5) adds a repository aware version of is_descendant_of()
and a backward compatibility shim that is barely used.
Update all callers to directly use the new repo_is_descendant_of()
function instead; making the codebase simpler and pushing more the_repository
references higher up the stack.
Also, before Git 2.31 (Q1 2021), the code to implement "git merge-base --independent
"(man) was poorly done and was kept from the very beginning of the feature.
See commit 41f3c99, commit 3677773, commit c8d693e, commit fbc21e3 (19 Feb 2021), and commit 0fac156 (01 Feb 2021) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit 48923e8, 25 Feb 2021)
commit-reach
: use heuristic in remove_redundant()
Signed-off-by: Derrick Stolee
Reachability algorithms in commit-reach.c
frequently benefit from using the first-parent history as a heuristic for satisfying reachability queries.
The most obvious example was implemented in 4fbcca4 ("commit-reach
: make can_all_from_reach
...
linear", 2018-07-20, Git v2.20.0-rc0 -- merge listed in batch #1).
Update the walk in remove_redundant()
to use this same heuristic.
Here, we are walking starting at the parents of the input commits.
Sort those parents and walk from the highest generation to lower.
Each time, use the heuristic of searching the first parent history before continuing to expand the walk.
The order in which we explore the commits matters, so update compare_commits_by_gen
to break generation number ties with commit date.
This has no effect when the commits are in a commit-graph file with corrected commit dates computed, but it will assist when the commits are in the region "above" the commit-graph with "infinite" generation number.
Note that we cannot shift to use compare_commits_by_gen_then_commit_date
as the method prototype is different.
We use compare_commits_by_gen
for QSORT()
as opposed to as a priority function.
The important piece is to ensure we short-circuit the walk when we find that there is a single non-redundant commit.
This happens frequently when looking for merge-bases or comparing several tags with 'git merge-base --independent
'(man).
Use a new count 'count_still_independent
' and if that hits 1 we can stop walking.
To update 'count_still_independent
' properly, we add use of the RESULT flag on the input commits.
Then we can detect when we reach one of these commits and decrease the count.
We need to remove the RESULT flag at that moment because we might re-visit that commit when popping the stack.
We use the STALE flag to mark parents that have been added to the new walk_start
list, but we need to clear that flag before we start walking so those flags don't halt our depth-first-search walk.
On my copy of the Linux kernel repository, the performance of 'git merge-base --independent
' <all-tags>
goes from 1.1 seconds to 0.11 seconds.
With Git 2.31 (Q1 2021), the common code to deal with "chunked file format" that is shared by the multi-pack-index and commit-graph files have been factored out, to help codepaths for both filetypes to become more robust.
See commit c4ff24b (24 Feb 2021) by Taylor Blau (ttaylorr
).
See commit a43a2e6, commit 5387fef, commit 329fac3, commit 6ab3b8b, commit 2692c2f, commit 5f0879f, commit 63a8f0e, commit c144241, commit 0ccd713, commit 980f525, commit 7a3ada1, commit 31bda9a, commit b4d9414, commit 577dc49, commit 47410aa, commit 570df42 (18 Feb 2021), and commit eb90719 (05 Feb 2021) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit 660dd97, 01 Mar 2021)
commit-graph.c
: display correct number of chunks when writing
Reported-by: SZEDER Gábor
Signed-off-by: Taylor Blau
Acked-by: Derrick Stolee
When writing a commit-graph, a progress meter is shown which indicates the number of pieces of data to write (one per commit in each chunk).
In 47410aa ("commit-graph
: use chunk-format write API", 2021-02-18, Git v2.32.0 -- merge), the number of chunks became tracked by the new chunk-format API.
But a stray local variable was left behind from when write_commit_graph_file()
used to keep track of the same.
Since this was no longer updated after 47410aa, the progress meter appeared broken:
$ git commit-graph write --reachable
Expanding reachable commits in commit graph: 837569, done.
Writing out commit graph in 3 passes: 166% (4187845/2512707), done.
Drop the local variable and rely instead on the chunk-format API to tell us the correct number of chunks.