With Git 2.27 (Q2 2020), "git blame
" learns to take advantage of the "changed-paths
" Bloom filter stored in the commit-graph file, and introduced with git log
.
See commit 1b4c57f, commit 24b7d1e, commit fe88f9f (23 Apr 2020) by Jeff King (peff
).
See commit 0906ac2, commit b23ea97, commit 8918e37 (16 Apr 2020) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit 6d56d4c, 01 May 2020)
blame
: use changed-path
Bloom filters
Signed-off-by: Derrick Stolee
The changed-path
Bloom filters help reduce the amount of tree parsing required during history queries.
Before calculating a diff, we can ask the filter if a path changed between a commit and its first parent.
- If the filter says "no" then we can move on without parsing trees.
- If the filter says "maybe" then we parse trees to discover if the answer is actually "yes" or "no".
When computing a blame, there is a section in find_origin()
that computes a diff between a commit and one of its parents.
When this is the first parent, we can check the Bloom filters before calling diff_tree_oid()
.
In order to make this work with the blame machinery, we need to initialize a struct bloom_key
with the initial path. But also, we need to add more keys to a list if a rename is detected. We then check to see if any of these keys answer "maybe" in the diff.
If a user requests copy detection using "git blame -C
", then there are more places where the set of "important" files can expand. I do not know enough about how this happens in the blame machinery.
Thus, the Bloom filter integration is explicitly disabled in this mode.
A later change could expand the bloom_key
data with an appropriate call (or calls) to add_bloom_key()
.
Generally, this is a performance enhancement and should not change the behavior of 'git blame
' in any way.
If a repo has a commit-graph file with computed changed-path Bloom filters, then they should notice improved performance for their 'git blame
' commands.
Here are some example timings that I found by blaming some paths in the Linux kernel repository:
I specifically looked for "deep" paths that were also edited many times.
As a counterpoint, the MAINTAINERS
file was edited many times but is located in the root tree.
This means that the cost of computing a diff relative to the pathspec is very small. Here are the timings for that command:
These timings are the best of five.
The worst-case runs were on the order of 2.5 minutes for both cases.
Note that the MAINTAINERS
file has 18,740 lines across 17,000+ commits. This happens to be one of the cases where this change provides the least improvement.
The lack of improvement for the MAINTAINERS
file and the relatively modest improvement for the other examples can be easily explained.
The blame machinery needs to compute line-level diffs to determine which lines were changed by each commit. That makes up a large proportion of the computation time, and this change does not attempt to improve on that section of the algorithm.
The MAINTAINERS
file is large and changed often, so it takes time to determine which lines were updated by which commit. In contrast, the code files are much smaller, and it takes longer to compute the line-by-line diff for a single patch on the Linux mailing lists.
Outside of the "-C
" integration, I believe there is little more to gain from the changed-path Bloom filters for 'git blame
' after this patch.
Make sure to use Git 2.29 (Q4 2020), though, as there was a small bug:
See commit 1302bad (08 Sep 2020) by Edmundo Carmona Antoranz (eantoranz
).
(Merged by Junio C Hamano -- gitster
-- in commit e1dd499, 18 Sep 2020)
blame.c
: replace instance of !oidcmp
for oideq
Signed-off-by: Edmundo Carmona Antoranz
0906ac2b ("blame
: use changed-path Bloom filters", 2020-04-16, Git v2.27.0-rc0 -- merge listed in batch #6) introduced a call to oidcmp() that should have been oideq()
, which was introduced in 14438c44 ("introduce hasheq()
and oideq()
", 2018-08-28, Git v2.20.0-rc0 -- merge listed in batch #1).
With Git 2.29 (Q4 2020), "git commit-graph
(man) write" learned to limit the number of bloom filters that are computed from scratch with the --max-new-filters
option.
That will benefit git blame
.
See commit d356d5d, commit 98bb796, commit 59f0d50, commit 97ffa4f (17 Sep 2020), commit 809e032 (18 Sep 2020), commit 9a7a9ed, commit 312cff5 (16 Sep 2020), and commit b66d847, commit 24f951a, commit ab14d06, commit 025d529, commit 4f36440 (09 Sep 2020) by Taylor Blau (ttaylorr
).
See commit b16a827 (16 Sep 2020) by Derrick Stolee (derrickstolee
).
(Merged by Junio C Hamano -- gitster
-- in commit 288ed98, 29 Sep 2020)
Helped-by: Junio C Hamano
Signed-off-by: Taylor Blau
Introduce a command-line flag to specify the maximum number of new Bloom filters that a 'git commit-graph write
'(man) is willing to compute from scratch.
Prior to this patch, a commit-graph write with '--changed-paths
' would compute Bloom filters for all selected commits which haven't already been computed (i.e., by a previous commit-graph write with '--split
' such that a roll-up or replacement is performed).
This behavior can cause prohibitively-long commit-graph writes for a variety of reasons:
- There may be lots of filters whose diffs take a long time to generate (for example, they have close to the maximum number of changes, diffing itself takes a long time, etc).
- Old-style commit-graphs (which encode filters with too many entries as not having been computed at all) cause us to waste time recomputing filters that appear to have not been computed only to discover that they are too-large.
This can make the upper-bound of the time it takes for 'git commit-graph write --changed-paths
'(man) to be rather unpredictable.
To make this command behave more predictably, introduce '--max-new-filters=<n>
' to allow computing at most '<n>
' Bloom filters from scratch.
This lets "computing" already-known filters proceed quickly, while bounding the number of slow tasks that Git is willing to do.
git commit-graph
now includes in its man page:
With the --max-new-filters=<n>
option, generate at most n
new Bloom
filters (if --changed-paths
is specified).
If n
is -1
, no limit is enforced.
Only commits present in the new layer count against this limit.
To retroactively compute Bloom filters over earlier layers, it is advised to use --split=replace
.
With Git 2.31 (Q1 2021), optimization in "git blame
"(man)
See commit 8e16eff (17 Feb 2021) by Rafael Silva (raffs
).
(Merged by Junio C Hamano -- gitster
-- in commit 18decfd, 25 Feb 2021)
blame
: remove unnecessary use of get_commit_info()
Signed-off-by: Rafael Silva
Reviewed-by: Taylor Blau
When git blame
(man) --color-by-age
, the determine_line_heat()
is called to select how to color the output based on the commit's author date.
It uses the get_commit_info()
to parse the information into a commit_info
structure, however, this is actually unnecessary because the determine_line_heat()
caller also does the same.
Instead, let's change the determine_line_heat()
to take a commit_info
structure and remove the internal call to get_commit_info()
thus cleaning up and optimizing the code path.
Enabling Git's trace2 API in order to record the execution time for every call to determine_line_heat()
function:
+ trace2_region_enter("blame", "determine_line_heat", the_repository);
determine_line_heat(ent, &default_color);
+ trace2_region_enter("blame", "determine_line_heat", the_repository);
Then, running git blame
for "kernel/fork.c
" in linux.git and summing all the execution time for every call (around 1.3k calls) resulted in 2.6x faster execution (best out 3):
git built from 328c109303 (The eighth batch, 2021-02-12) = 42ms
git built from 328c109303 + this change = 16ms