Both other people and I have described this in different (and no links) detail elsewhere, e.g., this answer to What's git's heuristic for assigning content modifications to file paths? or my answer to Git Diff of same files in two directories always result in "renamed". The details are slightly different for git log --follow
than they are for git diff
, since git diff
usually deals with an entire tree-full of left and right side files, but git log --follow
only works with one particular path.1
In any case, the rename-following happens when comparing two specific commits. For a general git diff
they are any two commits R (right side) and L (left side—you choose the two),2 but for git log
they are specifically parent and child. Let's call these P and C for convenience. With git log --follow
, Git runs an after-diff step (called from diff_tree_sha1
; see footnotes) that trims everything down to one file. The diff is done with R=C and L=P. The general case is actually easier to describe, though, so we'll start with that.
Normally, when comparing R vs L, Git:
- matches up all the tree files that have the same full path name, then
- puts the remaining files (paths) into a pairing queue.
You can modify this a bit with the -B
(pair-breaking) flag, which actually takes two optional integers (-Bn/m
). Only the n
integer matters for rename detection.3 You can modify it with the -C
flag as well; this takes only an optional n
, and turns on copy detection. In all cases rename detection must be turned on. Rename detection is enabled via -M
, which likewise takes an optional integer n
, or automatically in the case of git log --follow
and other commands like git status
or the post-merge git diff --stat
.
In any case, the integer n
here is a similarity (or dissimilarity) metric value for all of these various options. This is where we get to the meat of the rename detection code.
Suppose that we have, to start with, a basic git diff <commit1> <commit2>
or git diff <tree1> <tree2>
operation. This winds up calling builtin_diff_tree
in builtin/diff.c
, which calls diff_tree_sha1
(which we'll see again later) and then log_tree_diff_flush
in log-tree.c
. This almost immediately calls diffcore_std
in diff.c
, which runs the the diffcore_break
, diffcore_rename
, and diffcore_merge_broken
functions if the right (-B
, -M
or -C
, and -B
again) options are selected.
These three functions operate on a pairing queue. How does the pairing queue get set up? I'll leave that to another section since it's complicated. For now, just assume that the pairing queue already has path/to/file
matched with path/to/file
when there's a path/to/file
in both L and R, and otherwise has an unpaired path/to/L-only
and path/to/R-only
for cases where there's a file-path that occurs only in L or only in R.
The diffcore_break
function is in diffcore-break.c
. Its job is to find already-paired files whose dissimilarity index (when comparing the L and R versions) is above some threshold. If that's the case, it breaks the pairing. The diffcore_merge
function is just below it in the same file; it rejoins a broken-up pair if neither half has found a "better mate". The dissimilarity index computation is similar to, but not the same as, the similarity computation.4
Similarity detection
The more interesting diffcore_rename
function is in diffcore-rename.c
. It has a special case shortcut for --follow
that we can ignore for now. It then looks for exact renames, i.e., files whose blob hashes match, even though their names don't. There are some fiddly bits for using "the next file" if multiple L sources have the same hash as some unpaired R destination, too.
Next, it checks how many unpaired entries there are, because it is going to (in effect) do num(L) times num(R) comparisons of files to compute their similarities, and this is going to take a lot of time and space. It will even automatically downgrade a --find-copies-harder
case that is "too hard". Then, for each possible L and R pairing, it computes a similarity index and a name score.
The similarity index code is in estimate_similarity
in diffcore-rename.c
. It relies on the function diffcore_count_changes
in diffcore-delta.c
, which says this (I'm copying it straight from the file since it's one of the core metrics):
* Idea here is very simple.
*
* Almost all data we are interested in are text, but sometimes we have
* to deal with binary data. So we cut them into chunks delimited by
* LF byte, or 64-byte sequence, whichever comes first, and hash them.
*
* For those chunks, if the source buffer has more instances of it
* than the destination buffer, that means the difference are the
* number of bytes not copied from source to destination. If the
* counts are the same, everything was copied from source to
* destination. If the destination has more, everything was copied,
* and destination added more.
*
* We are doing an approximation so we do not really have to waste
* memory by actually storing the sequence. We just hash them into
* somewhere around 2^16 hashbuckets and count the occurrences.
There's a secret bit here though: the similarity index ignores \r
characters if the file is considered "not binary" and the \r
is immediately followed by \n
.
The final similarity index score is:
score = (int)(src_copied * MAX_SCORE / max_size);
where src_copied
is the number of hashed chunks (of 64 bytes or up-to-newline) that occurred in the source and then occurred again in the destination, and max_size
is the size, in bytes, of whichever blob is larger. (This byte count does not account for stripped '\r'
characters. Those are merely removed from the 64-or-up-to-newline chunks being hashed.)
The "name score" is really just 1 (same base name) or 0 (different base name), i.e., 1 if the L file is dir/oldbase
and the R file is differentdir/oldbase
, but 0 if the L file is dir/oldbase
and the R file is anything/newbase
. This is used to make Git favor newdir/oldbase
over anything/newbase
when those two files are equally similar.
Creating the pairing queue
The diff_tree_sha1
code calls (through a series of functions) ll_diff_tree_paths
(both are in in tree-diff.c
; I linked only to the final function here). This is a complicated and extremely optimized bit of code (Git spends a lot of time here) so we'll just do a quick overview and ignore the complications (see footnote 2). This code looks partly at the full path names of each blob in each tree (these are the P1,...,Pn items in the comment at the top), and partly at the blob hashes for each of these names. For files that have the same name and the same contents, it does nothing (except in --find-copies-harder
mode in which case it queues all file names). For files that have the same name and different contents, or no L or R name, it calls (through function pointers, stored in opt->pathchange
, opt->change
, and opt->add_remove
) what eventually boil down to diff_change
or diff_addremove
, both in diff.c
. These call diff_queue
, which put the file pair (one of which is a dummy if the file is new or removed) into the pairing queue.
Hence, the short version (if we're not using -C
or --find-copies-harder
), the pairing queue has unpaired files only when there is no original source file in L corresponding to a file in R, or no destination file in R corresponding to a source file in L. With -C
, it has every source file, or every modified source file, listed as well so that they can be scanned for copies (the choice here being based on whether you used --find-copies-harder
).
Reducing this to one file, for --follow
We already noted a shortcut in the diffcore-rename.c
code: it skips over all R file names that are not the one file name we care about. There seem to be some similar hacks in ll_diff_tree_paths
, although I am not sure whether they apply here. The code is also driven in a different way, as noted in footnote 2. When we diff parent P vs child C, and find a rename left in our pairing queue, we then switch out the name of the file we're using as a restriction in our git log -- <path>
: we replace the new name in C with the path for the rename-source in P. Then we just continue diffing as usual, so the next time we compare a P-and-C pair, we're looking for oldpath
instead of newpath
. If we detect oldpath
as being renamed from reallyoldpath
, we switch that name into place again, as before.
Note that all the -B
, -C
, and -M
machinery applies in theory, but the shortcuts may—it's not at all clear to me whether they do—keep some of it from working.
1When using --follow
, Git uses the general diffcore code to run both pair-breaking and copy-detection. The general code is called from the code that wants to do the simplification. See the function try_to_follow_renames
in tree-diff.c
, which calls diffcore_std
in diff.c
. This all eventually calls diff_resolve_rename_copy
which handles the pairing queues. Then try_to_follow_renames
trims the result down to the one interesting file; this is later tested via diff_might_be_rename
as called from diff_tree_sha1
. I think this is all driven from log_tree_commit
, called from either cmd_log_walk
or log_show_early
. This last appears to be an undocumented hack meant for use by some GUI(s).
2The tree matching in git diff
actually accepts a single commit on the output right-hand side, and a list of commits on the input left-hand side, for combined diff purposes. This is how Git manages to show merge commits. It's a bit unclear how --follow
works with merge commits though. See find_paths_generic
in combine-diff.c, which calls diff_tree_sha1
as well. Note that the log --follow
hack happens as a result of calling diff_tree_sha1
, and this combined-diff merge-handling code calls that function once per parent. If the followed name will be changed, though, it has been changed by the time it goes through the second parent. Perhaps this is a bug. What happens if that second parent decides the new name results in another, different rename? Logically, it should choose up to one new name per parent-fork, working in topological order, and consider resolving them again somehow if and when the forks rejoin.
3The second, m
, value in -Bn/m
tells Git when not to run a real diff, and instead just describe the change in a non-renamed file as "delete all the original lines, replace them with all the new lines". That assumes that either the first -B
value ended up not breaking the pairing, or the pairing was re-glued-together due to the -M
value, or glued to a different source as a -C
copy.
4See should_break
for details. This also uses the diffcore-delta.c
code, but in a different way, using the "added" count.