Git's "similarity index" computation is not, as far as I know, documented anywhere other than in the source, starting with diffcore-delta.c.
To compute the similarity index for two files S (source) and D (destination), Git:
- reads both files
- computes a hash table of all of the chunks of file S
- computes a second hash table of all of the chunks of file D
The entries in these two hash tables are simply a count of occurrences of instances of that hash value (plus, as noted below, the length of the chunk).
The hash value for a file chunk is computed by:
- start at the current file offset (initially zero)
- read 64 bytes or until
'\n'
character, whichever occurs first
- if the file is claimed to be text and there is a
'\r'
before the '\n'
, discard the '\r'
- hash the resulting string-of-up-to-64 bytes using the algorithm shown in the linked file
Now that there are hash tables for both S and D, each possible hash hi appears nS times in S and nD in D (either may be zero, though the code skips right over both-zero hash values). If the number of occurrences in D is less than or the same as the number of occurrences in S—i.e., nD ≤ nS—then D "copies from S" nD times. If the number of occurrences in D exceeds the number in S (including when the number in S is zero), then D has a "literal add" of nD - nS occurrences of the hashed chunk, and D also copies all nS original occurrences as well.
Each hashed chunk retains its number-of-input-bytes, and these multiply the number of copies or number of additions of "chunks" to get the number of bytes copied or added. (Deletions, where D lacks items that exist in S, have only indirect effect here: the byte copy and add counts get smaller, but Git does not specifically count the deletions themselves.)
These two values (src_copied
and literal_added
) computed in diffcore_count_changes
are handed over to function estimate_similarity
in diffcore-rename.c
. It completely ignores the literal_added
count (this count is used in deciding how to build packfile deltas, but not in terms of rename scoring). Instead, only the src_copied
number matters:
score = (int)(src_copied * MAX_SCORE / max_size);
where max_size
is the size in bytes of larger of the two input files S and D.
Note that there is an earlier computation:
max_size = ((src->size > dst->size) ? src->size : dst->size);
base_size = ((src->size < dst->size) ? src->size : dst->size);
delta_size = max_size - base_size;
and if the two files have changed size "too much":
if (max_size * (MAX_SCORE-minimum_score) < delta_size * MAX_SCORE)
return 0;
we never even get into the diffcore-delta.c
code to hash them. The minimum_score
here is the argument to -M
or --find-renames
, converted to a scaled number. MAX_SCORE
is 60000.0
(type double
), so the default minimum_score
, when you use the default -M50%
, is 30000 (half of 60000). Except for the case of CR-before-LF eating, though, this particular shortcut should not affect the outcome of the more expensive similarity computation.
[Edit: this is now obsolete:] git status
always uses the default. There is no knob to change the threshold (nor the number of files allowed in the rename-finding queue). If there were the code would go here, setting the rename_score
field of the diff options. Until Git version 2.18.0, there was no way to control this for git status
. In Git 2.18.0 and later, git status
has the same --find-renames
option as git diff
. The status.renames
option in the Git configuration enables any default detection, and if unset, git status
obeys the diff.renames
setting; see the git config
documentation and the git status
documentation.