For a detailed discussion of the computation of the similarity index, see Trying to understand `git diff` and `git mv` rename detection mechanism. Before you do that, though, take note of this:
Each commit is a complete, stand-alone snapshot. A snapshot is a tree of named files and named directories (or folders) containing more files and/or more directories.<sup>1</sup> Given a commit and a full path name `path/to/file.ext`, Git can extract the appropriate blob contents (as Git calls them) that hold the named file within that commit, without having to look at any other commits.
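As a rough illustration of "one commit plus one full path is enough", here is a toy Python model. The nested-dict layout and the `read_blob` helper are assumptions made up for this sketch; they mimic the idea of trees and blobs, not Git's actual storage format.

```python
# Toy model of one commit's snapshot: directories are dicts, files are bytes.
# This mirrors the *idea* of trees and blobs, not Git's on-disk format.
snapshot = {
    "README.txt": b"hello\n",
    "path": {"to": {"file.ext": b"file contents\n"}},
}

def read_blob(tree, path):
    """Walk the tree one path component at a time and return the file's bytes."""
    node = tree
    for part in path.split("/"):
        if not isinstance(node, dict) or part not in node:
            raise KeyError(path)
        node = node[part]
    if isinstance(node, dict):
        raise IsADirectoryError(path)
    return node

print(read_blob(snapshot, "path/to/file.ext"))
```

Nothing here consults any other snapshot: the lookup needs only the one tree and the path.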
Any time you ask Git about a snapshot for comparison purposes, you must give Git the hash IDs, or names or other strings that resolve to hash IDs, of two commits, that is, two snapshots. Git in effect extracts each snapshot, one at a time, and then compares the resulting trees of files. (Some commands, like `git show` and `git log -p`, figure out the parent hash by looking at the child commit, then compare parent and child in that order.)
Thus, Git is always looking at a pair of trees. The left-side (`a/`) tree might contain a `README.txt` and the right-side (`b/`) tree also contains a `README.txt`, for instance, while the left side contains `doc.txt` and the right side doesn't have a `doc.txt`. The left-side commit doesn't have `documentation.rst` and the right-side commit does have `documentation.rst`.
What Git does at this point is to match up files. Two files with the exact same pathname, such as the two `README.txt` files here, must be "the same" file, so Git looks at the contents of the left-side `README.txt` and the contents of the right-side `README.txt` to produce a diff of those two. The technical term for matching up such things is determining the identity of the files. (This is quite a feat in philosophical terms. See The Ship of Theseus for discussion. Unlike the philosophical arguments, in computing, we get a clear and concrete answer. Well, we do until we introduce things like Git's `-B` or break value in `git diff`, at least!)
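The name-matching step can be sketched in a few lines of Python. This is only a model: each snapshot is flattened to a hypothetical dict of full path to contents, and `pair_by_name` is a name invented for this example.

```python
def pair_by_name(left, right):
    """left/right: dict mapping full path -> file contents (bytes)."""
    same_name = sorted(left.keys() & right.keys())   # identity settled by name
    only_left = sorted(left.keys() - right.keys())   # look deleted, for now
    only_right = sorted(right.keys() - left.keys())  # look newly added, for now
    return same_name, only_left, only_right

left = {"README.txt": b"v1\n", "doc.txt": b"docs\n"}
right = {"README.txt": b"v2\n", "documentation.rst": b"docs\n"}

same, deleted, added = pair_by_name(left, right)
print(same, deleted, added)
```

Files in `same` get an ordinary content diff. The other two lists are the leftovers that rename detection, if enabled, gets to look at.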
Where there are no names to match up, though, such as `doc.txt` vs `documentation.rst`, Git computes a similarity index between each such pair of files, comparing the left side's files (which at this point seem to be removed when getting to the right side) to the right side's files (which right now seem to be new files). Well, that is, Git computes this index if you have turned on rename detection. Rename detection is off by default in Git versions prior to 2.9, and on by default from 2.9 onward. Git takes the best matches here, and pairs the files up: if `doc.txt` is sufficiently similar to `documentation.rst`, why then, those also must be "the same" file, even though they have different names.
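To make "take the best matches above a threshold" concrete, here is a minimal sketch. It is not Git's algorithm: the score comes from Python's `difflib.SequenceMatcher`, used purely as a stand-in for Git's own similarity index, and `detect_renames` is a hypothetical helper. The 50% threshold mirrors Git's default for rename detection.

```python
from difflib import SequenceMatcher

def detect_renames(deleted, added, threshold=0.5):
    """Pair each deleted path with its most similar added path.

    deleted/added: dict mapping path -> text contents.
    The score is a stand-in (difflib), NOT Git's real similarity index.
    """
    scores = []
    for old_path, old_text in deleted.items():
        for new_path, new_text in added.items():
            score = SequenceMatcher(None, old_text, new_text).ratio()
            if score >= threshold:
                scores.append((score, old_path, new_path))

    # Best scores first; each file may take part in at most one pairing.
    scores.sort(key=lambda s: (-s[0], s[1], s[2]))
    renames, used_old, used_new = [], set(), set()
    for score, old_path, new_path in scores:
        if old_path in used_old or new_path in used_new:
            continue
        renames.append((old_path, new_path, score))
        used_old.add(old_path)
        used_new.add(new_path)
    return renames

deleted = {"doc.txt": "line one\nline two\nline three\n"}
added = {
    "documentation.rst": "line one\nline two\nline 3\n",
    "LICENSE": "completely unrelated text",
}
renames = detect_renames(deleted, added)
for old_path, new_path, score in renames:
    print(f"rename {old_path} -> {new_path} ({score:.0%} similar)")
```

Note the nested loop: every leftover file on the left is compared against every leftover file on the right, which is where the quadratic cost comes from.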
Before Git even bothers with this similarity index trick, it does a first pass to find 100%-identical files. This is much easier than computing the similarity index due to the way Git stores content. Any such exact matches are paired up and taken out of the list of files that could potentially be paired up, leaving only files that don't have exact matches in what Git internally calls the rename queue. So similarity index computation is done only on files whose names are in the rename queue. This computation is relatively expensive (it's O(n²) in the number of files), so for fast `git show` or `git log -p`, it's a good idea to commit just the rename first, and then any changes to contents.
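The reason the exact-match pass is cheap is that identical contents get identical blob hash IDs, so no content comparison is needed. The sketch below computes blob IDs the way SHA-1 repositories do (a `blob <size>\0` header followed by the contents) and then pairs up identical files. The `exact_pass` helper and its return shape are assumptions for illustration, not Git's internal code.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Hash ID of a blob in a SHA-1 repository: header plus raw contents."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()

def exact_pass(deleted, added):
    """Pair byte-identical files by hash; return pairs plus the leftovers."""
    by_id = {}
    for path, content in added.items():
        by_id.setdefault(blob_id(content), []).append(path)

    exact, queue_old = [], {}
    for path, content in deleted.items():
        candidates = by_id.get(blob_id(content))
        if candidates:
            exact.append((path, candidates.pop(0)))
        else:
            queue_old[path] = content  # still needs similarity scoring

    paired_new = {new for _, new in exact}
    queue_new = {p: c for p, c in added.items() if p not in paired_new}
    return exact, queue_old, queue_new

deleted = {"doc.txt": b"docs\n", "old.c": b"int main(void) { return 0; }\n"}
added = {"documentation.rst": b"docs\n", "new.c": b"int main(void) { return 1; }\n"}

exact, queue_old, queue_new = exact_pass(deleted, added)
print(exact, sorted(queue_old), sorted(queue_new))
```

A pure rename, committed on its own, is fully resolved by this pass and never reaches the expensive pairwise comparison.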
<sup>1</sup> This is the internal representation. From the outside, you're not supposed to even know or care that Git has stored each directory as a tree entry. In particular, Git likes to claim that it stores only files (not directories), and Git makes it ridiculously hard to store an empty directory. To do so, Git would have to have an empty tree. It does have one, but if you try to use it, you get weird effects.