How is the Git similarity index calculated?

Question

I am new to Git and there is something that isn't clear to me. How does Git internally know whether a file is a new file or a modified file?
Git doesn't track files but tracks blobs, so is this related to the similarity index?

I also ran into a problem: when I move a file and modify it, Git sometimes recognizes it as a renamed file and sometimes as a new file.
For a small file, Git reports it as one deleted file and one new file.

How can I "trick" Git into marking this case as a moved file rather than a new file plus a deleted file (without making two separate commits, one for the move and another for the changes)?

Blobs are contents of files, and they're tied to the filenames by the tree objects. So Git definitely knows whether something is a new file or an edited file. Renaming OTOH is a different matter - since Git doesn't store rename operation, it has to guess. See [Linus](https://web.archive.org/web/20150209075907/http://permalink.gmane.org:80/gmane.comp.version-control.git/217) talk about this design. — Amadan, Jul 05 '18 at 07:38

score 10 · Answer 1 · answered Jul 05 '18 at 08:06

For a detailed discussion of the computation of the similarity index, see Trying to understand `git diff` and `git mv` rename detection mechanism. Before you do that, though, take note of this:

Each commit is a complete, stand-alone snapshot. A snapshot is a tree of named files and named directories (or folders) containing more files and/or more directories.¹ Given a commit and a full path name path/to/file.ext, Git can extract the appropriate blob contents (as Git calls them) that hold the named file within that commit, without having to look at any other commits.
Any time you ask Git about a snapshot for comparison purposes, you must give Git the hash IDs, or names or other strings that resolve to hash IDs, of two commits—two snapshots. Git in effect extracts each snapshot, one at a time, and then compares the resulting tree-of-files. (Some commands, like git show and git log -p, figure out the parent hash by looking at the child commit, then compare parent and child in that order.)

Thus, Git is always looking at a pair of trees: the left-side (a/) tree might contain a README.txt and the right-side (b/) also contains a README.txt, for instance, while the left-side contains doc.txt and the right-side doesn't have a doc.txt. The left-side commit doesn't have documentation.rst and the right-side does have documentation.rst.

What Git does at this point is to match up files. Two files with the exact same pathname, such as the two README.txt files here, must be "the same" file, so Git looks at the contents of the left-side README.txt and the contents of the right-side README.txt to produce a diff of those two. The technical term for matching up such things is determining the identity of the files. (This is quite a feat in philosophical terms. See The Ship of Theseus for discussion. Unlike the philosophical arguments, in computing, we get a clear and concrete answer. Well, we do until we introduce things like Git's -B or break value in git diff, at least!)
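As a rough illustration of this name-matching step, here is a minimal Python sketch. It assumes a toy model in which a snapshot is just a dict from path to contents; the function name `pair_by_path` and the file contents are made up for the example and are not anything Git itself exposes.

```python
# Toy model (an assumption for illustration): a snapshot maps path -> contents.
left = {
    "README.txt": "hello\n",
    "doc.txt": "usage notes\n",
}
right = {
    "README.txt": "hello world\n",
    "documentation.rst": "usage notes\n",
}


def pair_by_path(left, right):
    """Split two snapshots into same-path pairs, left-only and right-only paths."""
    same = sorted(left.keys() & right.keys())        # identity decided by name
    left_only = sorted(left.keys() - right.keys())   # look deleted, for now
    right_only = sorted(right.keys() - left.keys())  # look newly added, for now
    return same, left_only, right_only


same, deleted, added = pair_by_path(left, right)
print(same)     # paths to diff directly
print(deleted)  # rename-detection candidates on the left side
print(added)    # rename-detection candidates on the right side
```

Only the paths that end up in the left-only and right-only lists are candidates for rename detection.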

Where there are no names to match up, though, such as doc.txt vs documentation.rst, Git computes a similarity index between each such pair of files, comparing the left-side's files (which at this point seem to be removed when getting to the right-side) to the right-side's files (which right now seem to be new files). Well, that is, Git computes this index if you have turned on rename detection. Rename detection is off by default in Git versions before 2.9, and on by default from 2.9 onward. Git takes the best matches here, and pairs the files up: if doc.txt is sufficiently similar to documentation.rst, why then, those also must be "the same" file, even though they have different names.
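The "score every pair, keep the best matches" idea can be sketched in Python as below. This is only a sketch under stated assumptions: the score comes from `difflib.SequenceMatcher`, which is a stand-in and not Git's own similarity formula (the linked question covers the real computation), and `detect_renames` plus the sample files are hypothetical. The 0.5 threshold mirrors Git's default of 50% for rename detection.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in score in [0, 1]; this is NOT Git's own formula."""
    return SequenceMatcher(None, a, b).ratio()


def detect_renames(deleted, added, threshold=0.5):
    """Greedily pair each deleted path with its most similar added path.

    `deleted` and `added` map path -> contents. Every deleted file is scored
    against every added file, so the work grows with len(deleted) * len(added).
    """
    candidates = sorted(
        (
            (similarity(old_text, new_text), old, new)
            for old, old_text in deleted.items()
            for new, new_text in added.items()
        ),
        reverse=True,  # best scores first
    )
    renames, used_old, used_new = [], set(), set()
    for score, old, new in candidates:
        if score < threshold:
            break  # everything after this point is too dissimilar
        if old in used_old or new in used_new:
            continue  # one side is already paired with a better match
        renames.append((old, new, score))
        used_old.add(old)
        used_new.add(new)
    return renames


deleted = {"doc.txt": "line one\nline two\nline three\nline four\n"}
added = {
    "documentation.rst": "line one\nline two\nline 3\nline four\n",
    "notes.md": "completely different\n",
}
for old, new, score in detect_renames(deleted, added):
    print(f"{old} -> {new} ({score:.0%} similar)")
```

Files left unpaired after this step are reported as plain deletions and additions.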

Before Git even bothers with this similarity index trick, it does a first pass to find 100%-identical files. This is much easier than computing the similarity index due to the way Git stores content. Any such exact-matches are paired up and taken out of the list of files that could potentially be paired-up, leaving only files that don't have exact-matches in what Git internally calls the rename queue. So similarity index computation is done only on files whose names are in the rename queue. This computation is relatively expensive (it's O(n²) in the number of files), so for fast git show or git log -p, it's a good idea to commit just the rename first, and then any changes to contents.
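The exact-match pass is cheap because identical contents get identical blob IDs, so comparing IDs is enough. The sketch below assumes a SHA-1 repository, where a blob ID is the SHA-1 of the header `blob <size>\0` followed by the contents. The helper `pair_exact_matches` and the sample files are hypothetical; this shows the idea and is not Git's actual code.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Object ID Git assigns to a blob in a SHA-1 repository."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def pair_exact_matches(deleted, added):
    """Pair paths whose contents hash to the same blob ID.

    `deleted` and `added` map path -> bytes.
    Returns (exact_renames, remaining_deleted, remaining_added); the two
    "remaining" dicts are what would still need similarity scoring.
    """
    by_id = {}
    for path, content in added.items():
        by_id.setdefault(blob_id(content), []).append(path)

    exact, remaining_deleted = [], {}
    remaining_added = dict(added)
    for old, content in deleted.items():
        matches = by_id.get(blob_id(content), [])
        if matches:
            new = matches.pop(0)       # each added path is paired at most once
            exact.append((old, new))
            del remaining_added[new]
        else:
            remaining_deleted[old] = content
    return exact, remaining_deleted, remaining_added


deleted = {
    "doc.txt": b"usage notes\n",
    "old.c": b"int main(void) { return 0; }\n",
}
added = {
    "documentation.rst": b"usage notes\n",
    "new.c": b"int main(void) { return 1; }\n",
}
exact, rest_deleted, rest_added = pair_exact_matches(deleted, added)
print(exact)               # pairs found without comparing any contents
print(sorted(rest_deleted), sorted(rest_added))  # left for similarity scoring
```

In this sketch, only `old.c` and `new.c` are left for the more expensive pairwise comparison.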

¹This is the internal representation—from the outside, you're not supposed to even know or care that Git has stored each directory as a tree entry. In particular, Git likes to claim that it stores only files (not directories), and Git makes it ridiculously hard to store an empty directory. To do so, Git would have to have an empty tree—and it does, but if you try to use it, you get weird effects.

I was about to summon you and then I saw you had answered. "Oh mighty torek! I summon thee to smite incorrect knowledge" — evolutionxbox, Jul 05 '18 at 08:41
