What's git's heuristic for assigning content modifications to file paths?

Question

Short version:

short of poring over git's source code, where can I find a full description of the heuristics that git uses to associate chunks of content with specific tracked pathnames?

Detailed version:

In the (Unix) shell demo interaction below, two files, a and b, are "git-commit'ted", then they are modified so as to (effectively) transfer most of a's content to b, and finally the two files are once more commited.

The key thing to look for is that the output of the second git commit ends with the line

rename a => b (99%)

even though no renaming of files (in the usual sense) ever took place (!?!).

Before showing the demo, this brief description may make it easier to follow.

The contents of the files a and b are generated by combining the contents of the three auxiliary files, ../A, ../B, and ../C. Symbolically, the states of a and b could be represented as

../A + ../C -> a
../B        -> b

right before the first commit, and

../A        -> a
../B + ../C -> b

right before the second one.

OK, here's the demo.

First, we display the contents of auxiliary files ../A, ../B, and ../C:

head ../A ../B ../C
# ==> ../A <==
# ...
# 
# ==> ../B <==
# ###
# 
# ==> ../C <==
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================

(Lines beginning with # correspond to output to the terminal; the actual output lines do not have the leading #.)

Next, we create files a and b, display their contents, and commit them

cat ../A ../C > a
cat ../B      > b
head a b
# ==> a <==
# ...
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# 
# ==> b <==
# ###

git add a b
git commit --allow-empty-message -m ''
# [master (root-commit) 3576df7] 
#  2 files changed, 8 insertions(+)
#  create mode 100644 a
#  create mode 100644 b

Next, we modify files a and b, and display their new contents:

cat ../A      > a
cat ../B ../C > b
head a b
# ==> a <==
# ...
#
# ==> b <==
# ###
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================

Finally, we commit the modified a and b; note the output of git commit:

git add a b
git commit --allow-empty-message -m ''
# [master 25b806f] 
#  2 files changed, 2 insertions(+), 8 deletions(-)
#  rewrite a (99%)
#  rename a => b (99%)

I rationalize this behavior as follows.

As I understand it, git treats directory structure info (such as the pathnames of the files it's tracking) as secondary information—or metadata, if you will—, to be associated with the primary information it tracks, namely various chunks of content.

Since both the contents as well as the names (including pathnames) of files may change between commits, git must use heuristics to associate pathnames to chunks of content. But heuristics, by their very nature, are not guaranteed to work 100% of the time. A failure of such heuristics here takes the form of a history that does not faithfully represent what actually happened (e.g. it reports a file renaming even though no file was renamed, in the usual sense).

A further confirmation of this interpretation (namely, that some heuristics are at play) is that, AFAICT, if the size of the transferred chunk is not sufficiently large, the output of git commit will not include the rewrite/rename lines. (I include a demonstration of this case at the end of this post, FWIW.)

My question is this: short of poring over git's source code, where can I find a full description of the heuristics that git uses to associate chunks of content with specific tracked pathnames?

This second demo is identical to the first one in every way, except that the auxiliary file ../C is one line shorter than before.

head ../A ../B ../C
# ==> ../A <==
# ...
# 
# ==> ../B <==
# ###
# 
# ==> ../C <==
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================

cat ../A ../C > a
cat ../B      > b
head a b
# ==> a <==
# ...
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# 
# ==> b <==
# ###

git add .
git commit -a --allow-empty-message -m ''
# [master (root-commit) a06a689] 
#  2 files changed, 7 insertions(+)
#  create mode 100644 a
#  create mode 100644 b

cat ../A      > a
cat ../B ../C > b
head a b
# ==> a <==
# ...
# 
# ==> b <==
# ###
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================

git add .
git commit -a --allow-empty-message -m ''
# [master 87415a1] 
#  2 files changed, 5 insertions(+), 5 deletions(-)

Note that similarity of files has no effect whatsoever on commits. Commits are *snapshots* of the working directory, where the difference to the previous versions does not matter. The similiarity is solely generated by the porcelain commands on output. — poke, Jan 22 '14 at 20:06

Edward Thomson · Answer 1 · 2014-01-22T21:59:06.260

9

As you noticed, Git performs rename detection using a heuristic, rather than being told that a rename occurred. The git mv command, in fact, simply stages an add on the new file path and a remove of the old file path. Thus, rename detection is performed by comparing the contents of added files to the previously committed contents of deleted files.

First, candidates are collected. Any new files are possible rename targets and any deleted files are possible rename sources. In addition, rewriting changes are broken such that a file that is more than 50% different than its previous revision is both a possible rename source and a possible rename target.

Next, identical renames are detected. If you rename a file without making any changes, then the file will hash identically. These can be detected just performing comparisons of the hash in the index without reading the file contents, so removing these from the candidate list will reduce the number of comparisons you need to perform.

Finally, the similarity comparison is performed. Each line in each candidate file is hashed and collected in a sorted list. Long lines are split at 60 characters. Whitespace only lines may be stripped on the assumption that they don't contribute greatly to the similarity matching. The line hashes from each candidate source are compared to the line hashes from each candidate target. If two lists are 60% similar, they are deemed a rename.

edited Jan 22 '14 at 21:59

answered Jan 22 '14 at 20:12

Edward Thomson

74,857
14
158
187

So it's always by source lines? (Assuming a non-binary diff anyway.) (This *is* a question; I don't know the answer! :-) Not sure that it matters that much either, just curious.) – torek Jan 22 '14 at 20:14
Yes, always by lines, at least in git.git. github.com/libgit2/libgit2 allows for a pluggable similarity detection, though the default implementation uses the core git strategy. – Edward Thomson Jan 22 '14 at 20:16
The splitting of lines, by the way, should facilitate binary rename detection as well! – Edward Thomson Jan 22 '14 at 20:19
1

Assuming they split well, perhaps. A straight rename is much easier to detect, as the SHA-1 for the blobs will match. (I have not looked but I assume this is done first to reduce candidate counts for remaining rename detections.) – torek Jan 22 '14 at 20:21
@torek Definitely so - thank you for reminding me of this very important detail. – Edward Thomson Jan 22 '14 at 21:55

score 2 · Answer 2 · answered Jan 22 '14 at 20:12

... short of poring over git's source code, where can I find a full description of the heuristics that git uses to associate chunks of content with specific tracked pathnames?

Depending on what you mean by "full" I don't think you can find such a thing. (In particular, how are "percentages" calculated? Is it by lines, or characters/bytes, or something else? Does doing a word-oriented diff change things?) But the magic is all inside git diff, where it is computed dynamically every time a diff is to be shown; and the heuristics have several control knobs that give strong clues:

--no-renames

Turn off rename detection, even when the configuration file gives the default to do so.

-B[<n>][/<m>], --break-rewrites[=[<n>][/<m>]]

Break complete rewrite changes into pairs of delete and create. This serves two purposes:

It affects the way a change that amounts to a total rewrite of a file not as a series of deletion and insertion mixed together with a very few lines that happen to match textually as the context, but as a single deletion of everything old followed by a single insertion of everything new, and the number m controls this aspect of the -B option (defaults to 60%). -B/70% specifies that less than 30% of the original should remain in the result for Git to consider it a total rewrite (i.e. otherwise the resulting patch will be a series of deletion and insertion mixed together with context lines).

When used with -M, a totally-rewritten file is also considered as the source of a rename (usually -M only considers a file that disappeared as the source of a rename), and the number n controls this aspect of the -B option (defaults to 50%). -B20% specifies that a change with addition and deletion compared to 20% or more of the file's size are eligible for being picked up as a possible source of a rename to another file.

and so on; see the documentation for git-diff.

By lines. Though long lines get split at some length (60 chars? I forget, offhand) — Edward Thomson, Jan 22 '14 at 20:14

What's git's heuristic for assigning content modifications to file paths?

2 Answers2

Linked