Short version:
Short of poring over git's source code, where can I find a full description of the heuristics that git uses to associate chunks of content with specific tracked pathnames?
Detailed version:
In the (Unix) shell demo interaction below, two files, a and b, are committed with git commit; they are then modified so as to (effectively) transfer most of a's content to b, and finally the two files are committed once more.
The key thing to look for is that the output of the second git commit ends with the line
rename a => b (99%)
even though no renaming of files (in the usual sense) ever took place (!?!).
Before showing the demo, this brief description may make it easier to follow.
The contents of the files a and b are generated by combining the contents of the three auxiliary files ../A, ../B, and ../C. Symbolically, the states of a and b could be represented as
../A + ../C -> a
../B -> b
right before the first commit, and
../A -> a
../B + ../C -> b
right before the second one.
OK, here's the demo.
First, we display the contents of the auxiliary files ../A, ../B, and ../C:
head ../A ../B ../C
# ==> ../A <==
# ...
#
# ==> ../B <==
# ###
#
# ==> ../C <==
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
(Lines beginning with # correspond to output to the terminal; the actual output lines do not have the leading #.)
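For anyone who wants to reproduce this, the auxiliary files could be recreated roughly as follows. This is only a sketch: the width of 65 `=` characters is my reading of the listing above (an assumption), and the scratch directory `$work` is a hypothetical stand-in for wherever the real parent directory lives.

```shell
# Scratch layout: $work holds the auxiliary files and $work/repo is the
# working tree, so ../A, ../B and ../C resolve as they do in the demo.
work=$(mktemp -d)
mkdir "$work/repo"
cd "$work/repo" || exit 1

# One-line auxiliary files.
printf '...\n' > ../A
printf '###\n' > ../B

# A single line of 65 '=' characters (width assumed from the listing).
line=$(printf '%65s' '' | tr ' ' '=')

# ../C is six copies of that line.
for i in 1 2 3 4 5 6; do
    printf '%s\n' "$line"
done > ../C

head ../A ../B ../C
```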
Next, we create files a and b, display their contents, and commit them:
cat ../A ../C > a
cat ../B > b
head a b
# ==> a <==
# ...
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
#
# ==> b <==
# ###
git add a b
git commit --allow-empty-message -m ''
# [master (root-commit) 3576df7]
# 2 files changed, 8 insertions(+)
# create mode 100644 a
# create mode 100644 b
Next, we modify files a and b, and display their new contents:
cat ../A > a
cat ../B ../C > b
head a b
# ==> a <==
# ...
#
# ==> b <==
# ###
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
Finally, we commit the modified a and b; note the output of git commit:
git add a b
git commit --allow-empty-message -m ''
# [master 25b806f]
# 2 files changed, 2 insertions(+), 8 deletions(-)
# rewrite a (99%)
# rename a => b (99%)
I rationalize this behavior as follows.
As I understand it, git treats directory structure info (such as the pathnames of the files it's tracking) as secondary information (metadata, if you will) to be associated with the primary information it tracks, namely various chunks of content.
Since both the contents and the names (including pathnames) of files may change between commits, git must use heuristics to associate pathnames with chunks of content. But heuristics, by their very nature, are not guaranteed to work 100% of the time. A failure of such heuristics here takes the form of a history that does not faithfully represent what actually happened (e.g. it reports a file renaming even though no file was renamed, in the usual sense).
A further confirmation of this interpretation (namely, that some heuristics are at play) is that, AFAICT, if the transferred chunk is not sufficiently large, the output of git commit will not include the rewrite/rename lines. (I include a demonstration of this case at the end of this post, FWIW.)
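One way to poke at this from the command line is to ask git diff for the same kind of summary with the detection switched on and off. The sketch below rebuilds the first demo in a hypothetical scratch directory and compares the two commits. `-M` (find renames), `-B` (break rewrites), `--no-renames` and `--summary` are documented diff options, but that git commit's own summary corresponds to `-M -B` with default thresholds is my assumption, as are the 65-character line width and the throwaway `demo` identity.

```shell
# Rebuild the first demo in a scratch directory (hypothetical layout).
work=$(mktemp -d)
mkdir "$work/repo" && cd "$work/repo" || exit 1
printf '...\n' > ../A
printf '###\n' > ../B
line=$(printf '%65s' '' | tr ' ' '=')   # width of 65 is assumed
for i in 1 2 3 4 5 6; do printf '%s\n' "$line"; done > ../C

git init -q .
# Helper: commit with an empty message and a throwaway identity.
commit() {
    git -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty-message -m ''
}

# First state: ../A + ../C -> a, ../B -> b
cat ../A ../C > a; cat ../B > b
git add a b && commit

# Second state: ../A -> a, ../B + ../C -> b
cat ../A > a; cat ../B ../C > b
git add a b && commit

# Summary with rename detection (-M) and rewrite detection (-B) enabled.
with_detection=$(git diff -M -B --summary HEAD~1 HEAD)
# The same comparison with rename detection disabled.
without_detection=$(git diff --no-renames --summary HEAD~1 HEAD)

printf 'with -M -B:\n%s\n\nwith --no-renames:\n%s\n' \
    "$with_detection" "$without_detection"
```

Both options accept an explicit threshold (for example `-M90%`), which makes it possible to see how the reported pairing depends on the similarity cutoff.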
My question is this: short of poring over git's source code, where can I find a full description of the heuristics that git uses to associate chunks of content with specific tracked pathnames?
This second demo is identical to the first one in every way, except that the auxiliary file ../C is one line shorter than before.
head ../A ../B ../C
# ==> ../A <==
# ...
#
# ==> ../B <==
# ###
#
# ==> ../C <==
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
cat ../A ../C > a
cat ../B > b
head a b
# ==> a <==
# ...
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
#
# ==> b <==
# ###
git add .
git commit -a --allow-empty-message -m ''
# [master (root-commit) a06a689]
# 2 files changed, 7 insertions(+)
# create mode 100644 a
# create mode 100644 b
cat ../A > a
cat ../B ../C > b
head a b
# ==> a <==
# ...
#
# ==> b <==
# ###
# =================================================================
# =================================================================
# =================================================================
# =================================================================
# =================================================================
git add .
git commit -a --allow-empty-message -m ''
# [master 87415a1]
# 2 files changed, 5 insertions(+), 5 deletions(-)
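To see where the behavior flips, the two demos can be generalized into a loop over the number of lines in ../C. This is a rough sketch under stated assumptions: `try_lines` is a hypothetical helper, the line width of 65 is read off the listings, and I am assuming that `git diff -M -B --summary` reports the same rewrite/rename lines that git commit prints.

```shell
# try_lines N: replay the demo with an N-line ../C in a fresh scratch
# repository and print the rewrite/rename summary for the second commit.
try_lines() {
    n=$1
    work=$(mktemp -d)   # scratch directories are left behind under the temp dir
    mkdir "$work/repo" && cd "$work/repo" || return 1

    printf '...\n' > ../A
    printf '###\n' > ../B
    line=$(printf '%65s' '' | tr ' ' '=')   # width of 65 is assumed
    : > ../C
    i=0
    while [ "$i" -lt "$n" ]; do
        printf '%s\n' "$line" >> ../C
        i=$((i + 1))
    done

    git init -q .
    cat ../A ../C > a; cat ../B > b
    git add a b
    git -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty-message -m ''

    cat ../A > a; cat ../B ../C > b
    git add a b
    git -c user.name=demo -c user.email=demo@example.com \
        commit -q --allow-empty-message -m ''

    git diff -M -B --summary HEAD~1 HEAD
}

# Each call runs in a subshell (command substitution), so the cd inside
# try_lines does not affect the caller.
for n in 3 4 5 6 7 8; do
    summary=$(try_lines "$n")
    printf '%s lines in ../C: %s\n' "$n" "${summary:-(no rewrite/rename lines)}"
done
```

Varying the line width instead of the line count would show whether the cutoff depends on the number of lines or on the number of bytes.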