Git does not store rename operations. Git detects them.
To understand how and why this works—which then lets you understand how, when, and why it doesn't work as well—we need to start at the beginning, with the idea that each commit stores two things: a full snapshot of every file (no folders, just files), and some metadata.
The metadata in any one ordinary commit contains the hash ID of its parent commit (singular: the fact that it has one parent is what makes it an "ordinary commit"). Hence, given any one commit:
git show HEAD
for instance, Git can reach into that commit's metadata and pull out the hash ID of its parent commit. Using that hash ID, Git can reach into its all-commits-database and pull out the files from the old (parent) commit. Using the hash ID that HEAD resolves to:
git rev-parse HEAD
Git can pull out all the files from the new (child) commit.
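If you want to see the parent link for yourself, here is a minimal sketch in a throwaway repository. The file name, commit messages, and the `demo@example.com` identity are all made up for the demonstration; the only point is that the `parent` line in the raw commit object names the same commit that `HEAD^` resolves to.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo User"

echo "version one" > f1
git add f1
git commit -q -m "first commit"
echo "version two" > f1
git commit -q -a -m "second commit"

# Raw commit object: a "tree" line (the snapshot) plus metadata,
# including one "parent" line for an ordinary commit.
git cat-file -p HEAD

# The parent recorded in the metadata versus what HEAD^ resolves to.
parent_from_metadata=$(git cat-file -p HEAD | sed -n 's/^parent //p')
parent_from_rev_parse=$(git rev-parse HEAD^)
echo "metadata:  $parent_from_metadata"
echo "rev-parse: $parent_from_rev_parse"
```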
Because Git stores files in a special de-duplicated format (which saves a lot of space in the repository), Git can nearly instantly detect files that have exactly the same content. This means that if the old snapshot has three files f1, f2, and f3, and the new snapshot has three files but they're named f1, f2-renamed, and f3, and the content of all three files is 100% identical, Git almost instantly knocks out f1 and f3 as boring, because they have the same names and the same content, and we're only interested in things that changed.
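The de-duplication is easy to observe: a file's content is stored under an ID derived from that content, so identical content under two different names gets the same ID. The sketch below assumes a scratch repository with a made-up file; it only compares the two blob IDs.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo User"

printf 'same content\n' > f2
git add f2
git commit -q -m "add f2"
git mv f2 f2-renamed
git commit -q -m "rename f2"

# Ask for the blob ID stored under each name in each snapshot.
old_blob=$(git rev-parse HEAD^:f2)
new_blob=$(git rev-parse HEAD:f2-renamed)
echo "old: $old_blob"
echo "new: $new_blob"
```

Since the two IDs match, checking "is this the same content?" is just a comparison of two hash IDs, with no need to read the file data.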
So, what seems to have changed between parent and child commits, at this point, is that file f2 was deleted, and a new file named f2-renamed was added. The git diff or git show command would show this as:
D f2
A f2-renamed
and in fact if you disable rename detection, using git diff --no-renames or git show --no-renames, this is what you will see.
In Git versions predating 2.9, no rename detection is the default; in 2.9 and later, --find-renames=50% is the default. As you might then guess, the --find-renames=xx% directive tells Git: Any time you see some file deleted and a new, differently-named file created, consider the pair of names as candidates for rename detection, and call it a rename if the contents are at least xx% similar.
The exact details of rename detection have changed somewhat over the years, but there's one fairly constant part here. Because the de-duplication can nearly instantly detect 100%-identical copies of file contents, there's a first pass to check for this. In this case that first pass finds that f2 and f2-renamed have the exact same content. That's a 100% match and no matter how high you set the --find-renames=xx% number, 100% is good enough, so this D-and-A pair becomes one R100. And that's what you see here:
R100 this_is_the_old_filename.txt this_is_the_new_filename.txt
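Here is a small sketch of the f1 / f2 / f3 scenario, run in a scratch repository so that both views can be compared side by side. The file contents are invented; the flags are the ones discussed above, passed explicitly so the result does not depend on your diff.renames setting.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo User"

printf 'one\n' > f1
printf 'two\n' > f2
printf 'three\n' > f3
git add f1 f2 f3
git commit -q -m "three files"
git mv f2 f2-renamed
git commit -q -m "rename f2, content untouched"

# Rename detection off: a delete and an add.
without_renames=$(git diff --name-status --no-renames HEAD^ HEAD)
# Rename detection on: the exact-content pass pairs them up.
with_renames=$(git diff --name-status --find-renames HEAD^ HEAD)

echo "--no-renames:"
echo "$without_renames"
echo "--find-renames:"
echo "$with_renames"
```

Note that f1 and f3 never show up in either listing: same name, same content, nothing to report.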
Git will, however, work harder if necessary. Suppose that besides the A-and-D, you also modified the contents, but only slightly. Git now computes a similarity index. This computation is fairly expensive, especially in a large repository with a lot of delete-and-adds that could be paired up, so Git first pairs up and removes all the 100%-matches.
The actual value for the similarity index is rather complicated and is not documented. See Trying to understand `git diff` and `git mv` rename detection mechanism for my analysis. Git will "pair up" any A/D pairs that meet the minimum specified threshold, using the "best match". There's also a bonus for, and now some strategic optimization directed at, carrying multiple renames in a single "directory". (Remember, Git doesn't really store directories / folders, but it knows about them, and someone finally decided that Git should have some shortcuts in case a whole directory is renamed. It used to just give a one-percentage-point bump if the last component of the path name matched, and I have not looked to see what it does now.)
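To watch the inexact pass at work, here is a sketch that renames a file and edits one of its twenty lines in the same commit. The file contents and the 99% threshold are arbitrary choices for the demonstration. I am deliberately not predicting the exact score, since the computation is undocumented; the only expectation is that it lands well above 50% and below 100%.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo User"

# Build a 20-line file.
i=1
while [ "$i" -le 20 ]; do
  echo "line $i of the original file"
  i=$((i + 1))
done > f2
git add f2
git commit -q -m "add f2"

# Rename it and change exactly one line in the same commit.
git mv f2 f2-renamed
sed 's/^line 7 .*/this line was edited/' f2-renamed > f2-renamed.tmp
mv f2-renamed.tmp f2-renamed
git commit -q -a -m "rename f2 and tweak one line"

# Default threshold (50%): still paired as a rename, with a score below 100.
default_status=$(git diff --name-status --find-renames HEAD^ HEAD)
# A threshold above the actual similarity: back to a delete plus an add.
strict_status=$(git diff --name-status --find-renames=99% HEAD^ HEAD)
score=$(printf '%s\n' "$default_status" | cut -f1)

echo "default: $default_status"
echo "99%:"
echo "$strict_status"
```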
git diff --name-status -C HEAD^..HEAD
Aside from the -C here, this is basically the same as the git show example I used above. Git finds the hash ID for HEAD^ and the hash ID for HEAD and compares the two snapshots. The -C flag tells Git that it should also search for "copies"; this searching is expensive, so by default it's limited, but you can add --find-copies-harder or multiple -C options. This too uses the similarity-index values. In your case, since the rename is quickly found and removed from the rename-and-copy-candidates queues, the -C flag probably has no effect at all (this depends on whether there are other files detected as Added).
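The "limited by default" part is easiest to see with a copy whose source was not touched in the same commit. In this sketch (scratch repository, invented file names), a plain -C does not consider the unmodified a.txt as a copy source, while --find-copies-harder does.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo User"

printf 'alpha\nbeta\ngamma\ndelta\n' > a.txt
git add a.txt
git commit -q -m "add a.txt"

# Copy the file; a.txt itself is left unmodified in this commit.
cp a.txt b.txt
git add b.txt
git commit -q -m "copy a.txt to b.txt"

# Plain -C only looks at files changed in this commit for copy sources.
plain_copy=$(git diff --name-status -C HEAD^ HEAD)
# --find-copies-harder also inspects unmodified files (more expensive).
harder_copy=$(git diff --name-status -C --find-copies-harder HEAD^ HEAD)

echo "-C:                          $plain_copy"
echo "-C --find-copies-harder:     $harder_copy"
```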
Potential surprises
024d72b..HEAD
For git log or git rev-list, this kind of two-dot range expression "means" HEAD ^024d72b. In fact, many other commands handle this similarly: for instance, git cherry-pick will use a range expression to find multiple commits to copy. The commits found by this kind of range expression are those reachable from the "positive" reference HEAD, excluding any commits reachable from the "negative" reference ^024d72b.
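A quick sketch of the two-dot form, using a three-commit linear history in a scratch repository. The variable `first` stands in for a hash like 024d72b; everything else here is invented for the demonstration.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo User"

echo 1 > file
git add file
git commit -q -m "one"
first=$(git rev-parse HEAD)
echo 2 > file
git commit -q -a -m "two"
echo 3 > file
git commit -q -a -m "three"

# The range form and the explicit positive/negative form.
two_dot=$(git rev-list "$first"..HEAD)
explicit=$(git rev-list HEAD "^$first")
count=$(git rev-list --count "$first"..HEAD)

echo "$two_dot"
echo "commits in range: $count"
```

The first commit itself is excluded: it is reachable from the negative reference, so only "two" and "three" are selected.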
There's a similar notation using three dots, X...Y, which means commits reachable from either X or Y, but not commits reachable from both. Both the two and three dot syntaxes (syntaces?) are described in the gitrevisions documentation, which is pretty central to Git and deserves an occasional re-read, or even some deep study if you haven't done that yet.
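Here is a sketch of the three-dot form on a small forked history: one shared base commit, one commit on a branch I am calling side-a, and two on side-b. The branch and file names are made up.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo User"

echo base > file
git add file
git commit -q -m "base"
base=$(git rev-parse HEAD)

git checkout -q -b side-a
echo a > a.txt
git add a.txt
git commit -q -m "a1"

git checkout -q -b side-b "$base"
echo b > b.txt
git add b.txt
git commit -q -m "b1"
echo more >> b.txt
git commit -q -a -m "b2"

# Commits reachable from either tip, but not from both.
sym=$(git rev-list side-a...side-b)
sym_count=$(git rev-list --count side-a...side-b)

# --left-right marks which side each commit came from (< or >).
git rev-list --left-right side-a...side-b
echo "commits in symmetric difference: $sym_count"
```

The shared base commit is reachable from both tips, so it does not appear in the list.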
Alas! The git diff command is, well, different. It supports both the two-dot and three-dot syntaxes as command-line arguments, but neither one means what it normally means. Instead, git diff X..Y is just an exact synonym for the one-character-shorter git diff X Y. That is, Git will resolve both X and Y to commit hash IDs, fish out the two snapshots, and feed them to its diff engine. (Meanwhile git diff X...Y means find the merge base between X and Y, and use that on the left and Y on the right. This is pretty darn useful and there's no shorter way to express it.)
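Both equivalences can be checked directly. The sketch below builds a small fork (branch names side-a and side-b are invented) and compares each dotted form of git diff against its spelled-out equivalent.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo User"

echo base > file
git add file
git commit -q -m "base"
base=$(git rev-parse HEAD)

git checkout -q -b side-a
printf 'this file exists only on side a\n' > a.txt
git add a.txt
git commit -q -m "a1"

git checkout -q -b side-b "$base"
printf 'completely different text for side b\n' > b.txt
git add b.txt
git commit -q -m "b1"

# Two dots: the same as naming the two commits separately.
two_dot=$(git diff --name-status side-a..side-b)
two_args=$(git diff --name-status side-a side-b)

# Three dots: merge base on the left, side-b on the right.
three_dot=$(git diff --name-status side-a...side-b)
via_base=$(git diff --name-status "$(git merge-base side-a side-b)" side-b)

echo "side-a..side-b:"
echo "$two_dot"
echo "side-a...side-b:"
echo "$three_dot"
```

The two-dot diff mentions a.txt (it is gone when going from side-a to side-b), while the three-dot diff only shows what side-b did since the fork.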
When you use:
git diff --name-status -C 024d72b..HEAD
you're getting a commit (024d72b) in which the content of this_is_the_old_filename.txt differs quite a lot from the content of this_is_the_new_filename.txt in the HEAD commit. As such, the similarity index between the two files has fallen below 50%, which is the default --find-renames or -M value.
You can, if you like, specify a lower value. However, the lower you set the value, the more likely Git is to falsely detect a rename. If some file has a long line of = and - characters due to being a markdown file, and some other unrelated markdown file has a long line of = and - characters, the two files might be 5% similar or something, even though they were never renamed. If only the old file exists in the old commit on the left, and only the new file exists in the new commit on the right, a very-low -M1% might consider this a "rename".
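This drift is reproducible in a scratch repository. In the sketch below, old-name.txt and new-name.txt are stand-ins for the real file names, the rename happens in one commit with no content change, and a later commit rewrites seven of the ten lines. The 20% threshold is an arbitrary value picked to sit below the resulting similarity; I am not asserting the exact score Git computes.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo User"

i=1
while [ "$i" -le 10 ]; do
  echo "line $i of the original file"
  i=$((i + 1))
done > old-name.txt
git add old-name.txt
git commit -q -m "add file"
first=$(git rev-parse HEAD)

git mv old-name.txt new-name.txt
git commit -q -m "pure rename"

# Later work rewrites most of the file under its new name.
awk 'NR <= 7 { print "rewritten line " NR " with new text"; next } { print }' \
  new-name.txt > new-name.txt.tmp
mv new-name.txt.tmp new-name.txt
git commit -q -a -m "rewrite most of the file"

# Parent-to-child across the rename commit: an exact match.
step_status=$(git diff --name-status -M HEAD~2 HEAD~1)
# Across the whole range at the default 50%: a delete and an add.
default_status=$(git diff --name-status -C "$first"..HEAD)
# With a lower threshold the pair is accepted as a rename again.
loose_status=$(git diff --name-status -M20% "$first"..HEAD)

echo "rename commit only: $step_status"
echo "whole range, default threshold:"
echo "$default_status"
echo "whole range, -M20%: $loose_status"
```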
The default value of 50% was chosen without a lot of complicated analysis, but it turns out to work pretty well for a lot of real projects. Because of the way the similarity index computation works, it only tends to fail on very small files, where the percentage number will swing wildly around based on the number of mini-chunk matches.
What you can do about this
In general, rename-finding works much better when comparing parent and child from ordinary commits. As such, git log has git log --follow, which does this for one file name. It walks through the commits (the history) and compares each child against its parent, looking for renames that involve the one specified name on the right. (You must supply the "new" name.) Upon finding one, git log prints out the commit in the usual way, and then resumes walking backwards through history, except that it's now looking for the previous name.
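A minimal sketch of the difference --follow makes, on a three-commit history in a scratch repository: create the file, rename it, then edit it. The names old-name.txt and new-name.txt are invented for the demonstration.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"   # hypothetical identity for the demo
git config user.name "Demo User"

printf 'first line\nsecond line\nthird line\n' > old-name.txt
git add old-name.txt
git commit -q -m "create file"
git mv old-name.txt new-name.txt
git commit -q -m "rename file"
echo "one more line" >> new-name.txt
git commit -q -a -m "edit file"

# Without --follow, history stops at the commit where the new name appears.
plain_count=$(git log --format=%h -- new-name.txt | grep -c .)
# With --follow, git log switches to the old name at the rename and keeps going.
follow_count=$(git log --follow --format=%h -- new-name.txt | grep -c .)

echo "without --follow: $plain_count commits"
echo "with --follow:    $follow_count commits"

# Show the rename as git log sees it while following.
git log --follow --name-status --format='%s' -- new-name.txt
```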
There are some major flaws here, including the fact that you have to know the new name: knowing the old name is no help as --follow does not work with --reverse. You can only use one file name at a time, too. The worst issue is that this kind of git log uses history simplification to avoid searching down some of the commit paths, and Git will not detect renames that occur within a merge commit, so there are some cases where this just doesn't work. Adding --full-history and/or -m does not really fix the problem as Git doesn't switch back to the new name when searching the "other leg" of a merge.
Still, as crappy tools go, git log --follow is a pretty good crappy tool. I once tried to make it better and this turns out to be a hard problem, which is probably why it's still a little shoddy like this.