I have this one git repo that has hundreds of text files in a directory tree and contains hundreds of commits and 20 branches (some active, some merged and kept for historical reasons).

I renamed a file and committed the change on the master branch.

I'm particularly interested in the diff --name-status report for a special requirement.

If I check the changes between the second-to-last commit and the last one, I can see that the file was renamed:

$ git diff --name-status -C HEAD^..HEAD|grep this_is_the_old_filename.txt
R100    this_is_the_old_filename.txt  this_is_the_new_filename.txt
...rest of output omitted

I can go further back, for example with HEAD~10..HEAD, and the rename of the file is still listed.

But if I go all the way back, using the hash of the very first commit in the repo (which I verified with the git log and git rev-list commands), the result changes:

$ git diff --name-status -C 024d72b..HEAD|grep this_is_the_old_filename.txt
A       this_is_the_old_filename.txt
...rest of output omitted

The file appears only once, with the A (added) indicator, even though other files appear with other statuses such as A, M, D and R.

The report includes many "R" (renamed) entries for many files, but none for the one file I renamed in the last commit.

What could cause the rename not to appear when I use the range 024d72b..HEAD, where 024d72b is the first commit in the repo history?

I understand that git works backwards and that merging branches back and forth can mess things up, but am I misunderstanding the functionality of git diff --name-status?

Tulains Córdova
  • Does the file exist in the root commit? – TTT Aug 29 '22 at 14:07
  • No, it was added later. – Tulains Córdova Aug 29 '22 at 14:08
  • OK, so the old filename doesn't exist in the root commit or the HEAD commit. In that case, my initial thought is that diff shouldn't display the old filename at all, and the new filename should be shown as an add. Do you get the same result if you remove `-C`? – TTT Aug 29 '22 at 14:11
  • @TTT that particular diff command shows both names in case of file renaming, like this: `R100 oldname newname` – Tulains Córdova Aug 29 '22 at 14:27
  • @TTT The same happens if I remove `-C`. I also found out the very commit where the file was added and asked the report starting from the previous commit (the commit before the file was added). The result is the same. – Tulains Córdova Aug 29 '22 at 14:29
  • I just tested this on a new repo with 3 commits and it does what I expect; I can't duplicate what you're seeing. In my example Commit 1 is empty. Commit 2 creates a file. Commit 3 renames the file. Diff 1 and 2 shows A(dd) the *old* file. Diff 2 and 3 shows the R100. Diff 1 and 3 shows A(dd) for the *new* filename only; the old filename is not shown. (This is all what I would expect.) – TTT Aug 29 '22 at 18:03
  • You may want to update your question with the info you provided in these comments. I see now that something isn't clear. Your second comment is confusing, because you're saying `git diff --name-status -C 024d72b..HEAD|grep this_is_the_old_filename.txt` shows an "A" in the question, but in that comment you say it shows an "R100". Which is it? – TTT Aug 29 '22 at 18:13
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/247655/discussion-between-tulains-cordova-and-ttt). – Tulains Córdova Aug 29 '22 at 18:26

1 Answer

Git does not store rename operations. Git detects them.

To understand how and why this works—which then lets you understand how, when, and why it doesn't work as well—we need to start at the beginning, with the idea that each commit stores two things: a full snapshot of every file (no folders, just files), and some metadata.

The metadata in any one ordinary commit contains the hash ID of its parent commit (singular: the fact that it has one parent is what makes it an "ordinary commit"). Hence, given any one commit:

git show HEAD

for instance, Git can reach into that commit's metadata and pull out the hash ID of its parent commit. Using that hash ID, Git can reach into its all-commits-database and pull out the files from the old (parent) commit. Using the hash ID that HEAD resolves to:

git rev-parse HEAD

Git can pull out all the files from the new (child) commit.
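As a minimal sketch of that lookup, the script below builds a throwaway two-commit repository and compares the parent hash stored in the commit's metadata with what HEAD^ resolves to. The file name f1 and the demo identity are placeholders invented for the illustration, not anything from the question.

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

echo one > f1 && git add f1 && git commit -q -m "first"
echo two > f1 && git commit -q -am "second"

# The raw commit object: a tree (the snapshot), a parent, and the rest of the metadata.
git cat-file -p HEAD

# The parent hash stored in the metadata is the same hash that HEAD^ resolves to.
stored_parent=$(git cat-file -p HEAD | sed -n 's/^parent //p')
resolved_parent=$(git rev-parse HEAD^)
echo "stored:   $stored_parent"
echo "resolved: $resolved_parent"
```

The two printed hashes should be identical, which is all that "reaching into the metadata" amounts to here.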

Because Git stores files in a special de-duplicated format (which saves a lot of space in the repository), Git can nearly-instantly detect files that have exactly the same content. This means that if the old snapshot has three files f1, f2, and f3, and the new snapshot has three files but they're named f1, f2-renamed, and f3, and the content of all three files is 100% identical, Git almost instantly knocks out f1 and f3 as boring, because they have the same names and the same content, and we're only interested in things that changed.

So, what seems to have changed between parent and child commits, at this point, is that file f2 was deleted, and a new file named f2-renamed was added. The git diff or git show command would show this as:

D       f2
A       f2-renamed

and in fact if you disable rename detection, using git diff --no-renames or git show --no-renames, this is what you will see.

In Git versions predating 2.9, having rename detection off is the default; in 2.9 and later, --find-renames=50% is the default. As you might then guess, the --find-renames=xx% directive tells Git: any time you see some file deleted and a new, differently-named file created, consider the pair of names as candidates for rename detection, and call the pair a rename if the contents are at least xx% similar.

The exact details of rename detection have changed somewhat over the years, but there's one fairly constant part here. Because the de-duplication can nearly-instantly detect 100%-identical copies of file contents, there's a first pass to check for this. In this case that first pass finds that f2 and f2-renamed have the exact same content. That's a 100% match and no matter how high you set the --find-renames=xx% number, 100% is good enough, so this D-and-A pair becomes one R100. And that's what you see here:

R100    this_is_the_old_filename.txt  this_is_the_new_filename.txt
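The f1 / f2 / f3 scenario can be reproduced in a scratch repository. This is only a sketch: the one-line file contents and the demo identity are made up, and -M is passed explicitly so the result does not depend on the Git version's default.

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

printf 'a\n' > f1; printf 'b\n' > f2; printf 'c\n' > f3
git add . && git commit -q -m "add three files"
git mv f2 f2-renamed && git commit -q -m "rename f2"

# Rename detection off: a delete plus an add.
without=$(git diff --name-status --no-renames HEAD^ HEAD)
echo "$without"

# Rename detection on: the identical content pairs up as R100.
with=$(git diff --name-status -M HEAD^ HEAD)
echo "$with"
```

In both cases f1 and f3 never show up, since their names and contents are unchanged.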

Git will, however, work harder if necessary. Suppose that besides the A-and-D, you also modified the contents, but only slightly. Git now computes a similarity index. This computation is fairly expensive, especially in a large repository with a lot of delete-and-adds that could be paired up, so Git first pairs up and removes all the 100%-matches.

The actual computation of the similarity index is rather complicated and is not documented. See "Trying to understand `git diff` and `git mv` rename detection mechanism" for my analysis. Git will "pair up" any A/D pairs that meet the minimum specified threshold, using the "best match". There is also a bonus for, and now some strategic optimization directed at, carrying multiple renames in a single "directory". (Remember, Git doesn't really store directories / folders, but it knows about them, and someone finally decided that Git should have some shortcuts in case a whole directory is renamed. It used to just give a one-percentage-point bump if the last component of the path name matched, and I have not looked to see what it does now.)

git diff --name-status -C HEAD^..HEAD

Aside from the -C here, this is basically the same as the git show example I used above. Git finds the hash ID for HEAD^ and the hash ID for HEAD and compares the two snapshots. The -C flag tells Git that it should also search for "copies"; this searching is expensive, so by default it's limited, but you can add --find-copies-harder or multiple -C options. This too uses the similarity-index values. In your case, since the rename is quickly found and removed from the rename-and-copy-candidates queues, the -C flag probably has no effect at all (this depends on whether there are other files detected as Add-ed).
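To see the "limited by default" part of copy detection, here is a small sketch with invented file names (orig.txt and copy.txt). With a single -C, Git only considers files modified in the same commit as copy sources, so an untouched original is not found until --find-copies-harder is added.

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

printf 'some content\n' > orig.txt
git add orig.txt && git commit -q -m "add orig.txt"
cp orig.txt copy.txt
git add copy.txt && git commit -q -m "copy orig.txt to copy.txt"

# orig.txt was not modified in this commit, so a single -C reports a plain add.
limited=$(git diff --name-status -C HEAD^ HEAD)
echo "$limited"

# Searching unmodified files as well finds the copy.
harder=$(git diff --name-status -C --find-copies-harder HEAD^ HEAD)
echo "$harder"
```

The second form inspects every file in the left-hand snapshot, which is why it is not the default in large repositories.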

Potential surprises

024d72b..HEAD

For git log or git rev-list, this kind of two-dot range expression "means" HEAD ^024d72b. In fact, many other commands handle this similarly: for instance, git cherry-pick will use a range expression to find multiple commits to copy. The commits found by this kind of range expression are those reachable from the "positive" reference HEAD, excluding any commits reachable from the "negative" reference ^024d72b.

There's a similar notation using three dots, X...Y, which means commits reachable from either X or Y, but not commits reachable from both. Both the two and three dot syntaxes (syntaces?) are described in the gitrevisions documentation, which is pretty central to Git and deserves an occasional re-read, or even some deep study if you haven't done that yet.
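A short sketch of both range forms as git rev-list sees them, using a made-up history with one base commit and one commit on each of two branches (the branch name side and the file names are placeholders):

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

echo base > f && git add f && git commit -q -m "base"
base=$(git rev-parse HEAD)
trunk=$(git symbolic-ref --short HEAD)   # whatever the initial branch is called

git checkout -q -b side
echo s > s && git add s && git commit -q -m "side work"
git checkout -q "$trunk"
echo m > m && git add m && git commit -q -m "trunk work"

# Two dots: reachable from HEAD, excluding anything reachable from base.
two_dot=$(git rev-list "$base..HEAD")
negated=$(git rev-list HEAD "^$base")

# Three dots: reachable from either tip, but not from both.
three_dot_count=$(git rev-list --count side...HEAD)
echo "$two_dot"
echo "$three_dot_count"
```

Here the two-dot range selects only the "trunk work" commit, while the three-dot range selects "side work" and "trunk work" and leaves out the shared base.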

Alas! The git diff command is, well, different. It accepts both the two-dot and three-dot syntaxes as command-line arguments, but neither one means what it normally means. Instead, git diff X..Y is just an exact synonym for the one-character-shorter git diff X Y. That is, Git will resolve both X and Y to commit hash IDs, fish out the two snapshots, and feed them to its diff engine. (Meanwhile git diff X...Y means find the merge base between X and Y, and use that on the left and Y on the right. This is pretty darn useful and there's no shorter way to express it.)
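The following sketch checks both git diff claims on a tiny made-up history (a base commit, then one commit each on a branch named side and on the initial branch). The names are placeholders, and --no-renames is passed only to keep the output independent of rename detection.

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

echo base > f && git add f && git commit -q -m "base"
trunk=$(git symbolic-ref --short HEAD)
git checkout -q -b side
echo s > s && git add s && git commit -q -m "side work"
git checkout -q "$trunk"
echo m > m && git add m && git commit -q -m "trunk work"

# Two dots: just two snapshots, same as giving the two commits separately.
dots=$(git diff --name-status --no-renames side..HEAD)
plain=$(git diff --name-status --no-renames side HEAD)
echo "$dots"

# Three dots: merge base on the left, HEAD on the right.
three=$(git diff --name-status --no-renames side...HEAD)
via_base=$(git diff --name-status --no-renames "$(git merge-base side HEAD)" HEAD)
echo "$three"
```

The two-dot diff reports m as added and s as deleted, because it compares the two tips directly; the three-dot diff reports only m, because the work on side never enters the comparison.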

When you use:

git diff --name-status -C 024d72b..HEAD

you're getting a commit (024d72b) in which the content of this_is_the_old_filename.txt differs quite a lot from the content of this_is_the_new_filename.txt in the HEAD commit. As such, the similarity index between the two files has fallen below 50%, which is the default --find-renames or -M value.

You can, if you like, specify a lower value. However, the lower you set the value, the more likely Git is to falsely detect a rename. If some file has a long line of = and - characters due to being a markdown file, and some other unrelated markdown file has a long line of = and - characters, the two files might be 5% similar or something—even though they were never renamed. If only the old file exists in the old commit on the left, and only the new file exists in the new commit on the right, a very-low -M1% might consider this a "rename".

The default value of 50% was chosen without a lot of complicated analysis, but it turns out to work pretty well for a lot of real projects. Because of the way the similarity index computation works, it only tends to fail on very small files, where the percentage number will swing wildly around based on the number of mini-chunk matches.
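Here is a sketch of that effect in a scratch repository. The file names old.txt and new.txt, the line contents, and the 20% threshold are all invented for the illustration; the file is grown from 10 to 30 lines before the rename, so that roughly a third of the final content matches the first version.

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

i=1; while [ "$i" -le 10 ]; do echo "original line $i"; i=$((i+1)); done > old.txt
git add old.txt && git commit -q -m "add old.txt"
first=$(git rev-parse HEAD)

i=1; while [ "$i" -le 20 ]; do echo "added line $i"; i=$((i+1)); done >> old.txt
git commit -q -am "grow old.txt"
git mv old.txt new.txt && git commit -q -m "rename to new.txt"

# Parent vs child: identical content, so the rename is found.
step=$(git diff --name-status -M HEAD^ HEAD)
echo "$step"

# First commit vs HEAD at the default 50%: too dissimilar, so an add plus a delete.
far=$(git diff --name-status -M50% "$first" HEAD)
echo "$far"

# Same comparison with a lower threshold: expected to pair up as a low-score rename.
low=$(git diff --name-status -M20% "$first" HEAD)
echo "$low"
```

The exact score printed in the last case depends on Git's similarity computation, so treat the number as indicative rather than something to rely on.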

What you can do about this

In general, rename-finding works much better when comparing parent and child from ordinary commits. As such, git log has git log --follow, which does this for one file name. It walks through the commits—the history—and compares each child against its parent, looking for renames that involve the one specified name on the right. (You must supply the "new" name.) Upon finding one, git log prints out the commit in the usual way, and then resumes walking backwards through history, except that it's now looking for the previous name.
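A minimal sketch of the difference --follow makes, using an invented three-commit history in which old.txt is created, edited, and then renamed to new.txt:

```shell
# Throwaway repository; the identity values are placeholders.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
repo=$(mktemp -d) && cd "$repo" && git init -q

echo v1 > old.txt && git add old.txt && git commit -q -m "add old.txt"
echo v2 >> old.txt && git commit -q -am "edit old.txt"
git mv old.txt new.txt && git commit -q -m "rename to new.txt"

# Without --follow, history for new.txt starts at the commit that introduced that name.
plain=$(git log --format=%s -- new.txt)
echo "$plain"

# With --follow, Git switches to the old name once it passes the rename.
followed=$(git log --follow --format=%s -- new.txt)
echo "$followed"
```

Adding --name-status to the second command also prints the R100 line at the commit where the name changes.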

There are some major flaws here, including the fact that you have to know the new name: knowing the old name is no help as --follow does not work with --reverse. You can only use one file name at a time, too. The worst issue is that this kind of git log uses history simplification to avoid searching down some of the commit paths, and Git will not detect renames that occur within a merge commit, so there are some cases where this just doesn't work. Adding --full-history and/or -m does not really fix the problem as Git doesn't switch back to the new name when searching the "other leg" of a merge.

Still, as crappy tools go, git log --follow is a pretty good crappy tool. I once tried to make it better and this turns out to be a hard problem, which is probably why it's still a little shoddy like this.

torek
  • My understanding from the question and the comments is that commit id `024d72b` is the root commit, and the old filename does not exist in that commit. – TTT Aug 30 '22 at 14:28
  • @TTT: well, that would do it too. But the answer above covers more cases :-) – torek Aug 30 '22 at 21:36
  • Hehe. Yeah. This is a good general answer. (I'm thinking maybe there's a typo in the question such that the Add is actually for the new filename.) – TTT Aug 30 '22 at 21:54