16

What algorithm Git uses to determine, that some file was renamed?

This is, what git status produced just a few minutes before:

enter image description here

Information marked with yellow box is incorrect. There was actually no such kind of rename. Files views/file/create.php and views/file/index.php were truly deleted half hour after a completely new set of two files -- views/logo/create.php and views/logo/index.php was created.

Both files sets may seem (to Git) quite similar, but the fact remains -- these are not the same, renamed files. This is a complete new set of files, created in different directory about half an hour before deleting first set of files.

Since information provided by Git is incorrect, I'd like to feed my curiosity and that's why I'm asking.

trejder
  • 17,148
  • 27
  • 124
  • 216
  • I agree with Flosculus and just want to add [this](http://fabiensanglard.net/git_code_review/diff.php) article going into a bit more details into the algorithms used in the similarity detection. – wonderb0lt Apr 23 '15 at 12:05
  • 1
    Nice! Four upvotes and 1 star within just 2-3 minutes, on a question, that is a perfect dupe! :> I just love SE community. And... oops... sorry for being an author of that dupe, but my Google was just flooded with cold coffee! – trejder Apr 23 '15 at 12:08

1 Answers1

25

From Wikipedia:

Renames are handled implicitly rather than explicitly. A common complaint with CVS is that it uses the name of a file to identify its revision history, so moving or renaming a file is not possible without either interrupting its history, or renaming the history and thereby making the history inaccurate. Most post-CVS revision control systems solve this by giving a file a unique long-lived name (a sort of inode number) that survives renaming. Git does not record such an identifier, and this is claimed as an advantage.[34][35] Source code files are sometimes split or merged as well as simply renamed,[36] and recording this as a simple rename would freeze an inaccurate description of what happened in the (immutable) history. Git addresses the issue by detecting renames while browsing the history of snapshots rather than recording it when making the snapshot.[37] (Briefly, given a file in revision N, a file of the same name in revision N−1 is its default ancestor. However, when there is no like-named file in revision N−1, Git searches for a file that existed only in revision N−1 and is very similar to the new file.) However, it does require more CPU-intensive work every time history is reviewed, and a number of options to adjust the heuristics. This mechanism does not always work; sometimes a file that is renamed with changes in the same commit is read as a deletion of the old file and the creation of a new file. Developers can work around this limitation by committing the rename and changes separately.

JoeBrockhaus
  • 2,745
  • 2
  • 40
  • 64
Rahul Tripathi
  • 168,305
  • 31
  • 280
  • 331