
What algorithm does git diff use to detect similar (copied/renamed) files?

  1. What is the complexity with respect to the number of files and the size of files in the repository?
  2. Does whitespace matter? (e.g. indenting a whole file)
  3. Does it work on text only?
  4. Is there a risk of similarity detection timing out? For example, in a large repository, does lowering the similarity index (--find-renames/-M) risk missing results that a higher index would have found, because more files are considered at the lower threshold?

(Note: this seems to have been the intent of a previous question asked 6 years ago, but the accepted answer eschewed any discussion of the algorithm.)

Vincent Scheib
  • In which context, detecting file renames that also introduce modifications? – user229044 Jan 10 '18 at 23:17
  • I suppose to find out whether the files have been modified or not – freude Jan 11 '18 at 00:16
  • If you're looking for the computation that feeds into the, e.g., `R89` status from `git diff --find-renames --name-status commit1 commit2` output, see https://stackoverflow.com/a/46258968/1256452. (There's more that occurs before this point, though, which I have outlined in other answers.) – torek Jan 11 '18 at 00:56
  • If you're looking for the algorithm Git uses to pair up files to do rename detection at all (between various source files), see, e.g., https://stackoverflow.com/a/40352403/1256452 (I think I have other answers that go a bit deeper into the details, this is just the first one I turned up in a search). – torek Jan 11 '18 at 00:59
  • Related to git diff with similarity `--find-renames`/`-M` for files that move with slight changes, as is common with e.g. C++ files that are moved and need small updates to include paths. Thanks for link to other answers, both useful reads, https://stackoverflow.com/a/46258968/1256452 most so. The curiosity starts when trying to do code reviews and being frustrated when moved files appear in a diff as deleted/added instead of a renamed file with small deltas. Beyond understanding, it would be good to know if any care can be taken to increase the chance a reviewer gets a clean diff. – Vincent Scheib Jan 12 '18 at 04:37
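For anyone who wants to check which moved files Git actually paired up before sending a change for review, here is a minimal sketch that reads the similarity score out of `git diff --find-renames --name-status` output. It assumes the tab-separated layout of `--name-status`, where rename and copy entries carry a status letter followed by a score and then the old and new paths. The `StatusEntry` type, the `parse_name_status` helper, and the sample paths are hypothetical names made up for illustration, and the sketch does not handle the quoting Git applies to unusual path names.

```python
from typing import NamedTuple, Optional


class StatusEntry(NamedTuple):
    status: str            # single status letter, e.g. "R", "C", "M", "A", "D"
    score: Optional[int]   # similarity score for R/C entries, else None
    old_path: str
    new_path: str


def parse_name_status(line: str) -> StatusEntry:
    """Parse one tab-separated line of --name-status output (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    code = fields[0]
    status, digits = code[0], code[1:]
    # Rename/copy codes carry a numeric score after the letter (e.g. R089).
    score = int(digits) if digits else None
    if status in ("R", "C"):
        old_path, new_path = fields[1], fields[2]
    else:
        # Other statuses list a single path; reuse it for both fields.
        old_path = new_path = fields[1]
    return StatusEntry(status, score, old_path, new_path)


# Made-up sample output, not taken from a real repository.
sample = "R089\tsrc/old_name.cc\tsrc/new_name.cc\nM\tREADME.md\n"
entries = [parse_name_status(line) for line in sample.splitlines()]

for entry in entries:
    print(entry)
```

Any file that shows up as a separate `D` and `A` pair instead of an `R` entry was not matched at the current threshold, so those are the ones worth rechecking with a lower `-M` value.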

0 Answers