I would like to use git diff's "similarity index" calculation for files outside of a Git repository.

Here is example output of git diff for files not tracked by Git (first diff: what I get) and for files tracked by Git (second diff: what I would like to get for external files as well):

$ seq 1 3 > file1 ; cp file1 file2 ; echo 4 >> file2            # create files
$ git diff -C file1 file2                                       # show diff (no repo, -C has no effect)
diff --git 1/file1 2/file2
index 01e79c32a8c9..94ebaf900161 100644
--- 1/file1
+++ 2/file2
@@ -1,3 +1,4 @@
 1
 2
 3
+4
$ git init > /dev/null                                          # create repo
(master #%)$ (git add file1; git commit -m file1) > /dev/null   # add file1
(master %)$ (git add file2; git commit -m file2) > /dev/null    # add file2
(master %)$ git diff -C HEAD^                                   # show diff (in repo, -C works)
diff --git c/file1 w/file2
similarity index 75%
copy from file1
copy to file2
index 01e79c32a8c9..94ebaf900161 100644
--- c/file1
+++ w/file2
@@ -1,3 +1,4 @@
 1
 2
 3
+4

I have already seen those questions:

and some other related.

I read the git diff manual and even some of Git's diff source code. It looks like the similarity index is always shown for renamed (status R) or copied (C) files, and only sometimes for modified (M) ones:

Status letters C and R are always followed by a score (denoting the percentage of similarity between the source and target of the move or copy). Status letter M may be followed by a score (denoting the percentage of dissimilarity) for file rewrites.

So far I have found no way of forcing Git to treat external files as copies (--find-copies/-C) or renames (--find-renames/-M). Files compared outside of a repository get status M (modified), which can be seen with the --raw option. Unfortunately, the manual does not explain when the score is shown for status M, and it is not obvious from the source code either.

Is this possible at all?

Or would it require adding new options to git-diff (maybe --assume-copy) to force the required status?

DEVoytas

1 Answer

There is no way to trigger Git to run its similarity-index computation on files that are not in either the index or a Git tree object. That's a bit of a shame, since such an option wouldn't be particularly hard to code, and it would be nice to be able to ask Git "How similar are files X and Y?" for any arbitrary pair of files, in the repository or not.

That said, if you have two files that aren't committed, and you want Git to compute a similarity index for them, you can just create two commits, or simply two trees, that contain nothing but those two files. There's no front end command to do this but it's not difficult to build your own. Here is a script fragment to do this:

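#! /bin/sh -e
# Run inside any Git repository: the blobs and trees are written to its object database.

GIT_INDEX_FILE=$(mktemp)
export GIT_INDEX_FILE
rm "$GIT_INDEX_FILE"    # keep only the unique name; a zero-length file is not a valid index
trap 'rm -f "$GIT_INDEX_FILE"' 0 1 2 3 15
hash=$(git hash-object -t blob -w /tmp/file1)
git update-index --add --cacheinfo "100644,$hash,file"
tree1=$(git write-tree)
hash=$(git hash-object -t blob -w /tmp/file2)
git update-index --add --cacheinfo "100644,$hash,file"
tree2=$(git write-tree)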

We now need to tell Git to compare the two trees:

git diff-tree $tree1 $tree2

This won't trigger the similarity computation, though. In theory, adding -B should do that, but I could not get it to work.

What I did get to work was to invoke the rename detector by using two different names for the files and adding an explicit -M. There must be some match between the contents, otherwise you just get a delete and an add (D and A). You must also either remove the temporary index file between the two git update-index operations, or explicitly clear out the file1 entry:

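#! /bin/sh -e
# Run inside any Git repository: the blobs and trees are written to its object database.
GIT_INDEX_FILE=$(mktemp)
export GIT_INDEX_FILE
rm "$GIT_INDEX_FILE"    # keep only the unique name; a zero-length file is not a valid index
trap 'rm -f "$GIT_INDEX_FILE"' 0 1 2 3 15
hash1=$(git hash-object -t blob -w /tmp/file1)
git update-index --add --cacheinfo "100644,$hash1,file1"
tree1=$(git write-tree)
rm -f "$GIT_INDEX_FILE" # start the second tree from an empty index
hash2=$(git hash-object -t blob -w /tmp/file2)
git update-index --add --cacheinfo "100644,$hash2,file2"
tree2=$(git write-tree)

# -M1% lowers the rename threshold so even a weak match is reported as R
git diff-tree -M1% "$tree1" "$tree2"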

Running this on two files, /tmp/file1 and /tmp/file2, with one line that matches, I got:

$ /tmp/foo.sh
:100644 100644 2175e89fddda9d80aa15f579dba8605d5ed84af4 a63117dbbc7985b3984daa948aa87eaed8ea89ad R066   file1   file2
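
For the other option mentioned above, clearing the file1 entry instead of deleting the temporary index, a minimal sketch might look like the following. The function name compare_trees_one_index is hypothetical, and the sketch assumes it is called from the top level of some Git repository (the blobs and trees are written to that repository, and the path given to --force-remove is taken relative to the current directory).

```shell
#!/bin/sh
# Sketch only: compare_trees_one_index is a hypothetical helper name.
# Assumption: called from the top level of a Git repository.
# The body runs in a subshell, so the exported variable does not leak out.
compare_trees_one_index() (
    GIT_INDEX_FILE=$(mktemp) || exit 1
    export GIT_INDEX_FILE
    rm -f "$GIT_INDEX_FILE"     # a zero-length file is not a valid index

    # First tree: a single entry named file1
    hash1=$(git hash-object -t blob -w "$1") &&
    git update-index --add --cacheinfo "100644,$hash1,file1" &&
    tree1=$(git write-tree) &&

    # Drop the file1 entry from the index rather than deleting the index file
    git update-index --force-remove file1 &&

    # Second tree: a single entry named file2
    hash2=$(git hash-object -t blob -w "$2") &&
    git update-index --add --cacheinfo "100644,$hash2,file2" &&
    tree2=$(git write-tree) &&

    git diff-tree -M1% "$tree1" "$tree2"
    status=$?

    rm -f "$GIT_INDEX_FILE"
    exit "$status"
)
```

Calling compare_trees_one_index /tmp/file1 /tmp/file2 should print the same kind of raw line as shown above.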

The computed similarity index numbers are quite odd because Git's similarity index computation itself is weird:

aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
minimum

vs

babababababababababababababababababababababababababababababababababaa
minimum

gives a similarity of 010, while making the first line of file2 all b characters gives a similarity of 092. The matching "minimum" line is required; otherwise the files don't match at all and the result is a delete and an add.
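
To experiment with arbitrary pairs of files without touching an existing repository, the steps above could be wrapped into a small function. This is only a sketch under a few assumptions: git_similarity is a hypothetical name (not a Git command), it creates a throwaway repository under a mktemp directory, and it feeds the files to git hash-object on standard input so that no path-based filters are applied.

```shell
#!/bin/sh
# Sketch only: git_similarity is a hypothetical helper, not part of Git.
# Prints the raw diff-tree output for two arbitrary files, for example a
# line containing "R075 file1 file2", or D and A lines if nothing matches.
git_similarity() (
    file_a=$1
    file_b=$2
    scratch=$(mktemp -d) || exit 1

    # Throwaway repository that only holds the temporary blobs and trees
    git init -q "$scratch/repo" >/dev/null 2>&1 || { rm -rf "$scratch"; exit 1; }
    GIT_DIR=$scratch/repo/.git
    GIT_INDEX_FILE=$scratch/index      # does not exist yet, so it starts empty
    export GIT_DIR GIT_INDEX_FILE

    # Tree A contains only "file1", tree B contains only "file2"
    hash_a=$(git hash-object -w --stdin <"$file_a") &&
    git update-index --add --cacheinfo "100644,$hash_a,file1" &&
    tree_a=$(git write-tree) &&
    rm -f "$GIT_INDEX_FILE" &&
    hash_b=$(git hash-object -w --stdin <"$file_b") &&
    git update-index --add --cacheinfo "100644,$hash_b,file2" &&
    tree_b=$(git write-tree) &&

    # -M1% asks the rename detector to report even weak matches
    git diff-tree -M1% "$tree_a" "$tree_b"
    status=$?

    rm -rf "$scratch"
    exit "$status"
)
```

With this in place, git_similarity /tmp/file1 /tmp/file2 prints the raw line, and the score is the three digits after the R. Because the body runs in a subshell, the GIT_DIR and GIT_INDEX_FILE settings do not affect the calling shell.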

torek