3

Im trying to pick an id for a file's history -- i'd like it to be or refer to the "object" whose details git log --follow <filename>. I'm wondering:

How does git know that one file is a variant of another across subsequent commits? The name being the same is a strong hint, of course, but it also tracks renaming on commit. Does it keep the results of its calculations somewhere for git log to refer to (where?), or does git log repeat these calculations every time? (And what calculations are these?)

Ideally I'd like to access or recreate the history (list of commits/blob shas) with nodegit.

shaunc
  • 5,317
  • 4
  • 43
  • 58
  • Don't quote me on this (ask @torek) but I actually think Git uses a sort of regex to track filenames across commits. Note that Git has a hard time tracking across renaming; if you're very concerned about a file history then consider _not_ renaming it. – Tim Biegeleisen May 11 '17 at 05:12
  • I just want to figure out how git does it so I can recreate it and/or look it up. (Quality of fit is secondary problem for the moment -- I need a general solution and think git might be a good starting point.) – shaunc May 11 '17 at 05:14
  • Allow me to clarify: Git looks for an exact match, and if that fails then it pulls out a regex. An exact match would only not be found either for new files or for files whose name changed. – Tim Biegeleisen May 11 '17 at 05:24
  • I take it this is a regex for the filenames, to spot renaming (or moving)? It also checks file content, right? If you can put a link/explanation of the algorithm I'd be pleased to accept that as an answer... but want enough details to recreate (or at least look and decide to simplify if its too complex for me). – shaunc May 11 '17 at 05:28
  • Have a look here: http://stackoverflow.com/questions/7938582/how-does-git-detect-similar-files-for-its-rename-detection – Tim Biegeleisen May 11 '17 at 05:33
  • Git does not store in any way, shape, or form, that a rename/move has taken place. This is entirely the tooling on top of the underlying repository. In other words, if you examine the git objects that make up these commits, their trees, and the files they contain, you will not find mention of anything that was renamed or moved. Instead, when necessary, git analyzes the data (again) to (again) figure out what happened. – Lasse V. Karlsen May 11 '17 at 05:46
  • 1
    @TimBiegeleisen: It doesn't apply regex matching to the file names. It *does* do exact content matching first (by hash ID), and then "similarity scoring" by an undocumented algorithm (in [diffcore-delta.c](https://github.com/git/git/blob/master/diffcore-delta.c)) that actually just hashes lines-or-line-fragments. It also checks whether the "base name" of the two files match; if so this boosts the match-chance. This same code is used to find similar blobs when building pack files. – torek May 11 '17 at 06:29
  • @torek Fire an answer away if you feel like it :-) I remembered the part about exact matching and then falling back to something else. Maybe this is enough info for this question. – Tim Biegeleisen May 11 '17 at 06:30
  • @torek, TimBiegeleisen if someone wants to pull together description and references (also see ref by j6t ) I will accept.... – shaunc May 12 '17 at 13:57
  • That's a fair bit of work, or a link-heavy answer :-) (basically we either have to describe the code, or link to the code, in the two diffcore files). Worse, though, is that these things are internal implementation details. In particular `--follow` depends on some shortcuts that only work in the "backwards" direction (I looked at this long ago for `git log --reverse --follow` if you know the *original* name of a file). Some future Git might change some of these details, and then you'd be stuck with an incompatible implementation... – torek May 12 '17 at 17:45
  • @torek -- how about just links with a short summary of how it works (best would be a restatement of "the summary in git internals manual", but it doesn't exist there -- so this would be the "go to place" before wading into the code. :)) – shaunc May 13 '17 at 18:16
  • 1
    @shaunc: I couldn't resist delving deeper. :-) – torek May 14 '17 at 03:09
  • @torek -- wish I could mark as twice answered! – shaunc May 15 '17 at 03:42

2 Answers2

10

Both other people and I have described this in different (and no links) detail elsewhere, e.g., this answer to What's git's heuristic for assigning content modifications to file paths? or my answer to Git Diff of same files in two directories always result in "renamed". The details are slightly different for git log --follow than they are for git diff, since git diff usually deals with an entire tree-full of left and right side files, but git log --follow only works with one particular path.1

In any case, the rename-following happens when comparing two specific commits. For a general git diff they are any two commits R (right side) and L (left side—you choose the two),2 but for git log they are specifically parent and child. Let's call these P and C for convenience. With git log --follow, Git runs an after-diff step (called from diff_tree_sha1; see footnotes) that trims everything down to one file. The diff is done with R=C and L=P. The general case is actually easier to describe, though, so we'll start with that.

Normally, when comparing R vs L, Git:

  • matches up all the tree files that have the same full path name, then
  • puts the remaining files (paths) into a pairing queue.

You can modify this a bit with the -B (pair-breaking) flag, which actually takes two optional integers (-Bn/m). Only the n integer matters for rename detection.3 You can modify it with the -C flag as well; this takes only an optional n, and turns on copy detection. In all cases rename detection must be turned on. Rename detection is enabled via -M, which likewise takes an optional integer n, or automatically in the case of git log --follow and other commands like git status or the post-merge git diff --stat.

In any case, the integer n here is a similarity (or dissimilarity) metric value for all of these various options. This is where we get to the meat of the rename detection code.

Suppose that we have, to start with, a basic git diff <commit1> <commit2> or git diff <tree1> <tree2> operation. This winds up calling builtin_diff_tree in builtin/diff.c, which calls diff_tree_sha1 (which we'll see again later) and then log_tree_diff_flush in log-tree.c. This almost immediately calls diffcore_std in diff.c, which runs the the diffcore_break, diffcore_rename, and diffcore_merge_broken functions if the right (-B, -M or -C, and -B again) options are selected.

These three functions operate on a pairing queue. How does the pairing queue get set up? I'll leave that to another section since it's complicated. For now, just assume that the pairing queue already has path/to/file matched with path/to/file when there's a path/to/file in both L and R, and otherwise has an unpaired path/to/L-only and path/to/R-only for cases where there's a file-path that occurs only in L or only in R.

The diffcore_break function is in diffcore-break.c. Its job is to find already-paired files whose dissimilarity index (when comparing the L and R versions) is above some threshold. If that's the case, it breaks the pairing. The diffcore_merge function is just below it in the same file; it rejoins a broken-up pair if neither half has found a "better mate". The dissimilarity index computation is similar to, but not the same as, the similarity computation.4

Similarity detection

The more interesting diffcore_rename function is in diffcore-rename.c. It has a special case shortcut for --follow that we can ignore for now. It then looks for exact renames, i.e., files whose blob hashes match, even though their names don't. There are some fiddly bits for using "the next file" if multiple L sources have the same hash as some unpaired R destination, too.

Next, it checks how many unpaired entries there are, because it is going to (in effect) do num(L) times num(R) comparisons of files to compute their similarities, and this is going to take a lot of time and space. It will even automatically downgrade a --find-copies-harder case that is "too hard". Then, for each possible L and R pairing, it computes a similarity index and a name score.

The similarity index code is in estimate_similarity in diffcore-rename.c. It relies on the function diffcore_count_changes in diffcore-delta.c, which says this (I'm copying it straight from the file since it's one of the core metrics):

 * Idea here is very simple.
 *
 * Almost all data we are interested in are text, but sometimes we have
 * to deal with binary data.  So we cut them into chunks delimited by
 * LF byte, or 64-byte sequence, whichever comes first, and hash them.
 *
 * For those chunks, if the source buffer has more instances of it
 * than the destination buffer, that means the difference are the
 * number of bytes not copied from source to destination.  If the
 * counts are the same, everything was copied from source to
 * destination.  If the destination has more, everything was copied,
 * and destination added more.
 *
 * We are doing an approximation so we do not really have to waste
 * memory by actually storing the sequence.  We just hash them into
 * somewhere around 2^16 hashbuckets and count the occurrences.

There's a secret bit here though: the similarity index ignores \r characters if the file is considered "not binary" and the \r is immediately followed by \n.

The final similarity index score is:

score = (int)(src_copied * MAX_SCORE / max_size);

where src_copied is the number of hashed chunks (of 64 bytes or up-to-newline) that occurred in the source and then occurred again in the destination, and max_size is the size, in bytes, of whichever blob is larger. (This byte count does not account for stripped '\r' characters. Those are merely removed from the 64-or-up-to-newline chunks being hashed.)

The "name score" is really just 1 (same base name) or 0 (different base name), i.e., 1 if the L file is dir/oldbase and the R file is differentdir/oldbase, but 0 if the L file is dir/oldbase and the R file is anything/newbase. This is used to make Git favor newdir/oldbase over anything/newbase when those two files are equally similar.

Creating the pairing queue

The diff_tree_sha1 code calls (through a series of functions) ll_diff_tree_paths (both are in in tree-diff.c; I linked only to the final function here). This is a complicated and extremely optimized bit of code (Git spends a lot of time here) so we'll just do a quick overview and ignore the complications (see footnote 2). This code looks partly at the full path names of each blob in each tree (these are the P1,...,Pn items in the comment at the top), and partly at the blob hashes for each of these names. For files that have the same name and the same contents, it does nothing (except in --find-copies-harder mode in which case it queues all file names). For files that have the same name and different contents, or no L or R name, it calls (through function pointers, stored in opt->pathchange, opt->change, and opt->add_remove) what eventually boil down to diff_change or diff_addremove, both in diff.c. These call diff_queue, which put the file pair (one of which is a dummy if the file is new or removed) into the pairing queue.

Hence, the short version (if we're not using -C or --find-copies-harder), the pairing queue has unpaired files only when there is no original source file in L corresponding to a file in R, or no destination file in R corresponding to a source file in L. With -C, it has every source file, or every modified source file, listed as well so that they can be scanned for copies (the choice here being based on whether you used --find-copies-harder).

Reducing this to one file, for --follow

We already noted a shortcut in the diffcore-rename.c code: it skips over all R file names that are not the one file name we care about. There seem to be some similar hacks in ll_diff_tree_paths, although I am not sure whether they apply here. The code is also driven in a different way, as noted in footnote 2. When we diff parent P vs child C, and find a rename left in our pairing queue, we then switch out the name of the file we're using as a restriction in our git log -- <path>: we replace the new name in C with the path for the rename-source in P. Then we just continue diffing as usual, so the next time we compare a P-and-C pair, we're looking for oldpath instead of newpath. If we detect oldpath as being renamed from reallyoldpath, we switch that name into place again, as before.

Note that all the -B, -C, and -M machinery applies in theory, but the shortcuts may—it's not at all clear to me whether they do—keep some of it from working.


1When using --follow, Git uses the general diffcore code to run both pair-breaking and copy-detection. The general code is called from the code that wants to do the simplification. See the function try_to_follow_renames in tree-diff.c, which calls diffcore_std in diff.c. This all eventually calls diff_resolve_rename_copy which handles the pairing queues. Then try_to_follow_renames trims the result down to the one interesting file; this is later tested via diff_might_be_rename as called from diff_tree_sha1. I think this is all driven from log_tree_commit, called from either cmd_log_walk or log_show_early. This last appears to be an undocumented hack meant for use by some GUI(s).

2The tree matching in git diff actually accepts a single commit on the output right-hand side, and a list of commits on the input left-hand side, for combined diff purposes. This is how Git manages to show merge commits. It's a bit unclear how --follow works with merge commits though. See find_paths_generic in combine-diff.c, which calls diff_tree_sha1 as well. Note that the log --follow hack happens as a result of calling diff_tree_sha1, and this combined-diff merge-handling code calls that function once per parent. If the followed name will be changed, though, it has been changed by the time it goes through the second parent. Perhaps this is a bug. What happens if that second parent decides the new name results in another, different rename? Logically, it should choose up to one new name per parent-fork, working in topological order, and consider resolving them again somehow if and when the forks rejoin.

3The second, m, value in -Bn/m tells Git when not to run a real diff, and instead just describe the change in a non-renamed file as "delete all the original lines, replace them with all the new lines". That assumes that either the first -Bvalue ended up not breaking the pairing, or the pairing was re-glued-together due to the -M value, or glued to a different source as a -C copy.

4See should_break for details. This also uses the diffcore-delta.c code, but in a different way, using the "added" count.

Community
  • 1
  • 1
torek
  • 448,244
  • 59
  • 642
  • 775
1

git log repeats the calculations every time.

The decision is based on file contents. When one file goes away, and another file appears, Git compares the contents and decides that the file was renamed if the contents are equal or very similar (by some measure).

j6t
  • 9,150
  • 1
  • 15
  • 35
  • ok ... so... is there an algorithm? I tried looking in git internals, but perhaps I didn't know where to look. A reference would be great. – shaunc May 11 '17 at 05:19
  • 2
    Look for [diffcore-rename.c](https://github.com/git/git/blob/master/diffcore-rename.c). – j6t May 11 '17 at 05:53