
I'm sure I'm doing something wrong here, but I'm not sure what.

I have a master and a branch, which will ultimately be merged back in, but for now development is happening in both.

This means I regularly merge the latest changes from master into branch.

The problem is that the branch includes a lot of file moves and renames.

My current process is:

  • in branch
    • Rename my-control.html to my-control.js
    • Stage the change and commit - Git picks up that it's a move and not a delete+add
    • Update my-control.js
    • Commit my-control.js changes.
    • my-control.js now has the new changes and the history from my-control.html
  • in master
    • Make a change to my-control.html
    • Commit the change
  • back in branch
    • Merge changes from master

And this is where the issues happen - sometimes I get the changes to my-control.js that I expect, but about half the time I just get my-control.html back in branch.

When this happens my-control.js has all the history, and my-control.html has all the history plus 1 or 2 commits from master.

  • What am I doing wrong?
  • Why does this sometimes happen and sometimes work?
  • What can I do to fix it?
  • Is there a way to tell Git "no, these changes should apply to this file"?
Keith
  • Git has certain rules for tracking files across renames, including some fuzzy logic. Bottom line is that you should generally avoid renaming/moving files, because there is no guarantee that the history will be preserved in Git. – Tim Biegeleisen Sep 21 '18 at 07:24
  • Agree with @TimBiegeleisen. You are playing with git fire. – eftshift0 Sep 21 '18 at 07:32
  • @TimBiegeleisen we generally would, it's not an option in this case. I've done this in CVS, SVN and TFS in the past without these issues - if what you're saying is correct this is a rather basic bug in Git that _needs_ fixing. – Keith Sep 21 '18 at 08:11
  • @eftshift0 welcome to SO. You can upvote others' comments to avoid "I agree with X" spam - look for the little up arrow to the left of the comment. – Keith Sep 21 '18 at 08:12

1 Answer

Background: file identity

This really all comes down to what I call file identity, which is a difficult problem—not just in Git, it's difficult overall: see the Wikipedia article on the philosophical issue. Git, however, makes it particularly tricky, because:

When this happens my-control.js has all the history, and my-control.html has all the history plus 1 or 2 commits from master.

Git does not have file history. Git has only commit history. More precisely, commits are the history and files are not relevant to this. The commits contain files, but that does not control the history in any way: the commits are history.

I have more about this in, e.g., my answer to Missing deletion of lines in file history (git). If you ask Git to --follow a file across renames, Git uses its history simplification to show only commits that touch the named file—and when one of those "touches" is "Git detects a rename", Git starts looking for the new name at that point, and stops looking for the old one. (Or, since Git is going backwards, it might be better to say that it starts looking for the old name and stops looking for the new one.)

This technique rather obviously can fail at merges, since one leg of the merge may have the "wrong" name. However, history simplification generally goes down only one leg of the merge anyway!

If you don't use --follow, but do use git log -- path(s) or equivalent, Git simply does not bother detecting the rename: it just does history simplification using the given path or paths.
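
The difference between the two modes is easy to see in a throwaway repository. This is only a sketch: the file names come from the question, while the temporary directory, the identity settings, and the file contents are made up for the demonstration.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"                         # scratch repository (assumption)
git init -q
git symbolic-ref HEAD refs/heads/main     # fixed branch name, whatever the local default is
git config user.email "demo@example.com"  # hypothetical identity
git config user.name "Demo"
git config commit.gpgsign false

# Commit 1: create the file under its old name.
printf 'line 1\nline 2\nline 3\nline 4\n' > my-control.html
git add my-control.html
git commit -q -m "add my-control.html"

# Commit 2: a pure rename, content unchanged.
git mv my-control.html my-control.js
git commit -q -m "rename to my-control.js"

# Commit 3: edit the file under its new name.
printf 'line 5\n' >> my-control.js
git commit -q -am "edit my-control.js"

# Count the commits each kind of query shows for the new name.
plain=$(git log --format=%s -- my-control.js | wc -l | tr -d ' ')
followed=$(git log --format=%s --follow -- my-control.js | wc -l | tr -d ' ')
echo "without --follow: $plain commits; with --follow: $followed commits"
```

The plain path query should stop at the rename commit, where my-control.js first appears, while --follow should carry on to the commit that created my-control.html.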

A slightly tortured analogy

What am I doing wrong?

Nothing, or maybe everything. The issue is that Git sometimes can, and sometimes cannot, identify that a file named Bob at one point and a file named Robert at another point refer to the same guy. It can or cannot correctly identify the file-pair. Are Bob and Robert the same guy, or not?

Why does this sometimes happen and sometimes work?

That, at least, has a solid answer: Git can identify the two files if they're sufficiently similar, and other conditions also hold. That is, you show Git two snapshots with some files ("people") in them and have it guess who's who and who moved around. If there's only one file wearing the label "Bob" in the earlier picture, and one file wearing the label "Robert" in the later picture, Git may be able to detect that they're the same guy, as long as he hasn't lost a limb or gained an extra head or some such. However, if both pictures have guys wearing "Bob" and "Robert" name-tags, Git will assume that the two "Bob"s are the same guy, and the two "Robert"s are the same guy, and that the earlier Bob is never the later Robert, nor vice versa.

Technical: git merge, the commit graph, and git diff --find-renames

Let's take a look at how git merge really works. To get there, we have to start with two things: the commit graph, and git diff --find-renames.

The commit graph is the all-important key to merging. Each commit records the raw hash ID of its parent commit if it's an ordinary commit, or, if it is a merge commit, both (or all[1]) of its parents. Typically there are only two parents of a merge. Let's draw a bit of commit graph as an example, and pick out a few specific commits to talk about. Rather than using full, big ugly hash IDs, let's use an uppercase letter to specify particular commits (and round dots for less-interesting ones). We'll have branches branch and main, which split apart at commit B but got merged at least once in the past:

          o--o---D--o--o--E   <-- branch
         /        \
...--o--B--o----C--M--o--o--F   <-- main

When we did a merge at commit M (for merge), merging branch into main, the merge base was clear and obvious: the last shared commit was B. Commit B was, and is, on both branches. So the way Git did the merge was this:

git diff --find-renames <hash-of-B> <hash-of-C>   # what we did, on main
git diff --find-renames <hash-of-B> <hash-of-D>   # what they did, on branch

Git then combined the two sets of changes, applied the combined changes to the snapshot saved in B, and made the resulting merge commit M.

Because M is a merge commit, it remembers both C and D. When Git walks through the history—which, remember, is made up of commits—it has to visit both parents whenever it moves backwards from M.

We're now going to run git checkout main; git merge branch. That is, we'll select commit F as our current commit and ask Git to merge commit E into F. Git must now find the merge base: the last commit that was on both branches.

Can you guess which commit is the merge base? It's not B this time!

Finding the merge base is all about reachability, and I'll outsource a more complete discussion to Think Like (a) Git, but the answer here is that by walking back from F through M we can reach D, and walking back from E we can reach D directly along the top line. So D is the merge base this time. Git once again runs two git diff commands:

git diff --find-renames <hash-of-D> <hash-of-F>   # what we did on main
git diff --find-renames <hash-of-D> <hash-of-E>   # what they did on branch

Each diff has a left side, commit D, and a right side, the tip commit of the particular branch. Git finds both sets of changes, including detecting renames. So if there is some file with a different name in the base and tip commit, and Git decided that this is the same file under another name—that, e.g., Bob in the left photo became labeled Robert in the right one—then Git will declare that the file was renamed.
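
The base-finding step can be reproduced with git merge-base. The sketch below rebuilds a reduced version of the drawn graph out of empty commits in a scratch repository (the directory, identity settings, and the mk helper are assumptions for illustration) and asks for the merge base twice: once before the merge M exists, and once after E and F have been added.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"                         # scratch repository (assumption)
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.email "demo@example.com"  # hypothetical identity
git config user.name "Demo"
git config commit.gpgsign false

# Hypothetical helper: make an empty commit whose message is its label.
mk() { git commit -q --allow-empty -m "$1"; }

mk B; B=$(git rev-parse HEAD)             # last commit shared by both branches
git checkout -q -b branch
mk D; D=$(git rev-parse HEAD)             # work on branch
git checkout -q main
mk C                                      # work on main

base_before=$(git merge-base main branch) # base used when creating M

git merge -q -m M branch                  # merge commit M, parents C and D
mk F                                      # more work on main
git checkout -q branch
mk E                                      # more work on branch

base_after=$(git merge-base main branch)  # base for merging E into F

[ "$base_before" = "$B" ] && echo "first merge base:  B"
[ "$base_after" = "$D" ] && echo "second merge base: D"
```

Once M records D as a parent, D is reachable from both branch tips, so it replaces B as the merge base.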

Git will now combine the two sets of changes, using the base (D) snapshot as the base on which the changes are applied. If the changes include "rename a file", Git will do the renaming too. If a file is labeled Bob in the base and Robert in both tips, then both diffs have the same rename, and all is good. If only one side renames the file, Git keeps that rename in the result and applies the other side's content changes to the renamed file, no matter which branch you are on when you run the merge.
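
To check what a merge does with a rename made on only one side, you can run the merge in both directions in a scratch repository. This is a sketch under assumptions: bob.txt and robert.txt borrow the names from the analogy, and the try-from-main and try-from-branch branches are hypothetical throwaway names.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"                         # scratch repository (assumption)
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.email "demo@example.com"  # hypothetical identity
git config user.name "Demo"
git config commit.gpgsign false

# Base commit: a ten-line file under the old name.
printf '%s\n' 1 2 3 4 5 6 7 8 9 10 > bob.txt
git add bob.txt
git commit -q -m "base"

# branch: rename only, content untouched.
git checkout -q -b branch
git mv bob.txt robert.txt
git commit -q -m "rename bob.txt to robert.txt"

# main: edit the file under its old name.
git checkout -q main
echo 11 >> bob.txt
git commit -q -am "edit bob.txt"

# Direction 1: sitting on main's commit, merge branch.
git checkout -q -b try-from-main main
git merge -q -m "merge branch into main" branch
from_main=$(git ls-files)

# Direction 2: sitting on branch's commit, merge main.
git checkout -q -b try-from-branch branch
git merge -q -m "merge main into branch" main
from_branch=$(git ls-files)

lines=$(wc -l < robert.txt | tr -d ' ')
echo "from main: $from_main; from branch: $from_branch; lines: $lines"
```

Here the rename is exact, so Git has no trouble pairing the files, and both directions should end with robert.txt carrying the extra line from main.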

Where things go really badly is if Git can't detect the rename. What if Bob lost an arm, and Git doesn't recognize that the guy labeled Robert in one of the photos is the same guy?


[1] Merges with three or more parents are called octopus merges in Git. Linux has one 66-way merge, of which Linus Torvalds remarked: that's not an octopus, that's a Cthulhu merge.


What you can do about it: the similarity index

What can I do to fix it?

The easiest, by far, is to avoid the renaming. Git believes the labels on the files—the path names—first. If the base commit and both tips all have files named bob.txt, why, that must be the same guy, Bob. There's nothing to get confused about.

The renaming has already happened, though. One way to fix it is to arrange for all future merges to use the new name: if the file should be called robert, make sure every future merge base and future branch-tip call the file robert, and there will be no confusion.

If that's not possible, there's one last hope for automation: Give Git more (or different) information. Make Git smarter, in effect: tell Git that it should match up Bob and Robert even if he has lost all his limbs.

The flag that Git has here is different in git diff vs git merge, but both use the same idea, of setting a similarity index. When Git compares two snapshots (two commits), if some file has gone missing from the left and some new file has appeared on the right, Git compares those files' content.

Using git diff --find-renames (or shorter, git diff -M), you can add a similarity index threshold:

git diff -M10

for instance. The number after M (or --find-renames=) is the minimum required similarity index for two files to be considered "the same" file, i.e., for Git to decide which ship is (or was) the Ship of Theseus, or whether a guy wearing a Bob nametag is the same guy as a guy wearing a Robert nametag.

Git's internal computation of the similarity of the two files does not change, but Git's threshold, the point at which it declares that these two different files are really the same file, does. Lowering the threshold makes Git very happy to identify differently-named files. Raising it makes Git more reluctant.

The default similarity threshold is 50%, -M50. Files that are exactly, byte-for-byte, identical are a 100% match. Others are less similar / more dissimilar. The actual formula is in my answer to Trying to understand `git diff` and `git mv` rename detection mechanism but in general the way to find a usable number is to use git diff on the merge base and the two branch tips. Set the threshold very low, run git diff, and Git will tell you which files it matched up and what their actual similarity was.
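
As a sketch of that procedure, the script below renames a file and rewrites most of it, then compares the commits with the default threshold and with -M10. The file names and contents are invented for the demonstration, and two adjacent commits stand in for the merge base and a branch tip.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"                         # scratch repository (assumption)
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.email "demo@example.com"  # hypothetical identity
git config user.name "Demo"
git config commit.gpgsign false

# "Base" snapshot: ten lines in bob.txt.
for i in 1 2 3 4 5 6 7 8 9 10; do echo "original line $i"; done > bob.txt
git add bob.txt
git commit -q -m "base"

# "Tip" snapshot: renamed to robert.txt, with only three lines left unchanged.
git mv bob.txt robert.txt
{
  for i in 1 2 3; do echo "original line $i"; done
  for i in 4 5 6 7 8 9 10; do echo "rewritten line $i"; done
} > robert.txt
git commit -q -am "rename and rewrite most of the file"

# Default 50% threshold: expect a delete plus an add, no rename.
default_view=$(git diff --name-status -M HEAD~1 HEAD)
# 10% threshold: expect one R line carrying the similarity score.
low_view=$(git diff --name-status -M10 HEAD~1 HEAD)

echo "with -M (50%):"; echo "$default_view"
echo "with -M10:";     echo "$low_view"
```

The digits after R on the second listing are the similarity index Git computed for the pair. Reading that number off the output is more reliable than predicting it, and it tells you how low the threshold has to go.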

(To find the merge base, run git merge-base --all commit1 commit2 where the two commit identifiers name the branch tip commits. You can use the branch names here, or raw hash IDs, or anything suitable to Git according to the gitrevisions documentation. You'll then have the hash ID of the base, which you can use as one of the arguments to git diff.)

You can provide the same threshold to git merge, using -X find-renames=number. You can just use a very low number, but this may find too many renames. To find out what Git will think is renamed, use git diff.
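
Here is a sketch of the same idea applied to a merge, with invented file contents. One side renames bob.txt to robert.txt and rewrites most of it, the other edits bob.txt, and the merge is tried first with the defaults and then with -X find-renames=10.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"                         # scratch repository (assumption)
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.email "demo@example.com"  # hypothetical identity
git config user.name "Demo"
git config commit.gpgsign false

# Base: ten-line bob.txt.
for i in 1 2 3 4 5 6 7 8 9 10; do echo "original line $i"; done > bob.txt
git add bob.txt
git commit -q -m "base"

# branch: rename to robert.txt and rewrite lines 4-10 (well under 50% similar).
git checkout -q -b branch
git mv bob.txt robert.txt
{
  for i in 1 2 3; do echo "original line $i"; done
  for i in 4 5 6 7 8 9 10; do echo "rewritten line $i"; done
} > robert.txt
git commit -q -am "rename to robert.txt and rewrite most of it"

# main: change only line 1, still under the old name.
git checkout -q main
{
  echo "edited on main"
  for i in 2 3 4 5 6 7 8 9 10; do echo "original line $i"; done
} > bob.txt
git commit -q -am "edit bob.txt on main"

# Attempt 1: default threshold. The rename goes undetected, so Git sees
# "modified here, deleted there" and stops with a conflict.
if git merge -q -m "default merge" branch >/dev/null 2>&1; then
  default_result=clean
else
  default_result=conflict
  git merge --abort
fi

# Attempt 2: lower the threshold so Bob and Robert are paired up.
git merge -q -X find-renames=10 -m "merge with low rename threshold" branch
merged_files=$(git ls-files)
first_line=$(head -n 1 robert.txt)
echo "default: $default_result; files after retry: $merged_files; line 1: $first_line"
```

With the lower threshold the edit made to bob.txt on main should land in robert.txt. In a real repository, check the pairings with git diff first, since a low threshold applies to every file in the merge.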

If all else fails

If none of the above is sufficient (which can happen), you are not completely out of alternatives:

Is there a way to tell Git "no, these changes should apply to this file"?

There is a totally-manual way to do a file merge, which is:

  • Start the merge with git merge --no-commit, so that Git stops before committing even if it thinks the merge succeeded.

  • Resolve whatever you can using whatever simpler methods are available.

  • If Git has mis-identified files, extract the merge base version of the file from some known or manually-chosen merge base commit. If not, it's already in the index in a stage-1 slot, so you can extract it from there. Either way, though, extract the file into the work-tree under a name you can use. For instance:

    git show $hash:$basepath > file.base
    

    Likewise, extract the "ours" and "theirs" version of the file into the work-tree:

    git show HEAD:$ourpath > file.ours
    git show MERGE_HEAD:$theirpath > file.theirs
    

    Now that you have all three versions of the file, use git merge-file to perform the three-way merge on the file: git merge-file file.ours file.base file.theirs writes the merged result (with conflict markers, if any) into file.ours. Once you have the correct merge result in your work-tree, put it under the correct name and use git add to copy it into the index, ready for committing. Make sure to remove from the index any wrong (--theirs) version that's left behind; git status will tell you about such files, if they exist.

When the merge is finished, use git commit (or in new-enough versions of Git, git merge --continue) to finish the merge.
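
The manual steps above can be sketched end to end in a scratch repository. Everything here is illustrative: the file names and contents are invented, and the merge base is looked up with git merge-base rather than chosen by hand.

```shell
#!/bin/sh
set -e
cd "$(mktemp -d)"                         # scratch repository (assumption)
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.email "demo@example.com"  # hypothetical identity
git config user.name "Demo"
git config commit.gpgsign false

# Base: ten-line bob.txt.
for i in 1 2 3 4 5 6 7 8 9 10; do echo "original line $i"; done > bob.txt
git add bob.txt
git commit -q -m "base"

# branch: rename and rewrite so much that the rename is not detected.
git checkout -q -b branch
git mv bob.txt robert.txt
{
  for i in 1 2 3; do echo "original line $i"; done
  for i in 4 5 6 7 8 9 10; do echo "rewritten line $i"; done
} > robert.txt
git commit -q -am "rename to robert.txt and rewrite most of it"

# main: edit line 1 under the old name.
git checkout -q main
{
  echo "edited on main"
  for i in 2 3 4 5 6 7 8 9 10; do echo "original line $i"; done
} > bob.txt
git commit -q -am "edit bob.txt on main"

# Start the merge; it is expected to stop with a modify/delete conflict.
git merge --no-commit branch >/dev/null 2>&1 || true

# Extract the three versions under names we control.
base=$(git merge-base HEAD MERGE_HEAD)
git show "$base:bob.txt"       > file.base
git show HEAD:bob.txt          > file.ours
git show MERGE_HEAD:robert.txt > file.theirs

# Three-way merge; the result is written into the first file.
git merge-file file.ours file.base file.theirs

# Put the result under the right name and fix up the index.
mv file.ours robert.txt
rm file.base file.theirs
git rm -q bob.txt                         # drop the leftover old-name entry
git add robert.txt
git commit -q -m "merge branch, merging bob.txt into robert.txt by hand"
```

git merge-file exits non-zero when it leaves conflict markers, so in a real merge you would inspect and edit the result before running git add.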

This—manually selecting the three files and using some kind of merge program on them—is how we did this in the bad old days before Git. Welcome to the 1990s! :-)

torek
  • Thanks! That's a very comprehensive response. It'll take me a while to try it out but +1 for now. We really don't have any choice on the renames (the current project uses HTML imports, which are being deprecated in March next year in favour of ES6 imports - while both are text it's even _more_ of a pain for the rest of the tooling to edit JS in HTML files or HTML in JS files). – Keith Sep 21 '18 at 17:05