
I moved some directories.

When I merge, there are many conflicting files, since other developers have committed their changes. Both egit Merge Tool and git mergetool say that the file was deleted locally or remotely. See image.

How do I merge these changes?

enter image description here

Joshua Fox
  • This is a mess, because one version of the story is that other developers worked on certain files, and the other story is that these files were deleted (moved). Ideally, no changes should be made to files at the same time you plan on moving or renaming them. – Tim Biegeleisen May 01 '17 at 09:04
  • How many files are in conflict? – Tim Biegeleisen May 01 '17 at 09:08
  • About 150. (I moved my entire source directory in order to work with a specific Maven setup.) – Joshua Fox May 01 '17 at 12:04
  • This could get ugly. You could copy the contents of the files in conflict from their original location (where your collaborators edited them) and overwrite the new location. Better would be to scrap your changes, pull the latest, and then move the folders telling everyone else to stop working until you are done. – Tim Biegeleisen May 01 '17 at 12:16
  • Thank you. I realize that Git does not track file renames. But it does have some capabilities for content tracking. This movement of directories is big enough to require some further development of tooling and some testing; I can't ask other developers to stop working during that time. Is there some special way to move files that would allow the history to be preserved? http://stackoverflow.com/questions/1094269/whats-the-purpose-of-git-mv – Joshua Fox May 01 '17 at 12:22
  • And egit does seem to support this: https://www.eclipse.org/forums/index.php/t/204077/ – Joshua Fox May 01 '17 at 12:28
  • Wait...is your question about how to do this merge, or how to preserve history? – Tim Biegeleisen May 01 '17 at 12:50
  • Primarily -- how to do the merge. But also, I'd like to preserve history. – Joshua Fox May 01 '17 at 13:22

1 Answer


File history and rename detection

You never really need to worry about "preserving history" in Git. Git does not have file history at all, it has only commit history. That is, each commit "points to" (contains the hash ID of) its parent—or, for a merge, both its parents—and this is the history: commit E is preceded by commit D, while commit D is preceded by commit C, and so on. As long as you have the commits, you have the history.

That said, Git can try to synthesize the history of one specific file, using git log --follow. You specify a starting commit and a path name, and Git checks, commit-by-commit, to see if the file was renamed when comparing the current commit's parent to the current commit. This uses Git's rename detection to identify that file a/b.txt in commit L (left) is "the same file" as file c/d.txt in commit R (right).
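As a small illustration, here is a sketch that builds a throwaway repository, renames a/b.txt to c/d.txt, and compares git log with and without --follow. The temporary directory, the commit identity, and the file contents are made up for the demo.

```shell
set -e
repo=$(mktemp -d)          # scratch repository, safe to delete afterwards
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# First commit: the file under its original name.
mkdir a
printf 'line 1\nline 2\nline 3\n' > a/b.txt
git add a/b.txt
git commit -q -m "add a/b.txt"

# Second commit: a pure rename, contents untouched.
mkdir c
git mv a/b.txt c/d.txt
git commit -q -m "move a/b.txt to c/d.txt"

# Count the commits each form of git log reports for the new path.
plain_count=$(git log --oneline -- c/d.txt | wc -l | tr -d ' ')
follow_count=$(git log --oneline --follow -- c/d.txt | wc -l | tr -d ' ')
echo "without --follow: $plain_count commit(s)"
echo "with --follow:    $follow_count commit(s)"
```

Without --follow, only the commit that introduced the name c/d.txt is listed; with it, Git also walks back to the commit that created a/b.txt.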

Rename detection has a lot of fiddly knobs, but at the base level, it's basically this:

  • Git looks at all the file names in commit L.
  • Git looks at all the file names in commit R.
  • If a file name present in L is missing from R while a new name appears in R, such as a/b.txt is gone and c/d.txt is all-new, why, that's a candidate for a detected rename.
  • Now that there are candidates (unpaired L files and unpaired R files), Git compares the contents of these unpaired files.

Unpaired files go into a pairing queue (one for L, one for R), and Git hashes the contents of all the files. It already has the internal Git hash so it compares all those directly, first. If a file is completely unchanged, it has the same Git hash ID (but different names) in L and R, and can be immediately paired-up and removed from the pairing queues.

Now that exact-matches are taken out, Git tries the long slow slog. It takes one unpaired L file, and computes a "similarity index" for every R file. If some R file is sufficiently similar—or several are—it takes the "most similar" R file and pairs it with the L file. If no file is sufficiently similar, the L file remains unpaired (is taken out of the queue) and is considered "deleted from L". Eventually there are no files in the unpaired L queue, and whatever files remain in the unpaired R queue, those files are "added" (new in R). Meanwhile, all paired-up files have been renamed.

What this means is: When comparing (git diff) commit L to R, if two files are sufficiently similar, they get paired up as a rename. The default similarity threshold is 50%, so the files need to be at least a 50% match (whatever that means; the similarity index computation is somewhat opaque), but an exact match is much easier and faster for Git.
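The pairing described above can be watched with git diff --name-status. The following is only a sketch: the repository, the old/ and new/ directory names, and the numeric file contents are invented for the demo. One file is moved untouched, one is moved and lightly edited, and one is moved and completely replaced.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"

# Commit L: three files under old/.
mkdir old
seq 1 40    > old/exact.txt
seq 101 140 > old/edited.txt
seq 201 240 > old/rewritten.txt
git add old
git commit -q -m "commit L"

# Commit R: move all three; leave one untouched, tweak one, replace one.
mkdir new
git mv old/exact.txt     new/exact.txt
git mv old/edited.txt    new/edited.txt
git mv old/rewritten.txt new/rewritten.txt
echo "one extra line" >> new/edited.txt
seq 901 940 > new/rewritten.txt
git add new
git commit -q -m "commit R"

# -M asks for rename detection explicitly (it is the default since Git 2.9).
status=$(git diff -M --name-status HEAD~1 HEAD)
echo "$status"
```

The untouched file is reported as R100 (an exact match), the lightly edited one as a rename with a lower score, and the replaced one as a separate delete (D) and add (A) because nothing in R is similar enough to pair with it.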

Note that git log --follow enables rename detection (on just one target R file, as we're working backwards through the log, comparing the parent commit to just the one file whose name we know in the child). Since Git version 2.9, both git diff and git log -p have rename detection turned on by default. In older versions, you had to pass the -M option (which can also set the similarity threshold), or configure diff.renames to true, to get git diff and git log -p to do rename detection.
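To see the threshold at work, here is a sketch in which a file is renamed and roughly half of it is rewritten. The file names, contents, and the 90% and 10% thresholds are arbitrary choices for the demo; they are picked far from 50% so the outcome does not hinge on how Git rounds the score.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo User"
git config user.email "demo@example.com"
git config diff.renames true   # only needed before Git 2.9; harmless later

# Ten lines in the original file.
for i in 1 2 3 4 5 6 7 8 9 10; do echo "original line number $i"; done > old.txt
git add old.txt
git commit -q -m "add old.txt"

# Rename it and replace the last five lines, so about half the content survives.
git mv old.txt new.txt
{
  for i in 1 2 3 4 5; do echo "original line number $i"; done
  for i in 6 7 8 9 10; do echo "replaced line number $i"; done
} > new.txt
git add new.txt
git commit -q -m "rename and half-rewrite"

# Same pair of commits, two different similarity thresholds.
strict=$(git diff -M90% --name-status HEAD~1 HEAD)
loose=$(git diff -M10% --name-status HEAD~1 HEAD)
echo "-M90%:"; echo "$strict"
echo "-M10%:"; echo "$loose"
```

With the strict threshold the change shows up as a delete plus an add; with the loose one the same two files are paired as a rename.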

There is also a maximum length for the pairing queues. This has been doubled twice, once in Git 1.5.6 and once in Git 1.7.5. You can control it yourself: it is configurable as diff.renameLimit and merge.renameLimit. At the time of writing (2017), the default limits are 400 and 1000 respectively. (If you set these to zero, Git uses its own internal maximum, which can chew up enormous amounts of CPU time; that's why these two limits exist in the first place. If you set diff.renameLimit but not merge.renameLimit, git merge uses your diff setting.)
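Setting the limits is an ordinary git config call. In this sketch the values 2000 and 5000 are illustrative only, not recommendations, and the settings are written to a scratch repository rather than to your global configuration.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q

# Per-repository limits (example values).
git config diff.renameLimit 2000
git config merge.renameLimit 5000

# Read them back to confirm what Git will use here.
diff_limit=$(git config --get diff.renameLimit)
merge_limit=$(git config --get merge.renameLimit)
echo "diff.renameLimit=$diff_limit merge.renameLimit=$merge_limit"

# A one-off override that leaves the config file alone would look like:
#   git -c merge.renameLimit=0 merge <commit-specifier>
```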

This leads to a rule of thumb that applies to git log --follow: If possible, when you intend to rename some file or set of files, commit the rename step by itself, without changing any of the file contents. If possible, keep the number of renamed files fairly small: at or below 400, for instance. You can commit more renames in multiple steps, 400 at a time. But remember that you're trading off git log --follow ability and speed against cluttering up your history with pointless commits: if you need to rename 50000 files, maybe you should just do it.

But how does this affect merging? Well, git merge, like git log --follow, does always turn on rename detection. But which commit is L and which commit or commits are R?

Merging and rename detection

Whenever you run:

git merge <commit-specifier>

Git has to find the merge base between your current (HEAD) commit and the specified other commit. (Usually this is just git merge <branchname>. That selects the tip commit of that other branch by resolving the branch name to the commit to which it points. By the definition of "branch name" in Git, that's the tip commit of that branch, so that this "just works". But you can specify any commit by hash ID, for instance.) Let's call this merge base commit B (for base). We already know that our own commit is HEAD, though some things call this "local". Let's call the other commit O (for other), though some things call this "remote" (which is silly: nothing in Git is remote!).

Git then does, in effect, two git diffs. One compares B vs HEAD, so for this particular diff, L is B and R is HEAD. Git will detect, or fail to detect, renames according to the rules we saw above. Then Git does the other git diff, which compares B to O. Git will detect or fail to detect renames according to the same rules yet again.

If some file is renamed in B-vs-HEAD, Git diffs its contents as usual. If some file is renamed in B-vs-O, Git diffs its contents as usual. If a single B file F is renamed to two different names in HEAD and O, Git declares a rename/rename conflict on that file, and leaves both names in the work-tree for you to clean up. If it's renamed in only one diff (that is, it's still called F in either HEAD or O), then Git stores the file in the work-tree using the new name from whichever side renamed it. In any case, Git tries to combine the two sets of changes (from B-vs-HEAD and B-vs-O) as usual. [1]
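Here is a sketch of the simple case, where only one side renames the file. The branch names mainline and other, the file F.txt, and the moved/ directory are all made up for the demo. It prints the two diffs against the merge base and then runs the merge.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/mainline   # fix the initial branch name
git config user.name "Demo User"
git config user.email "demo@example.com"

# B (merge base): one file, F.txt.
for i in 1 2 3 4 5 6 7 8 9 10; do echo "line $i"; done > F.txt
git add F.txt
git commit -q -m "B: add F.txt"

# O: another developer edits F.txt under its old name.
git checkout -q -b other
echo "line 11 added by the other developer" >> F.txt
git commit -q -am "O: edit F.txt"

# HEAD: we move F.txt without touching its contents.
git checkout -q mainline
mkdir moved
git mv F.txt moved/F.txt
git commit -q -m "HEAD: move F.txt"

# The two diffs that git merge performs, in effect.
base=$(git merge-base HEAD other)
echo "B vs HEAD:"; git diff -M --name-status "$base" HEAD
echo "B vs O:";    git diff -M --name-status "$base" other

git merge -q --no-edit other
ls moved
```

After the merge, the other developer's edit lives in moved/F.txt and no F.txt remains at the old location.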

Of course, for Git to detect the rename, the contents of the file must be sufficiently similar, as always. This is particularly problematic for Java files (and sometimes Python as well), where the file names become embedded in import statements. If a module consists mostly of import statements, with just a few lines of code of its own, the rename-induced changes will overwhelm the remaining file contents, and the files will not be even a 50% match.

There is a solution, though it is a bit ugly. As with the rule of thumb for git log --follow, we can commit just the renames first, and then commit the content-changing "fix all the imports" as a separate commit. Then, when we go to merge, we can do two or even three merges:

git checkout ...  # whatever branch we plan to merge into
git merge <hash>  # merge with everything just before the Great Renaming

Since no files are renamed, this merge will go as well, or as poorly, as usual. Here's the result, in graph form. Note that the hash we supplied to the git merge command was the hash of commit A, just before R that does all the renames:

...--*--o--...--o--M    <-- mainline
      \           /
       o--o--...-A--R--...--o   <-- develop, with renames at R

Then:

git merge <hash of R>

Since every file's content is completely identical in A and R (only the names differ), and the merge base is commit A, the effect here is merely to pick up all the renames. We keep the file contents from HEAD commit M, but the names from R. This merge should succeed automatically:

...--*--o--...--o--M--N    <-- mainline
      \           /  /
       o--o--...-A--R--...--o   <-- develop, with renames at R

and now we can git merge develop to proceed to merge the development branch.
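The whole sequence can be rehearsed in a scratch repository before trying it for real. This is a sketch under assumptions: the branch names match the graph above, but the file src/app.txt, its contents, the new location src/main/java, and the variables hash_a and hash_r are invented for the demo.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/mainline   # fix the initial branch name
git config user.name "Demo User"
git config user.email "demo@example.com"

# Common ancestor (*): ten lines in src/app.txt.
mkdir src
for i in 1 2 3 4 5 6 7 8 9 10; do echo "line $i"; done > src/app.txt
git add src
git commit -q -m "base"

# develop: ordinary work (A), then a pure rename (R), then a content fix.
git checkout -q -b develop
echo "notes" > notes.txt
git add notes.txt
git commit -q -m "A: ordinary work"
hash_a=$(git rev-parse HEAD)

mkdir -p src/main/java
git mv src/app.txt src/main/java/app.txt
git commit -q -m "R: rename only"
hash_r=$(git rev-parse HEAD)

sed 's/^line 1$/line 1 fixed after the move/' src/main/java/app.txt > app.tmp
mv app.tmp src/main/java/app.txt
git commit -q -am "fix contents after the move"

# mainline: a colleague edits the file under its old name.
git checkout -q mainline
sed 's/^line 10$/line 10 edited on mainline/' src/app.txt > app.tmp
mv app.tmp src/app.txt
git commit -q -am "mainline edit"

git merge -q --no-edit "$hash_a"   # merge M: everything before the renaming
git merge -q --no-edit "$hash_r"   # merge N: picks up only the renames
git merge -q --no-edit develop     # the rest of develop

git --no-pager log --oneline --graph
```

At the end, src/main/java/app.txt carries both the mainline edit and the post-rename fix from develop, and the old path is gone.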

In many cases, we won't need to make merge M, but it might not be a bad idea to do it anyway if we need to make merge N just for all the renames. The reason is that commit R is not functional: it has the wrong names for imports. Commit R must be skipped during bisection. This means that merge N is similarly non-functional and must be skipped during bisection. It might be good to have M present, since M could actually work.

Note that if you do any of this, you are distorting / contorting your source code just to please your version control system. This is not a good situation. It may be less bad than your other alternatives, but don't tell yourself it's good.


[1] I still need to see what happens to the two copies of the file when there is a rename/rename conflict. Since Git leaves both names in the work-tree, do both names contain the same merged contents, plus any conflict markers if needed? That is, if the file was named base.txt and is now named head.txt and other.txt, do the work-tree versions of head.txt and other.txt always match?

torek
  • Thank you! Very thorough. I need to move my *src* directory with no change to content (for a filesystem structure for Maven-based tools). Can I use Eclipse to move the files, then commit in Eclipse? It seems that even for hundreds of files, an identical pair will be found for each. Or should I have a script that does *git mv* and commit for each individual file? I'd have hundreds of commits but guarantee that each file is paired. After doing this, developers on the pre-move branch will be editing files and so will I, but I will be able to easily merge. Does that make sense? – Joshua Fox May 03 '17 at 06:48
  • I know nothing of Eclipse. As long as it just *runs* Git commands you should get the same behavior, but it seems from other remarks and questions that eGit is its own Java implementation of Git, so it may have its own different quirks. For instance, it might do its own pairing, with different limits. I would note again that creating many commits just to keep the VCS happy is not good: if it all works with just one such commit, that's better. – torek May 03 '17 at 07:10