Preserving deletes on rebasing in a fork

Question

I forked a codebase and removed a lot of code. Every now and then, I rebase my code against the original project to receive updates. In some of these updates, files I deleted in my fork are reintroduced and I'm asked by git rebase to resolve these conflicts, leading me to manually git rm these files.

Is there a way to tell git rebase "if I already removed these files in my fork, don't ever reintroduce them"?

Looks like [`git rerere`](https://git-scm.com/docs/git-rerere) can help you. There's also a very well written related [SO post](https://stackoverflow.com/a/49501436/2915738) — Asif Kamran Malick, Apr 04 '21 at 11:52
That's not really what I'm looking for. Over time it will reduce me going back to those files and redeleting them, but it also means I have to delete every file once for git to remember the conflict resolution. In addition, I am not interested in recording _changes in files_, just deletions. That means even more manual interference via git rerere forget for every merge that didn't result in file deletion. I'm looking for a general configuration option (if such exists) that applies only to deleted files. — FTA, Apr 04 '21 at 12:19
Are the files you deleted still being edited in the original repo (after you forked)? And if yes, you don't care about those edits- you just want the files gone? — TTT, Apr 06 '21 at 15:38

score 1 · Answer 1 · answered Apr 04 '21 at 23:28

The short answer is a simple (albeit not very satisfying) no. In fact, git rerere won't help either, for two reasons:

1. It is just for in-file conflicts, not for "high level" or "tree level" conflicts.
2. A reintroduction of the file won't cause a conflict at all.

That said, this "reintroduction of a file" claim is not quite right. What you're getting is the high level (or tree level) conflict mentioned in point 1 above. To understand this, we need to look at how rebase works as a series of git cherry-pick operations, and thus at how one git cherry-pick operation works.

(For what you can do, jump to the end of this answer.)

Capsule summary of rebase

I'm going to skip most of the rebase detail, and just note that git rebase:

1. enumerates some set of commits to copy;
2. does a detached HEAD git checkout (or git switch --detach, in Git 2.23 or later) of the place at which the copied commits should land;
3. copies each commit, one at a time, as if by, or sometimes literally by, invoking git cherry-pick; and
4. once all commits are copied, moves the branch name around as if by git branch -f, then re-attaches HEAD to that branch name.

The result is that if we start with, e.g.:

          I--J--K   <-- ourbranch (HEAD)
         /
...--G--H--L   <-- updated-upstream

and run git rebase updated-upstream, we get:

          I--J--K   [abandoned]
         /
...--G--H--L   <-- updated-upstream
            \
             I'-J'-K'  <-- ourbranch (HEAD)

where I', J', and K' are the copies of our commits I-J-K made by the three cherry-pick operations. The original commits still exist (and can be recovered for a while, in case the rebase went badly). It's just that they're harder to find now, because the name ourbranch now locates commit K'—the new and improved(?) copy—instead of the original commit K.
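As a rough illustration of those four steps, the same result can be approximated by hand. This is a sketch only, not how git rebase is actually implemented. The function name manual_rebase is made up for this example; it skips merge commits, does not drop commits that the upstream already contains, and simply stops at the first conflict.

```shell
#!/bin/sh
# manual_rebase <upstream> <branch>
# Hypothetical helper that imitates the four steps listed above.
manual_rebase() {
    upstream=$1
    branch=$2

    # Step 1: enumerate the commits to copy, oldest first.
    commits=$(git rev-list --reverse --no-merges "$upstream..$branch") || return 1

    # Step 2: detached HEAD at the place where the copies should land.
    git checkout -q --detach "$upstream" || return 1

    # Step 3: copy each commit, one at a time.
    for c in $commits; do
        git cherry-pick "$c" >/dev/null || return 1
    done

    # Step 4: move the branch name, then re-attach HEAD to it.
    git branch -f "$branch" HEAD && git checkout -q "$branch"
}
```

With the diagram above, this would be run as manual_rebase updated-upstream ourbranch.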

The keys to rebasing are to make sure that the set of commits enumerated in step 1 is correct, that the position in step 2 is correct, and that each copy in step 3 is correct. The branch name fiddling in step 4 is the most visible, but least important, thing in the process. This is because Git is all about commits; branch names just serve to find the commits. (The finding step is important of course! If we can't find a commit, what good is it? But there are other ways to find commits, and once we do find them, it's the commits that matter.)

A cherry-pick is a merge

Because a cherry-pick operation is a merge—though one with a twist—we should look at a normal merge first. Again, this is just a capsule summary that hides a ton of important detail, but we start with a series of commits like this:

          I--J   <-- branch1 (HEAD)
         /
...--G--H
         \
          K--L   <-- branch2

This diagram means that we are "on branch branch1", as git status would say. The current commit is commit J: that's the source for the set of files we have checked out in our working tree. The current branch is branch1. The name branch1 selects commit J (whatever its actual big ugly hash ID is).

Commit J, like every commit, has a snapshot and some set of parents. Like most commits, it has just one parent, in this case commit I. Commit I has one parent, commit H; commit H has one parent, commit G; and so on. Meanwhile the other branch name in the diagram—branch2—selects commit L. Commit L has a snapshot and a single parent K. Commit K has one parent, H. From this point on backwards—remember, Git works backwards—everything is the same as for branch branch1.

All of this means that commit H—which, like every commit, has a snapshot—is the best shared commit on both branches. Commits before H, like G, are also shared, but they're not as good because they're more-far-away, as it were, from the two branch-tip commits J and L. Git will find commit H by starting at J and L and working backwards as usual. Since it's on both branches, and is the best such commit, Git will use commit H as the merge base for a merge operation.

A merge, in Git, consists of doing two git diffs, more or less. To do two git diffs, we need three commits.¹ This lets Git run:

git diff --find-renames <hash-of-H> <hash-of-J>

to compare what's in the snapshot for commit H, vs what's in the snapshot for commit J. That comparison tells Git what we changed, in branch1.

Git then repeats this but with commit L:

git diff --find-renames <hash-of-H> <hash-of-L>

to compare what's in H vs what's in L. That comparison tells Git what they changed, in branch2.
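To see these inputs in a real repository, the merge base and the two diffs can be printed with a small helper. This is only a sketch: the function name merge_inputs is invented here, and --name-status is added just to keep the output short.

```shell
#!/bin/sh
# merge_inputs <ours> <theirs>
# Hypothetical helper: show the merge base and the two comparisons
# that a merge of <theirs> into <ours> would start from.
merge_inputs() {
    ours=$1
    theirs=$2
    base=$(git merge-base "$ours" "$theirs") || return 1
    echo "merge base: $base"
    echo "what we changed:"
    git diff --find-renames --name-status "$base" "$ours"
    echo "what they changed:"
    git diff --find-renames --name-status "$base" "$theirs"
}
```

For the diagram above, merge_inputs branch1 branch2 would report commit H as the merge base.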

If we now combine the two sets of changes, and apply the combined changes to the snapshot from H (not the one from J, not the one from L, but the one from H), this will add together our changes and their changes. The resulting combined changes, applied to snapshot H, get us a snapshot that keeps our changes but also adds their changes. Or, if you prefer, it keeps their changes but also adds our changes. The effect is the same either way, as long as there are no conflicts.

So, if all goes well here, Git keeps our changes and adds their changes, or adds our changes and their changes and applies those to the base, or however you wish to view it. The result is a new snapshot, ready to go into a new commit. Git makes this new commit on its own, and calls it a merge commit. Git remembers that it is a merge commit by making a commit with two parents, instead of the usual one:

          I--J
         /    \
...--G--H      M   <-- branch1 (HEAD)
         \    /
          K--L   <-- branch2

That's a normal, non-conflicted merge. As with any commit, Git writes the new commit's hash ID into the current branch name, so that branch1 now selects new merge commit M. Like all commits, M has a snapshot: the snapshot is the result of applying both sets of changes, after using git diff twice to find the two sets of changes. Unlike most commits, M has two parents instead of the usual one. This means that when Git goes to look at the history, by working backwards, it has to work backwards across both "forks" here.²
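The number of parents is easy to check from the command line. The helper below is a small sketch (the name parent_count is made up); it relies on git rev-list --parents printing the commit hash followed by the hash of each parent.

```shell
#!/bin/sh
# parent_count <commit>
# Hypothetical helper: print how many parents a commit has.
parent_count() {
    line=$(git rev-list --parents -n 1 "$1") || return 1
    # $line holds "<commit> <parent1> <parent2> ..."; split it into words.
    set -- $line
    echo $(($# - 1))
}
```

An ordinary commit such as J should give 1, and a merge commit such as M should give 2.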

¹Think of it this way: git diff always needs two commits. If we use the same two commits—if we run git diff J L for instance—we just get the same diff. So to get two different diff outputs, we need at least three commits. We could use four, e.g., git diff I J and git diff K L, but that wouldn't actually help us get to our goal. We want to use git diff H J and git diff H L, using H twice, hence we need three commits.

²The word fork here is meant to imply that there's actually something similar going on with GitHub forks. These are not the same thing, but since Git works backwards, if we have a history with a merge in it, Git will see the merge as a fork in the road. (And, as you may have heard, "When you come to a fork in the road, take it.") With a GitHub fork, the original dividing—H forking to I and K—happens as people make new commits. The merging, if any, happens later.

The more curious thing, though, is that since Git works backwards, what we think of as a fork, Git sees as a coming-together. These form merge base points. What we see as a merge, Git sees as a fork!

Handling conflicts

The usual merge conflicts occur when we and they both change the same lines of the same file, but in a different way:

          I--J   <-- branch1 (HEAD)
         /
...--G--H
         \
          K--L   <-- branch2

Suppose that in file F in commit H, there is a typo in the wrong word. We fix the typo, and they replace the word (or vice versa). When Git goes to merge our changes-to-file-F with their changes-to-file-F, Git will declare a merge conflict and leave us to fix up the mess.

We can do this by hand. We open the resulting work-tree file, which has both sets of changes in it, surrounded by conflict markers, and see that they fixed the typo by replacing the wrong word, so we keep their change and throw ours out by deleting our line and the conflict markers, leaving just their line. Or we can use a merge tool, which will generally show us all three files: the F from the merge base commit H, the F from our commit, and the F from their commit. The exact method by which a merge tool does this depends heavily on the tool; we won't worry about this and will just assume that we get the right result.

Alternatively, we can use -X ours or -X theirs to pick our changes or their changes and ignore the other "side"'s changes. The drawback here is that we have to know whose change is right: we pick the -X option at the time we run git merge, before we see the conflicts themselves. If, sometimes, our change is better, -X theirs won't work. If their change is sometimes better, -X ours won't work either. Sometimes you might be sure that their change, or your change, is always going to be better; that's where the -X option helps.³ If you cannot be absolutely sure, I recommend avoiding -X and just resolving conflicts yourself.

³Remember that -X tends to be "backwards" during a rebase, though: -X theirs means our original commit and -X ours means ... something complicated. See What is the precise meaning of "ours" and "theirs" in git?

High-level conflicts

In talking about conflicts above, we were looking at specific lines of one particular file. But those are not the only kinds of conflicts we can get. Suppose that, in H, there is no file named F. Instead, we write our own file F from scratch. It has stuff meant for one situation. They, meanwhile, write their own file F from scratch, and it has stuff meant for some other situation entirely. It's not correct to pick just our F, and it's also obviously not correct to pick their F. Perhaps, for instance, we should rename our F to something else, so that we can just store both files. Or perhaps it makes sense to combine their file with our file.

The key as far as Git is concerned, though, is that there was no file F in commit H at all. Git calls this an add/add conflict, and it means Git cannot resolve it on its own, not even with -X. I like to call these high level conflicts because they get generated in a part of Git that happens before doing a low-level file merge. (The low-level file merge—at least the kickoff for it—is in ll-merge.c, where ll stands for Low Level.) Others like to call this a tree conflict as the parts of Git involved in finding it are looking at the file tree structures inside Git commits.

There are other conflicts that hit this same high (or tree) level code. One occurs when you delete a file and they modify it, or vice versa: there's some file F in H, it exists in only one of J and L, and its contents in whichever commit still has it have changed. That means either we deleted F entirely or they did, and whoever didn't delete F fixed or changed something in it. This is a modify/delete conflict, and as with the add/add conflict, Git will always stop with a conflict here.
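A modify/delete conflict is easy to reproduce in a throwaway repository. The sketch below assumes a POSIX shell; the file name F and the branch names mirror the diagrams above, and everything else (the temporary directory, the demo identity, the log file) is made up for the demonstration. It passes -X ours to show that the option does not make this kind of conflict go away.

```shell
#!/bin/sh
# Scratch repository; nothing here touches an existing project.
repo=$(mktemp -d)
log=$(mktemp)
cd "$repo" || exit 1
git init -q .
git config user.name Demo
git config user.email demo@example.com

# Commit H: the merge base contains file F.
printf 'one\n' > F
git add F && git commit -q -m 'H: add F'
git branch -M branch1
git branch branch2

# Our side (branch1): delete F entirely.
git rm -q F && git commit -q -m 'J: delete F'

# Their side (branch2): modify F.
git checkout -q branch2
printf 'one\ntwo\n' > F
git commit -q -am 'L: modify F'

# Merge their side into ours. The merge stops despite -X ours.
git checkout -q branch1
git merge -X ours branch2 > "$log" 2>&1
status=$?
cat "$log"
echo "merge exit status: $status"

# Unmerged index entries: F at stage 1 (base) and stage 3 (theirs),
# with no stage 2 (ours) entry because we deleted the file.
git ls-files -u
```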

The cherry-pick merge

When we (humans) run git cherry-pick, we generally want to copy a commit. That is, suppose we have this series of commits on some branch:

...--o--o--P--C--o--...--tip   <-- branch1

There is some child commit C with parent P. If we have Git run:

git diff <hash-of-P> <hash-of-C>

(or just git show <hash-of-C>, which includes this diff) we'll see what changed between the snapshot in P and the snapshot in C.

Meanwhile, we're on some other commit, perhaps on some other branch entirely:

...--H--I--J   <-- branch2 (HEAD)

We have discovered that the difference between P and C is just what we need to add, after our commit J, to make a new commit C' or K or whatever we choose to call it. This will be a copy of C, as it were. We would, in other words, like to have Git run the git diff P C to figure out what changed, then find the same code in our commit J and make that same change.

If all goes well, we'll get our commit C'. Git will even copy the commit message from commit C, so that git show on our new commit will have the same log message. The difference between snapshot J and snapshot C' will be the same as the difference between P and C, except maybe for line numbers. So we'll call this new commit C' to show just how close it is to C.

But: how should Git know which lines match up? Maybe, in the P-vs-C, the change is really close to the top of the file, but in our version of the file in J, there is a bunch of new stuff at the top of the file. Or, maybe in P-vs-C, the change is way down in the middle of the file, but in J we don't have all that extra stuff, and it's up close to the top of the file.

What we need Git to do, then, is run git diff P J. That will tell Git what's different from P to J, so that it can line up the P-vs-C changes.

But if we are going to have Git diff P vs J, and P vs C ... that sounds a whole lot like git merge, doesn't it? Suppose we have Git do these two diffs, and then treat commit P as a merge base commit. To the snapshot in P, Git will add all "our" changes in J, so that we keep our changes. To the snapshot in P, Git will add all "their" changes in C, so that we gain their changes. That will give us the right combination, so that we keep what we had, but add their P-vs-C changes.

So this is what git cherry-pick does. It treats their parent commit P as the merge base, J—our HEAD—as our commit, and C—their child commit—as their commit. All the -X options work as before, with -X ours meaning P vs J and -X theirs meaning P vs C.
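To make the three roles concrete, here is a small sketch that prints them for a commit you are about to cherry-pick. The function name cherry_pick_inputs is invented, --name-status is only there to shorten the output, and the sketch assumes the commit has exactly one parent.

```shell
#!/bin/sh
# cherry_pick_inputs <commit>
# Hypothetical helper: show the merge base, "ours" and "theirs" that
# git cherry-pick <commit> would use, plus the two comparisons.
cherry_pick_inputs() {
    theirs=$(git rev-parse --verify "$1^{commit}") || return 1
    base=$(git rev-parse --verify "$theirs^") || return 1   # P, the parent of C
    ours=$(git rev-parse --verify HEAD) || return 1          # J, the current commit
    echo "base   (P): $base"
    echo "ours   (J): $ours"
    echo "theirs (C): $theirs"
    echo "P vs J:"
    git diff --find-renames --name-status "$base" "$ours"
    echo "P vs C:"
    git diff --find-renames --name-status "$base" "$theirs"
}
```

A file that shows up as D (deleted) under "P vs J" and as M (modified) under "P vs C" is a candidate for a modify/delete conflict.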

Now, in your commits, you have deleted some files entirely. They haven't. So P-vs-J will say delete this swath of files. In their commit C, they may have changed one of their files. This is a modify/delete conflict, with your "side" of the merge having deleted the file, and their side modifying the file. Since this is a high-level / tree-level conflict, Git will stop, regardless of any -X options. You will have to resolve this conflict, by confirming that you want the file deleted.

What you can do to make your life easier

You could write your own script, to be run in the case that git cherry-pick (or git rebase's cherry-pick) produces a merge conflict that includes modify/delete conflicts. You can find these conflicts by inspecting Git's index. See git ls-files --stage output—note that it is very long—and look for cases where there is a file in stage 1 (the merge base, i.e., their P commit) and stage 3 (their C) but absent in stage 2 (HEAD, i.e., your commit J or equivalent). Resolve the conflict by deleting the two entries in stages 1 and 3. You can do this programmatically using a list of files you know you deliberately deleted, for instance. After that, git status will tell you if there are any other conflicts remaining.
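A minimal sketch of such a script follows. The function name resolve_deleted_paths and its arguments are made up for illustration, and it assumes path names contain no newlines. With ours it looks for the pattern described above (stages 1 and 3 present, stage 2 absent). With theirs it looks for the mirror image (stages 1 and 2 present, stage 3 absent), which is what shows up when the commit being replayed is the one that deletes the file, as happens while rebasing a fork, since "ours" and "theirs" trade places there. The optional second argument is a text file listing, one per line, the paths you deliberately deleted; anything not on the list is left for you to resolve.

```shell
#!/bin/sh
# resolve_deleted_paths ours|theirs [listfile]
# Hypothetical helper: confirm deletions for modify/delete conflicts.
resolve_deleted_paths() {
    case ${1:-} in
        ours)   want=13 ;;   # stages 1 and 3 present, stage 2 absent
        theirs) want=12 ;;   # stages 1 and 2 present, stage 3 absent
        *) echo "usage: resolve_deleted_paths ours|theirs [listfile]" >&2
           return 2 ;;
    esac
    list=${2:-}

    # Each unmerged entry looks like "<mode> <hash> <stage><TAB><path>".
    # Entries for one path arrive in stage order, so concatenating the
    # stage numbers yields strings such as "13", "12" or "123".
    git ls-files -u -z | tr '\0' '\n' | awk -v want="$want" '
        {
            tab = index($0, "\t")
            split(substr($0, 1, tab - 1), meta, " ")
            path = substr($0, tab + 1)
            stages[path] = stages[path] meta[3]
        }
        END { for (p in stages) if (stages[p] == want) print p }
    ' | while IFS= read -r path; do
        if [ -n "$list" ] && ! grep -Fxq -- "$path" "$list"; then
            continue    # not a deliberate deletion: leave it alone
        fi
        git rm -q -- "$path"
    done
}
```

Using git rm here removes the remaining index entries for the path and also deletes the copy that Git left in the work-tree.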

Annoyingly, there is no way to make Git run this script automatically on conflicts. However, if you have the script detect whether any conflicts remain and whether git status tells you that you're in the middle of a rebase, you can have it run git rebase --continue appropriately, which at least reduces everything to a single shell command.
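The "continue if everything is resolved" part might look like the sketch below. The function name continue_rebase_if_clean is invented. Instead of parsing git status output, it checks for the rebase-merge and rebase-apply directories inside the Git directory; that is a swapped-in technique that relies on where current Git versions happen to keep their rebase state, so treat it as an assumption.

```shell
#!/bin/sh
# continue_rebase_if_clean
# Hypothetical helper: run "git rebase --continue" only when a rebase
# is in progress and no unmerged paths remain.
continue_rebase_if_clean() {
    git_dir=$(git rev-parse --git-dir) || return 1

    # Any unmerged paths left? Then a human still has work to do.
    remaining=$(git diff --name-only --diff-filter=U)
    if [ -n "$remaining" ]; then
        echo "conflicts remain in:" >&2
        echo "$remaining" >&2
        return 1
    fi

    # Assumption: an in-progress rebase leaves one of these directories.
    if [ -d "$git_dir/rebase-merge" ] || [ -d "$git_dir/rebase-apply" ]; then
        # GIT_EDITOR=true accepts the existing commit message unchanged.
        GIT_EDITOR=true git rebase --continue
    else
        echo "no rebase in progress" >&2
        return 1
    fi
}
```

If a later commit in the same rebase hits another conflict, git rebase --continue stops again and the function returns its non-zero status.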

TTT · Answer 2 · answered Apr 06 '21 at 20:07

If you assume the answer is no, then you are left with the question of what to do instead.

I think the suggestion at the bottom of the answer by torek is probably the most versatile approach, as you have the flexibility to control the conflict resolution however you'd like inside of your custom script. I have done something similar before with good results on a known list of files that were expected to sometimes have conflicts.

That being said, because you are rebasing instead of merging, your willingness to rewrite your existing commits makes me think you could try a fairly simple approach. You mentioned you didn't want to delete the files again every time, but what if you automated the delete? In other words, write a script to simply delete all the files you don't want, and at the end of the script it can commit the change. Your new workflow would be:

1. Check out the commit from the latest copy of the repo you want.
2. Run your script to delete the files and make a new commit with that change.
3. Use git rebase --onto to replay the range of commits from your old copy of the branch (starting after the commit that deletes the files) onto the new commit in step #2.

All of these steps should be straightforward to automate, so you can just run a single script any time you want to update your branch. In this way there shouldn't be any conflict resolution needed at all, at least not for any of the deleted files.
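The three steps might be scripted roughly as follows. This is a sketch under assumptions that are not part of the answer: the function name update_fork is invented, the list of unwanted paths lives in a plain text file outside the working tree (one path per line), and the caller knows the hash of the previous delete commit on the branch.

```shell
#!/bin/sh
# update_fork <upstream> <branch> <old-delete-commit> <listfile>
# Hypothetical helper for the delete-then-replay workflow.
update_fork() {
    upstream=$1
    branch=$2
    old_delete=$3
    list=$4

    # Step 1: check out the latest upstream commit (detached HEAD).
    git checkout -q --detach "$upstream" || return 1

    # Step 2: delete every listed path and commit the result.
    # --ignore-unmatch tolerates paths that upstream no longer has;
    # --allow-empty keeps a delete commit even if nothing was removed.
    while IFS= read -r path; do
        git rm -q -r --ignore-unmatch -- "$path" || return 1
    done < "$list"
    git commit -q --allow-empty -m "Delete files not wanted in this fork" || return 1
    new_delete=$(git rev-parse HEAD)

    # Step 3: replay everything after the old delete commit onto the new one.
    git rebase --onto "$new_delete" "$old_delete" "$branch"
}
```

Since the new delete commit becomes the value to pass as the old delete commit next time, it may help to record it somewhere, for instance with a tag.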