Remove pair of old git commits

Question

Is it possible to just remove two old commits out of a git repository?

For example, take this timeline: [thousands of commits] > A > B > C > D > [thousands of commits] > HEAD

I want to remove "B" and "C", but without altering any of the history beginning with "D"

Some notes:

"A" and "C" are functionally identical
"B" is basically "delete every file in the repository"
"C" is basically "add every file in the repository"
There are no branches or alternate paths from A to D (this portion of our repository is all converted from another source control system which didn't support branching, so it's very linear for thousands of commits in either direction)
Our repository is now hosted on GitHub, and there are countless branches, pull requests, and local clones of this repository, all dating from thousands of commits after "D"

If this is possible to correct, I'd love to do it, just because it effectively breaks any "blame" functions by shadowing any commits prior to this. Less importantly, it also breaks many of GitHub's "graph" functions, since these two massive commits throw the scaling off by so much.

I've looked into reverting the two commits, but it doesn't really help any of the "blame" functions (it just moves the blame for every line from "C" to the new revert commit). It sounds like a rebase is what I'm looking for, but how will this impact any of the active work near the end of the branch?

score 1 · Answer 1 · answered Sep 30 '16 at 20:01

You can. How to do it depends on whether you have pushed these changes up to a remote.

Unpushed Changes

If the changes have not yet been pushed to a remote, you can simply rebase the good commits (D and its descendants) on top of the good base commit (A), excluding the last bad commit (C) and all of its ancestors:

git rebase --onto <commit-sha-of-A> <commit-sha-of-C> <branch>

Here --onto names the commit to rebase onto (A). The second argument (C) marks the ancestry to exclude: C and everything before it are left out. The final argument is the branch to rewrite, whose tip is D or one of its descendants; every commit after C on that branch is replayed on top of A.

You go from:

-> A -> B -> C -> D

to:

-> A -> D
   \--> B -> C
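As a rough illustration, the following sketch builds a throwaway repository with the A > B > C > D shape from the question (plus one later commit, E) and runs the rebase on it. The file name, commit messages, and identity are invented for the demo, and it only touches a fresh temporary directory.

```shell
#!/bin/sh
# Toy repository only: file name, messages, and identity are assumptions.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "A"
a=$(git rev-parse HEAD)

git rm -q file.txt                 # B: delete every file
git commit -q -m "B"

echo "one" > file.txt              # C: add every file back, identical to A
git add file.txt
git commit -q -m "C"
c=$(git rev-parse HEAD)

echo "two" >> file.txt
git commit -q -am "D"
echo "three" >> file.txt
git commit -q -am "E"

branch=$(git symbolic-ref --short HEAD)
tree_before=$(git rev-parse "HEAD^{tree}")

# Replay everything after C (here D and E) on top of A.
git rebase -q --onto "$a" "$c" "$branch"

tree_after=$(git rev-parse "HEAD^{tree}")
subjects=$(git log --reverse --format=%s | tr '\n' ' ')
echo "history: $subjects"
```

Because A and C hold the same files, the patches for D and E apply cleanly and the final tree is unchanged; only the commit IDs of D and E differ afterwards.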

Pushed Changes

If you have already shared these changes, rebasing rewrites the history of the branch and can cause extra work for your teammates, so notify people of the impending change. First, fix the issue in your local repository using the method above. When you are ready, force-push the rewritten branch to the remote:

git push --force <remote> <branch>

Anyone affected by this change will run into conflicts with the remote if they have made changes of their own. They will need to fetch the rewritten branch and rebase their own commits, if any, on top of it using the same method.

Hope this helps!

score 1 · Answer 2 · answered Sep 30 '16 at 20:07

As Matt Meng writes, you can use git rebase to remove commits from the history. This has the undesirable side effect of creating a completely new version history from commit D onward. If you are working on this project by yourself, the side effects are minimal. If you are working on a team, rewriting the version history can cause serious problems, because everyone will need to rebase their own work onto the new tree.

Alternatively, you can use git revert which will create new commits that "undo" the changes in the given commits.

score 0 · Answer 3 · edited May 23 '17 at 10:34

Consider using `git replace`

I am not going to put this into the answer since git replace literally does not remove the two commits, but it's probably the right solution. It lets you pretend they are gone, is transferable to other repositories, and does not renumber every copied commit. In any case there are existing SO answers that cover this, e.g., "How do git grafts and replace differ? (Are grafts now deprecated?)"

To understand the reason you might choose git replace, and why rebase is probably wrong, read on.

Not rebase but filter-branch

While git rebase is fine for smaller cases, in your case, the "D > [thousands of commits] > HEAD" part is problematic.

The reason is that rebase, by default, drops merge commits entirely and flattens the history it copies. Presumably there are branch-and-merge sequences in the "thousands of commits" section.

Rebase has a -p or --preserve-merges flag, but this does not, in a strict sense, preserve the merges. Instead, it re-performs the merges. Because of the nature of a rebase, this is quite necessary for some cases—but since the case you're dealing with is more specific, it's not necessary for your particular problem. Attempting to re-perform the merges is likely to be disastrous.

What this means is that you probably don't want rebase after all. You may want git filter-branch.

You can't quite get what you want

Any operation that results in the removal of the two "bad" commits B and C is going to mean that Git will have to copy original commit D to an altered commit D'. The new commit, D', will store the same source tree as commit D. It can (and should) have the same author and committer and their time stamps. It can and should have the same commit message, as well. But it will have, as its parent commit, commit A instead of commit C.

This means that the new commit D' will have at least one thing changed, compared to the original D. It will therefore have a different SHA-1 ID.

Now, in the "[thousands of commits]" section, let's say the next commit is E. You'd like to preserve E as much as possible—but commit E lists commit D as its parent, by D's SHA-1 ID.

We had to copy D to D' so that we could change D's parent. The new copy, D', has a different SHA-1 ID. This means we are now forced to copy E to E'. E' is exactly like E was, except that as its parent, it lists commit D'.

Copying E to E' forces us to copy whichever commit comes after E (such as F) to F'. That forces us to copy its subsequent commits, which continues on all the way down to the tips of every branch that eventually works its way back to D (now D').
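The chain reaction can be seen in a small sketch. It assumes a throwaway repository with made-up file names, messages, identity, and pinned dates, and uses git commit-tree to build D' and E' by hand, so that each copy differs from its original only in its parent.

```shell
#!/bin/sh
# Toy repository only: file name, messages, identity, and dates are assumptions.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"
# Pin the timestamps so each copy below differs from its original only in its parent.
export GIT_AUTHOR_DATE="2016-09-30 20:00:00 +0000"
export GIT_COMMITTER_DATE="2016-09-30 20:00:00 +0000"

echo "one" > file.txt
git add file.txt
git commit -q -m "A"
a=$(git rev-parse HEAD)
git commit -q --allow-empty -m "C"   # same tree as A, like C in the question
echo "two" >> file.txt
git commit -q -am "D"
d=$(git rev-parse HEAD)
echo "three" >> file.txt
git commit -q -am "E"
e=$(git rev-parse HEAD)

# D': D's tree and message, but with A as parent instead of C.
d_prime=$(git commit-tree "${d}^{tree}" -p "$a" -m "D")
# E': E's tree and message, but with D' as parent instead of D.
e_prime=$(git commit-tree "${e}^{tree}" -p "$d_prime" -m "E")

echo "D  $d"
echo "D' $d_prime"
echo "E  $e"
echo "E' $e_prime"
```

The trees of D and D' are the same object, yet the commit IDs differ, and that difference is what forces E to become E'.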

`git filter-branch`

This is what git filter-branch does: for every commit you tell it to examine, it extracts that commit, applies each of your filters, then makes a new commit that is as exact a copy as possible (but no more exact than that). If you manage, through your filters, to make a bit-for-bit identical commit—e.g., if you apply your filters to commit A in the "A comes before B" part of the chain, and wind up not changing anything about A—then the new commit has the same ID as the original commit, i.e., is the original commit. Otherwise—if any data in the commit have changed, whether that's a parent ID, or a tree ID, or a single bit in the commit message—you get a new, different commit: an A', as it were.

While filter-branch is making all of these copies, it writes a map file that, for each copied commit, says "old commit ID X becomes new commit ID X'". And, filter-branch allows you to skip a commit on purpose.¹

Hence, what you want to do here is git filter-branch with a --commit-filter or --parent-filter.

If using a commit filter, you will simply skip commits B and C. We could start with this example straight out of the documentation:

To remove commits authored by "Darl McBribe" from the history:

git filter-branch --commit-filter '
    if [ "$GIT_AUTHOR_NAME" = "Darl McBribe" ];
    then
            skip_commit "$@";
    else
            git commit-tree "$@";
    fi' HEAD

We would then need at least two changes, and probably a third:

Instead of filtering HEAD, we want to filter --all (spelled -- --all since we need to pass --all to git rev-list rather than having git filter-branch try to interpret it).
Instead of testing the commit's author for a specific name, we want to test the commit's ID to see if it's either the ID of commit B, which we want to skip, or the ID of commit C, which we also want to skip.
We (probably) want to be sure to update any tags that used to point to old commits, so that they point to the new copies instead. This means we need a --tag-name-filter, and the filter we want for tag names is just cat (i.e., pass the original tag name through unchanged).

Since I don't have the raw commit IDs for B and C I cannot show them here, but in the end it works out to:

git filter-branch --commit-filter '
    if [ "$GIT_COMMIT" = id-of-B ] || [ "$GIT_COMMIT" = id-of-C ];
    then
            skip_commit "$@";
    else
            git commit-tree "$@";
    fi' --tag-name-filter cat -- --all
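To make the commit-filter approach concrete, here is a minimal sketch against a throwaway repository. The file name, messages, identity, the tag v1, and the exported variables B_ID and C_ID are all invented for the demo; in a real repository you would substitute the actual IDs of B and C.

```shell
#!/bin/sh
# Toy repository only: file name, messages, identity, and tag are assumptions.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning pause in newer Git
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "A"
git rm -q file.txt                 # B: delete every file
git commit -q -m "B"
B_ID=$(git rev-parse HEAD)
echo "one" > file.txt              # C: add every file back
git add file.txt
git commit -q -m "C"
C_ID=$(git rev-parse HEAD)
echo "two" >> file.txt
git commit -q -am "D"
echo "three" >> file.txt
git commit -q -am "E"
git tag v1                         # a tag that should follow E to its copy

# Exported so the filter, which runs in a child shell, can read them.
export B_ID C_ID
git filter-branch --commit-filter '
    if [ "$GIT_COMMIT" = "$B_ID" ] || [ "$GIT_COMMIT" = "$C_ID" ];
    then
            skip_commit "$@";
    else
            git commit-tree "$@";
    fi' --tag-name-filter cat -- --all >/dev/null 2>&1

subjects=$(git log --reverse --format=%s | tr '\n' ' ')
echo "history: $subjects"
```

After the run, the branch and the tag both point at the rewritten copy of E, and B and C no longer appear in the log. filter-branch keeps the old tips under refs/original/, so the originals are not lost immediately.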

Using the --parent-filter is a bit simpler. Again, straight from the documentation, we have:

git filter-branch --parent-filter \
    'test $GIT_COMMIT = <commit-id> && echo "-p <graft-id>" || cat' HEAD

or even simpler:

echo "$commit-id $graft-id" >> .git/info/grafts
git filter-branch $graft-id..HEAD

Here $commit-id effectively stands in for commit D, and $graft-id stands in for commit A.

Again, in our case, we don't really want HEAD, we want -- --all. We probably also want --tag-name-filter cat. We can use a negative reference such as ^<id-of-C> to skip copying commit-C-and-earlier (this is what the left side of $graft-id..HEAD does). Since copying commits is slow, this will skip the thousands of commits that come before C, and hence speed up the filtering quite a bit.
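A comparable sketch for the parent-filter variant, again on a throwaway repository with invented names. The exported variables A_ID and D_ID are hypothetical stand-ins for the real commit IDs. Since the demo has a single branch and no tags, it passes a plain range (C..HEAD) rather than -- --all, which keeps C and everything before it out of the rewrite.

```shell
#!/bin/sh
# Toy repository only: file name, messages, and identity are assumptions.
set -e
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning pause in newer Git
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "A"
A_ID=$(git rev-parse HEAD)
git rm -q file.txt                 # B: delete every file
git commit -q -m "B"
echo "one" > file.txt              # C: add every file back
git add file.txt
git commit -q -m "C"
C_ID=$(git rev-parse HEAD)
echo "two" >> file.txt
git commit -q -am "D"
D_ID=$(git rev-parse HEAD)
echo "three" >> file.txt
git commit -q -am "E"
tree_before=$(git rev-parse "HEAD^{tree}")

# Exported so the filter, which runs in a child shell, can read them.
export A_ID D_ID
# Give D the parent A; every other commit keeps its (mapped) parents via cat.
git filter-branch --parent-filter '
    test "$GIT_COMMIT" = "$D_ID" && echo "-p $A_ID" || cat' \
    -- "$C_ID"..HEAD >/dev/null 2>&1

tree_after=$(git rev-parse "HEAD^{tree}")
subjects=$(git log --reverse --format=%s | tr '\n' ' ')
echo "history: $subjects"
```

Only D and E are visited here, which is the point of the negative reference: in the real repository the thousands of commits before C would be left alone.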

Note that grafts are not very stable: they were replaced with git replace, which is considerably more robust. If you do use a graft like this, you should almost certainly immediately run git filter-branch to make the graft permanent. (You can also use git replace and then run git filter-branch to make the replacement permanent.)

¹When you skip a commit, it writes into the map file an entry that tells it that the old commit ID is gone. More precisely, it maps the old ID to the "closest ancestor new ID". See the remap to ancestor section of the filter-branch documentation. In this case, presumably you have no references pointing to commits B or C—the two that you will skip—so this is just an interesting theoretical note. If you did have a reference pointing to either B or C, though, the effect of stripping them out is to rewrite filtered positive references to point to A. (Note that they must be mentioned in the filter-branch reference expressions, usually via --all.)

Consequences of copying commits

The drawback to any of the methods that really does permanently remove commits B and C is that it renumbers (re-hashes) every commit after the removed ones. All of Git's distributed-repository magic works through these hashes. This means that once you have rewritten history, every user with a clone or fork must take action on their part to adapt to the new, rewritten history. (Typically this means "save current work / repo to one side, re-clone, then cherry-pick or otherwise re-extract current work into new clone.")

It does not matter how you get the altered history, whether that's git rebase or git filter-branch or using something like BFG. "Changing the past" renumbers these IDs, which are cryptographic hashes chained together Merkle-tree style. Everyone else who is using those now must adapt.

When using git replace, what happens is that we leave B and C in place, and tell Git that when looking at commit D, it should instead look at some altered copy D'. The altered version D' is just a copy of D with its parent set to A, which means that as long as Git does slide its little gitty eyes over to D' instead, it will "see" the chain going from commit E to D' and then to A, and not "see" the original "D leads to C leads to B leads to A" sequence.

The replacement method also requires all clients to deliberately accept this new replacement D' (they won't just see it automatically), so as with filter-branch it's not perfect. It is, however, much less disruptive: clients can start (or stop!) replacing at any time with no effect on what they are doing now. Only clients who are viewing the replacement will see the altered history.
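For comparison, here is a minimal sketch of the replacement approach on a throwaway repository (file name, messages, and identity are assumptions). It uses git replace --graft to create D' with A as its parent, then shows the history with and without replacements honored.

```shell
#!/bin/sh
# Toy repository only: file name, messages, and identity are assumptions.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

echo "one" > file.txt
git add file.txt
git commit -q -m "A"
a=$(git rev-parse HEAD)
git rm -q file.txt                 # B: delete every file
git commit -q -m "B"
echo "one" > file.txt              # C: add every file back
git add file.txt
git commit -q -m "C"
echo "two" >> file.txt
git commit -q -am "D"
d=$(git rev-parse HEAD)
echo "three" >> file.txt
git commit -q -am "E"
head_before=$(git rev-parse HEAD)

# Record a replacement for D whose only parent is A (stored under refs/replace/).
git replace --graft "$d" "$a"

head_after=$(git rev-parse HEAD)
with_replace=$(git log --reverse --format=%s | tr '\n' ' ')
without_replace=$(git --no-replace-objects log --reverse --format=%s | tr '\n' ' ')
echo "with replacement:    $with_replace"
echo "without replacement: $without_replace"
```

The ID of HEAD does not change, so nothing downstream is renumbered; B and C are still in the repository and reappear as soon as replacements are ignored. Other clones only see the shortened history if they fetch the refs under refs/replace/ themselves.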