Consider using git replace
I am not going to put this into the answer since git replace
literally does not remove the two commits, but it's probably the right solution. It lets you pretend they are gone, is transferable to other repositories, and does not renumber every subsequent commit. In any case, existing SO answers cover this, e.g., How do git grafts and replace differ? (Are grafts now deprecated?)
To understand the reason you might choose git replace
, and why rebase is probably wrong, read on.
Not rebase but filter-branch
While git rebase
is fine for smaller cases, but in your case the "D > [thousands of commits] > HEAD" part is problematic.
The reason is that rebase, by default, drops merge commits entirely and replays the remaining commits as a linear sequence. Presumably there are branch-and-merge sequences in the "thousands of commits" section.
Rebase has a -p
or --preserve-merges
flag, but this does not, in a strict sense, preserve the merges. Instead, it re-performs them. (Newer Git versions replace this flag with --rebase-merges, which likewise re-creates the merges rather than keeping the originals.) Because a rebase copies commits by re-applying their changes, re-performing merges is necessary in the general case, but your problem is more specific and does not need it. Attempting to re-perform the merges is likely to be disastrous.
What this means is that you probably don't want rebase after all. You may want git filter-branch
.
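To see the merge-stripping behavior concretely, here is a minimal sketch on a throwaway repository. The commit names (A, B, C, T, D, M), the topic branch, and the commit helper are all hypothetical, and the sketch assumes Git's default rebase settings (in particular, no rebase.rebaseMerges configuration).

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"
main=$(git symbolic-ref --short HEAD)

# Hypothetical helper: each commit adds its own file, so replayed patches never conflict.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit A; a=$(git rev-parse HEAD)
commit B
commit C; c=$(git rev-parse HEAD)

# Build a small branch-and-merge sequence after C.
git checkout -q -b topic
commit T
git checkout -q "$main"
commit D
git merge -q --no-ff -m "M" topic

merges_before=$(git rev-list --count --merges HEAD)

# Drop B and C by replaying everything after C on top of A.
git rebase -q --onto "$a" "$c"

merges_after=$(git rev-list --count --merges HEAD)
echo "merge commits before: $merges_before, after: $merges_after"
```

Note that rebase replays patches, so in this sketch the files introduced by B and C also disappear from the result, which is a different outcome from keeping each later commit's tree intact.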
You can't quite get what you want
Any operation that results in the removal of the two "bad" commits B
and C
is going to mean that Git will have to copy original commit D
to an altered commit D'
. The new commit, D'
, will store the same source tree as commit D
. It can (and should) have the same author and committer and their time stamps. It can and should have the same commit message, as well. But it will have, as its parent commit, commit A
instead of commit C
.
This means that the new commit D'
will have at least one thing changed, compared to the original D
. It will therefore have a different SHA-1 ID.
Now, in the "[thousands of commits]" section, let's say the next commit is E
. You'd like to preserve E
as much as possible—but commit E
lists commit D
as its parent, by D
's SHA-1 ID.
We had to copy D
to D'
so that we could change D
's parent. The new copy, D'
, has a different SHA-1 ID. This means we are now forced to copy E
to E'
. E'
is exactly like E
was, except that as its parent, it lists commit D'
.
Copying E
to E'
forces us to copy whichever commit comes after E
(such as F
) to F'
. That forces us to copy its subsequent commits, which continues on all the way down to the tips of every branch that eventually works its way back to D
(now D'
).
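The forced copy can be reproduced by hand with git commit-tree. The following is only a sketch on a throwaway repository with hypothetical commits A through D; it builds a D' that reuses D's tree, message, author, committer, and timestamps, and changes nothing but the parent.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical linear history A <- B <- C <- D.
for name in A B C D; do
  echo "$name" > file.txt
  git add file.txt
  git commit -q -m "$name"
done

a=$(git rev-parse HEAD~3)
d=$(git rev-parse HEAD)
tree=$(git rev-parse "$d^{tree}")

# Reuse D's author and committer identities and timestamps.
export GIT_AUTHOR_NAME="$(git log -1 --format=%an "$d")"
export GIT_AUTHOR_EMAIL="$(git log -1 --format=%ae "$d")"
export GIT_AUTHOR_DATE="$(git log -1 --format=%ad --date=raw "$d")"
export GIT_COMMITTER_NAME="$(git log -1 --format=%cn "$d")"
export GIT_COMMITTER_EMAIL="$(git log -1 --format=%ce "$d")"
export GIT_COMMITTER_DATE="$(git log -1 --format=%cd --date=raw "$d")"

# D' = same tree, same message, same identities, but parent A instead of C.
d_prime=$(git log -1 --format=%B "$d" | git commit-tree "$tree" -p "$a")

echo "D : $d"
echo "D': $d_prime"
```

The two printed IDs differ even though the stored source tree is the same, which is why every descendant of D must be copied as well.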
git filter-branch
This is what git filter-branch
does: for every commit you tell it to examine, it extracts that commit, applies each of your filters, then makes a new commit that is as exact a copy as possible (but no more exact than that). If you manage, through your filters, to make a bit-for-bit identical commit—e.g., if you apply your filters to commit A
in the "A comes before B" part of the chain, and wind up not changing anything about A
—then the new commit has the same ID as the original commit, i.e., is the original commit. Otherwise—if any data in the commit have changed, whether that's a parent ID, or a tree ID, or a single bit in the commit message—you get a new, different commit: an A'
, as it were.
While filter-branch
is making all of these copies, it writes a map file that, for each copied commit, says "old commit ID X
becomes new commit ID X'
". And, filter-branch
allows you to skip a commit on purpose.1
Hence, what you want to do here is git filter-branch
with a --commit-filter
or --parent-filter
.
If using a commit filter, you will simply skip commits B
and C
. We could start with this example straight out of the documentation:
To remove commits authored by "Darl McBribe" from the history:
git filter-branch --commit-filter '
    if [ "$GIT_AUTHOR_NAME" = "Darl McBribe" ];
    then
        skip_commit "$@";
    else
        git commit-tree "$@";
    fi' HEAD
We would then need at least two changes, and probably a third:
Instead of filtering HEAD
, we want to filter --all
(spelled -- --all
since we need to pass --all
to git rev-list
rather than having git filter-branch
try to interpret it).
Instead of testing the commit's author for a specific name, we want to test the commit's ID to see if it's either the ID of commit B
, which we want to skip, or the ID of commit C
, which we also want to skip.
We (probably) want to be sure to update any tags that used to point to old commits, so that they point to the new copies instead. This means we need a --tag-name-filter
, and the filter we want for tag names is just cat
(i.e., pass the original tag name through unchanged).
Since I don't have the raw commit IDs for B
and C
I cannot show them here, but in the end it works out to:
git filter-branch --tag-name-filter cat --commit-filter '
    if [ "$GIT_COMMIT" = id-of-B ] || [ "$GIT_COMMIT" = id-of-C ];
    then
        skip_commit "$@";
    else
        git commit-tree "$@";
    fi' -- --all
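As a rough illustration of the commit-filter approach, the sketch below runs it against a throwaway repository with hypothetical commits A through E. The SKIP_B and SKIP_C variable names are assumptions made for this example; they are exported because the filter runs in a child shell. FILTER_BRANCH_SQUELCH_WARNING only silences the deprecation notice that newer Git versions print.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical linear history A <- B <- C <- D <- E.
for name in A B C D E; do
  echo "$name" > file.txt
  git add file.txt
  git commit -q -m "$name"
done

old_head=$(git rev-parse HEAD)

# Export the IDs to skip so the filter's child shell can see them.
export SKIP_B="$(git rev-parse HEAD~3)"
export SKIP_C="$(git rev-parse HEAD~2)"

FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --tag-name-filter cat --commit-filter '
  if [ "$GIT_COMMIT" = "$SKIP_B" ] || [ "$GIT_COMMIT" = "$SKIP_C" ]; then
    skip_commit "$@"
  else
    git commit-tree "$@"
  fi' -- --all >/dev/null

# List the subjects of the rewritten history, newest first.
subjects=$(git log --format=%s | tr '\n' ' ')
echo "history after filtering: $subjects"
```

Each surviving commit keeps its original tree, so the file contents at the rewritten tip match the contents at the old tip; only the chain of parents changes.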
Using the --parent-filter
is a bit simpler. Again, straight from the documentation, we have:
git filter-branch --parent-filter \
'test $GIT_COMMIT = <commit-id> && echo "-p <graft-id>" || cat' HEAD
or even simpler:
echo "$commit-id $graft-id" >> .git/info/grafts
git filter-branch $graft-id..HEAD
Here $commit-id
effectively stands in for commit D
, and $graft-id
stands in for commit A
.
Again, in our case, we don't really want HEAD
, we want -- --all
. We probably also want --tag-name-filter cat
. We can use a negative reference such as ^<id-of-C>
to skip copying commit-C
-and-earlier (this is what the left side of $graft-id..HEAD
does). Since copying commits is slow, this will skip the thousands of commits that come before C
, and hence speed up the filtering quite a bit.
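The parent-filter variant can be sketched the same way on a throwaway repository with hypothetical commits A through E. The OLD_D and NEW_PARENT variable names are assumptions for this example; they stand in for the IDs of commit D and commit A. For simplicity the sketch filters everything with -- --all and omits the negative reference.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical linear history A <- B <- C <- D <- E.
for name in A B C D E; do
  echo "$name" > file.txt
  git add file.txt
  git commit -q -m "$name"
done

# Export the IDs so the filter can read them from the environment.
export NEW_PARENT="$(git rev-parse HEAD~4)"   # commit A
export OLD_D="$(git rev-parse HEAD~1)"        # commit D

# Rewrite only D's parent list; every other commit keeps its (mapped) parents.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --tag-name-filter cat --parent-filter '
  if [ "$GIT_COMMIT" = "$OLD_D" ]; then
    echo "-p $NEW_PARENT"
  else
    cat
  fi' -- --all >/dev/null

remaining=$(git rev-list --count HEAD)
echo "commits reachable from HEAD after filtering: $remaining"
```

Commit A is bit-for-bit unchanged by the filter, so it keeps its original ID, and the rewritten D' points directly at it.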
Note that grafts are not very stable: they were replaced with git replace
, which is considerably more robust. If you do use a graft like this, you should almost certainly immediately run git filter-branch
to make the graft permanent. (You can also create the replacement with git replace
and then run git filter-branch
to make the replacement permanent.)
1. When you skip a commit, filter-branch writes an entry into the map file recording that the old commit ID is gone. More precisely, it maps the old ID to the "closest ancestor new ID". See the remap to ancestor section of the filter-branch
documentation. In this case, presumably you have no references pointing to commits B
or C
—the two that you will skip—so this is just an interesting theoretical note. If you did have a reference pointing to either B
or C
, though, the effect of stripping them out is to rewrite filtered positive references to point to A
. (Note that they must be mentioned in the filter-branch
reference expressions, usually via --all
.)
Consequences of copying commits
The drawback to any of the methods that really does permanently remove commits B
and C
is that it renumbers (re-hashes) every commit after the removed ones. All of Git's distributed-repository magic works through these hashes. This means that once you have rewritten history, every user with a clone or fork must take action on their part to adapt to the new, rewritten history. (Typically this means "save current work / repo to one side, re-clone, then cherry-pick or otherwise re-extract current work into new clone.")
It does not matter how you get the altered history, whether that's git rebase
or git filter-branch
or using something like BFG. "Changing the past" changes these IDs, which are cryptographic hashes forming a Merkle tree. Everyone else who is using the old IDs must then adapt.
When using git replace
, what happens is that we leave B
and C
in place, and tell Git that when looking at commit D
, it should instead look at some altered copy D'
. The altered version D'
is just a copy of D
with its parent set to A
, which means that as long as Git does slide its little gitty eyes over to D'
instead, it will "see" the chain going from commit E
to D'
and then to A
, and not "see" the original "D
leads to C
leads to B
leads to A
" sequence.
The replacement method also requires all clients to deliberately accept this new replacement D'
(they won't just see it automatically), so as with filter-branch
it's not perfect. It is, however, much less disruptive: clients can start (or stop!) replacing at any time with no effect on what they are doing now. Only clients who are viewing the replacement will see the altered history.
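For comparison, here is a minimal sketch of the replacement approach on a throwaway repository with hypothetical commits A through E. It uses git replace --graft to create the altered D' and then counts commits with and without replacements being honored.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

# Hypothetical linear history A <- B <- C <- D <- E.
for name in A B C D E; do
  echo "$name" > file.txt
  git add file.txt
  git commit -q -m "$name"
done

a=$(git rev-parse HEAD~4)
d=$(git rev-parse HEAD~1)
head_before=$(git rev-parse HEAD)

# Create D' (a copy of D whose parent is A) and register it as D's replacement.
git replace --graft "$d" "$a"

visible=$(git rev-list --count HEAD)                        # replacements honored
actual=$(git --no-replace-objects rev-list --count HEAD)    # replacements ignored
head_after=$(git rev-parse HEAD)

echo "visible commits: $visible, actual commits: $actual"
```

The branch tip keeps its ID, and B and C are still stored in the repository. The replacement itself lives under refs/replace/, which ordinary fetches and pushes do not transfer, so other repositories would need to fetch it explicitly, for instance with a refspec such as 'refs/replace/*:refs/replace/*'.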