How to prune a specific part of my repo's history to eliminate bloat

Question

I was trying to remove some sensitive info from some old commits on our company's Git repo using the techniques described on this GitHub help page. Using filter-branch, I was able to modify the repo's history to my liking.

Unfortunately, I made the mistake of doing a pull from origin and doing some further work on the repo. By doing this, I believe I've effectively merged the original 'tainted' repo (A) with my 'fixed' repo (B), since the number of commit objects has doubled from 3000 to 6000.

Now, I could run the filter-branch steps again and force-push to fix up what I have, but the repo is still 'bloated' to double its size.

I know roughly where the merge occurred, but not the precise commit. I would like to be able to identify and prove which commit is the culprit, and then permanently remove commit tree A. I have a few potential ideas about how it could be done...

modifying that specific commit that joins A with B and then running a prune to garbage-collect everything under it
deleting that commit entirely from history and replicating it later, after a prune
rebasing to the last commit on the head of repo B and cherry-picking everything above it except the one where I merged with A (not sure if cherry-picking would pull the whole commit tree back in, though!)

I welcome all suggestions!

score 1 · Answer 1 · answered Oct 13 '17 at 18:55

modifying that specific commit that joins A with B

You literally cannot do this. But you can do something that may be just as good, or sufficiently good: you can make a copy of that commit, but before committing the copy, make it refer only to the B-side parent, not the A-side parent and the old history you wanted to remove.

Once you have copied that commit, though, you must also copy its immediate children. The new copies will be the same as the originals except that they refer to the copy, not the original.

Of course, having copied those children, you must now copy their children. The new copies will refer to the other new copies. This repeats all the way through time until you reach the most recent commits.

Basically, then, what you need to do is run git filter-branch again. The filter this time is: When you reach the specific commit that joins A with B, make a copy that doesn't do that. All other commits get copied "as is". The filter-branch command knows to substitute in the new parents from the first change onward. When copying earlier commits (those in side A, and those in side B that come before this mistake), the "copies" will be bit-for-bit identical to the originals, so filter-branch will wind up re-using the originals.
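The filter needs the hash of the commit that joins A with B. Here is a rough sketch of how you might locate it, run against a throwaway repository so it is safe to try. The branch names side-a and side-b and the commit messages are made up for the demonstration; in a real repository only the last three git commands matter.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway demo repository; every name below is hypothetical.
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/side-b
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git commit -q --allow-empty -m "B: filtered history"

# A separate root commit stands in for the old, unfiltered history (A).
git checkout -q --orphan side-a
git commit -q --allow-empty -m "A: original history"

# The accidental pull: a merge that joins A into B.
git checkout -q side-b
git merge -q --allow-unrelated-histories -m "join A with B" side-a

# List every merge commit reachable from any ref: hash, parents, subject.
git log --all --merges --format='%H %P %s'

# Pick the newest merge and print it followed by its parent hashes.
merge=$(git rev-list --all --merges -n 1)
parents=$(git rev-list --parents -n 1 "$merge")
echo "$parents"
```

The last line prints the merge hash followed by its parents. For a merge made by a pull, the first parent is normally the commit you were on (side B) and the second is what was pulled in (side A), but it is worth checking each parent with git log before trusting that.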

The end result will be as if you had changed that one specific commit, except that it and all its descendants will be new commits. You can then clone this repository to a new clone that doesn't refer at all to the side-A commits, and they will simply not be copied through; or you can, as you suggested, prune them away (but this is surprisingly difficult as Git desperately wants to avoid losing work, i.e., commits). In any case, once that is done, you must convince all users of the repository to abandon their previous clones and switch to this new re-shrunken repository.
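For the pruning route, a minimal sketch of what usually has to go before Git will actually drop the old commits: the backup refs that filter-branch leaves under refs/original/, the reflog entries, and then the unreachable objects themselves. The demo below fakes a backup ref in a throwaway repository (the commit messages are hypothetical) so you can watch an object disappear. In a real repository, remote-tracking refs and tags that still point into the old history will keep it alive too, so treat this as a starting point and not a complete recipe.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway demo repository; names are hypothetical.
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git commit -q --allow-empty -m "kept history"

# Simulate the old history: a commit reachable only through a backup ref
# of the kind filter-branch stores under refs/original/.
old=$(git commit-tree -m "old history" "$(git write-tree)")
git update-ref refs/original/refs/heads/main "$old"

# 1. Delete the backup refs.
git for-each-ref --format='%(refname)' refs/original/ |
while read -r ref; do
    git update-ref -d "$ref"
done

# 2. Expire every reflog entry immediately, so none of them pins old commits.
git reflog expire --expire=now --all

# 3. Repack and delete unreachable objects without the usual grace period.
git gc --quiet --prune=now

# Report what is left in the object store.
git count-objects -v
```

Running git count-objects -v before and after is a cheap way to see whether the repository really shrank.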

The remaining issue at this point is how you convince Git to change the parentage of that one specific commit. There are two easy(ish) ways to do this:

use a "parent filter": see the filter-branch documentation
use git replace to construct the replacement commit first, then use git filter-branch to do the repository copying using the replacement, then discard the replacement since it's now incorporated into the copied commits.

The latter is easier to get right, since if you goof it up you can simply remove the replacement. However, if you understand all of this, the former is not that hard to get right either, for a single commit: just write a shell script fragment of the form:

[ "$GIT_COMMIT" = <hash> ] && echo "-p <B-parent-hash>" || cat

to use as your --parent-filter.
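Put together, the whole run might look roughly like the sketch below, shown on a throwaway repository so nothing real is at risk. MERGE and B_PARENT are hypothetical variable names standing in for the two hashes; they are exported so the filter, which filter-branch evaluates in its own shell, can read them. FILTER_BRANCH_SQUELCH_WARNING only silences the deprecation notice that newer Git versions print.

```shell
#!/usr/bin/env bash
set -euo pipefail
export FILTER_BRANCH_SQUELCH_WARNING=1

# Throwaway demo repository; branch names and messages are hypothetical.
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/main
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git commit -q --allow-empty -m "B: filtered history"
git checkout -q --orphan side-a
git commit -q --allow-empty -m "A: original history"
git checkout -q main
git merge -q --allow-unrelated-histories -m "join A with B" side-a
git commit -q --allow-empty -m "later work"
git branch -q -D side-a

# The commit to rewrite, and the one parent it should keep.
MERGE=$(git rev-list --merges -n 1 HEAD)
B_PARENT=$(git rev-parse "$MERGE^1")
export MERGE B_PARENT

# For the joining commit, emit only the B-side parent; for every other
# commit, pass the parent list through unchanged with cat.
git filter-branch --parent-filter \
    '[ "$GIT_COMMIT" = "$MERGE" ] && echo "-p $B_PARENT" || cat' \
    -- main

# main is now linear: no merge commits, and side A is no longer reachable.
git log --format='%h %p %s' main
```

If you go the git replace route instead, git replace --graft <hash> <B-parent-hash> builds the replacement commit, and git replace -d <hash> removes it again if you got it wrong.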

Thanks for the detailed answer! I shall look into this at work next week and see if I can accept it. — BoffinBrain, Oct 15 '17 at 01:58
I'm sure your answer will be useful to someone out there, although it turns out I didn't need to do anything too complex to solve my problem. I've explained the outcome in a new answer. — BoffinBrain, Oct 19 '17 at 11:46
Ah, yes, your particular case allows you to repeat the original filtering. I should perhaps have noticed, since "remove sensitive data" tends to fall into this special case. — torek, Oct 19 '17 at 14:48

score 0 · Accepted Answer · answered Oct 19 '17 at 11:44

By sheer luck, I believe that simply repeating my original actions has solved the problem, i.e. running filter-branch again on my repo has cleaned up the 'duplicate' commits.

Since my filtering process was simply to remove specific files from every commit, running the same filter again on my modified repo (B) has no effect (B' = B) whereas running it on the commits in repo A results in commits that are identical to B.

Since a commit's hash is calculated from its contents (the tree snapshot, the message and the author/committer metadata) and the hashes of its parents, and because the ancestors in A and B are now effectively identical, I end up with identical commit hashes on both sides of the tree... therefore the duplicates magically disappear! My new repo now contains just over 3000 commit objects, as before.
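If you want to convince yourself of this, here is a small sketch in a throwaway repository. It uses the plumbing command git commit-tree with a pinned timestamp (the date and messages are arbitrary values chosen for the demo) to build the "same" commit twice, and then a third time with a different message.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway demo repository; identity, date and messages are arbitrary.
demo=$(mktemp -d)
cd "$demo"
git init -q
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Pin both timestamps so they cannot differ between the runs below.
export GIT_AUTHOR_DATE='1500000000 +0000'
export GIT_COMMITTER_DATE='1500000000 +0000'

# An empty index gives the empty tree, which is enough for this demo.
tree=$(git write-tree)

# Same tree, same metadata, same (absent) parents: built twice.
first=$(git commit-tree -m "same content" "$tree")
second=$(git commit-tree -m "same content" "$tree")

# Change any one input (here, the message) and the hash changes.
other=$(git commit-tree -m "different message" "$tree")

echo "first:  $first"
echo "second: $second"
echo "other:  $other"
```

The first two hashes come out equal, so Git stores a single object for them; the third differs. The same thing happens when the filtered copies of A turn out byte-for-byte equal to the commits already in B.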
