Git rewrite history, mitigating regression

Question

I want to remove a large number of unit test files from a git repository and wipe them from the commit history in order to save space. I understand that the two main ways of doing so are git filter-branch and the BFG Repo-Cleaner (written by Roberto).

1) Suppose the main repo has been cleaned and a team member has not deleted their old dirty version of the repo. Would the history still get dirty if they did a git pull --rebase, and then pushed to the main repo?

2) As above, suppose the main repo has been cleaned and a team member has not deleted their old dirty version of the repo. Suppose the team member pushes to the main repo. How would I be able to tell that the team member has pushed using their dirty version of the repo? (Would I have to just compare the commit hashes of the parent of that commit? My understanding is that cleaning the main repo whether by BFG or git filter-branch changes all of the hashes of all commits in the repo)

score 0 · Answer 1 · answered Apr 16 '18 at 21:40

1) Don't worry, their pull will fail. Because the histories differ, they will need to reconcile the differences between the remote and their local repository. Deleting the local repository and doing a new clone, as you mentioned, is one way to do this, but it risks losing valuable data, for example in other branches or in .gitignored files.

A better way is to follow the steps for a "Git pull after forced update".

2) Again, don't worry: git will not let them push their copy, which has a different history. They will need to use --force to do this, so they will be very conscious that they are overwriting other people's work.

If they do use --force, git won't stop them (just as it didn't stop you from using --force to change the history in the first place). However, you can configure your server to reject force pushes from certain people. If you are using a hosted service such as GitHub, GitLab, or Bitbucket, there are options to reject force pushes; go to the settings of the repository to configure this.

score 0 · Answer 2 · answered Apr 16 '18 at 21:50

The answer to both questions is yes: mixing the old (pre-cleaning) repository with the new (post-cleaning) repository results in the union of the two repositories. However, for question 2, if someone runs a straight git push without doing a git fetch first (or a pull of any sort, which runs fetch as its first step), they will see a failure, with a complaint from the receiving repository that the push is not a fast-forward. They will have to override this failure using the + or --force flag.

The git pull may or may not fail with a complaint about unrelated histories, depending on which copied commits (see the description below) wound up re-using the original commits. This also depends on specific Git versions, as older Gits would attempt to git merge unrelated histories without requiring the --allow-unrelated-histories option.

(Would I have to just compare the commit hashes of the parent of that commit? My understanding is that cleaning the main repo whether by BFG or git filter-branch changes all of the hashes of all commits in the repo)

This is sort of correct, but wrong in some important details.

Filtering (by any means) is actually the process of copying commits. We take all the original commits, with their parent hash IDs and trees and stored blobs, and copy each commit to a new commit. The new commit will have the filter(s) applied: we remove any blobs we want gone, or make any other changes we desire to the tree and/or to the commit metadata (user names, time stamps, messages, and so on). The first result is another tree, reusing some existing tree hash ID if we made no changes to the tree, or a new tree ID if the new tree is different from every existing tree. We put this old-or-new tree ID in our old-or-new commit metadata using the updated parent hash. The updated parent hash is new if any change has happened to any predecessor commit, or the same if not. Then we make the new commit: if it's 100% identical to the original commit, we get the original hash ID back, otherwise we get a new hash ID.

What this means is that as long as the new copies are 100% bit-for-bit identical to the originals, you are really just re-using the originals. But as soon as some commit is changed, even a tiny bit, the new copy is a different commit, and all of its children now have a different parent hash and are themselves also different commits.
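To make the "copying" idea concrete, here is a minimal sketch in Python of how the hash of one commit depends on its tree and on its parent's hash. It assumes a SHA-1 repository, uses a fixed placeholder author and timestamp, and replaces Git's real tree encoding with a simple stand-in (a hash over sorted file names), so the IDs it produces are not what a real repository would show. The helper names (`git_object_id`, `stand_in_tree`, `commit_id`, `build_chain`) and the file names are made up for illustration.

```python
import hashlib


def git_object_id(kind, body):
    """SHA-1 object ID as Git computes it: '<kind> <size>\\0' followed by the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def stand_in_tree(files):
    # NOT Git's real tree format: just a deterministic ID for a set of file names.
    return git_object_id("tree", "\n".join(sorted(files)).encode())


def commit_id(tree, parents, message):
    # Placeholder author/committer lines, so only tree, parents and message vary.
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [
        "author A U Thor <author@example.com> 0 +0000",
        "committer A U Thor <author@example.com> 0 +0000",
        "",
        message,
    ]
    return git_object_id("commit", ("\n".join(lines) + "\n").encode())


def build_chain(snapshots):
    """Build a linear history c1 <- c2 <- ... and return the commit IDs in order."""
    ids, parents = [], []
    for number, files in enumerate(snapshots, start=1):
        cid = commit_id(stand_in_tree(files), parents, f"c{number}")
        ids.append(cid)
        parents = [cid]  # the next commit records this hash as its parent
    return ids


# Original history: the unwanted file appears in c2 and stays afterwards.
original = build_chain([
    {"README"},
    {"README", "big_test.dat"},
    {"README", "big_test.dat", "main.py"},
])

# Filtered history: same commits with big_test.dat removed from every snapshot.
filtered = build_chain([
    {"README"},
    {"README"},
    {"README", "main.py"},
])

for number, (old, new) in enumerate(zip(original, filtered), start=1):
    status = "re-used" if old == new else "new copy"
    print(f"c{number}: {status}")
```

Note that c3's files differ between the two histories only by the removed file, but even if its tree had been untouched it would still get a new hash, because its `parent` line now names the new c2.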

The end effect is that after filtering a repository with git filter-branch, you generally have a doubled repository, minus whatever amount Git was able to re-use existing commits. The original branch heads are now findable only through the refs/original/ namespace. If you used --tag-name-filter cat, all tags are updated to use the new commits as well, so removing all the refs/original/ references eliminates the original commits.

The BFG avoids all this by rewriting the original references without keeping backups in refs/original/ (and is of course much faster and more convenient than git filter-branch). Nonetheless, it's still effectively copying all the original commits to new ones. Your copied repository is, in effect, a mostly-new repository, which should never be mixed with the old one.

Of course, if someone has some commits they want to bring from their own repository that is based on the old one, that person will have to mix the old and new repositories in some way. It's up to whoever does this mixing to be careful and to be certain not to reintroduce all the old commits.
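One way to be careful here, and to answer the "how would I tell" part of question 2, is to record the hashes of the old (rewritten-away) commits before filtering and then check whether a pushed tip can reach any of them. In a real repository `git merge-base --is-ancestor <old-commit> <pushed-tip>` does this walk for you. The Python sketch below only models the check on a hand-written parent map; the commit names, the `parents` dictionary, and both function names are hypothetical.

```python
def reachable(tip, parents):
    """Return every commit reachable from tip by following parent links (tip included)."""
    seen = set()
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, ()))
    return seen


def reintroduced(pushed_tip, parents, old_only):
    """Old commits that would come back if pushed_tip were accepted."""
    return reachable(pushed_tip, parents) & old_only


# Toy graph: c1 survives the filter unchanged; c2 and c3 were replaced by c2' and c3'.
parents = {
    "c1": [],
    "c2": ["c1"],
    "c3": ["c2"],
    "c2'": ["c1"],
    "c3'": ["c2'"],
    "clean": ["c3'"],         # new work based only on the filtered history
    "mixed": ["c3'", "c3"],   # a merge that joins the old and new histories
}
old_only = {"c2", "c3"}       # recorded before the rewrite

print(sorted(reintroduced("clean", parents, old_only)))
print(sorted(reintroduced("mixed", parents, old_only)))
```

An empty result means the push is built purely on the filtered history; a non-empty one means the old commits are riding along. The same idea could run in a server-side hook, assuming you kept the list of old tips somewhere the hook can read.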

For many users, under most circumstances, it suffices for them to treat the filtered repository as an entirely new project, cloning it anew and discarding their previous repository. Only those with commits to transplant need to understand all of the above.

Thanks for the detailed answer. Let me know if I am correct here: Suppose we have a repo with commits c1 <- c2<- c3 <- .... <-cn. Suppose we use BFG to remove a file which was added in c2. Then in the new filtered repo, c1 has the same commit hash and is identical to the old c1, but c2 through cn all have different hashes. Is this correct? — user55206, Apr 18 '18 at 01:24
@user55206: yes. This in turn implies that someone who rejoins the repositories has a Git that is extra-happy about this because the history in the new (filtered/fixed) repo starts with hash c1, and the history in the original (unfiltered/bad) repo *also* starts with hash c1, so obviously the old c2-onward and new replacement-for-c2-onward join up at c1 and whoever did that must mean to join them up there! :-) — torek, Apr 18 '18 at 14:24
