TL;DR: There is a phrase in our git repository that must be removed from history, not just the heads of branches. What other ways are there besides removing it from the head of develop and making a new repository? We want to maintain as much history as possible.

Background

For icky legal reasons, my team and I have to remove all instances of a word from our code base (let's call it Voldemort just for fun and relevance). The annoying thing is that we don't just have to remove Voldemort from the tips of the branches, we have to remove it from each commit in our repositories (the lawsuit is something along the lines of "no developer should be reasonably able to revert to a state where Voldemort was in the code").

We're not using Voldemort anymore, but there are a few places in the code where it's still mentioned, such as comments. (Yes, as part of a lawsuit we have to remove infringing comments from our code.)

The original plan was to purge the word that must not be mentioned, then make a new repository and push the current state as the initial commit. We don't want to lose all our history [1] though! So we want to know if there's a way to avoid that.

So, the question is: how do we remove Voldemort, the word which must not be mentioned, from the history while keeping as much of the history [1] as possible? Also, how can we check our work to make sure it's gone from every commit?

[1]: By history I don't mean the specific commits; I just mean being able to look at the history of a file and know who did what. It's okay with me if the history is gone in the "rewriting history" sense in git; I'm guessing that's the only approach anyway.

Information on the state of the repo

  • Currently develop branch is Voldemort-free, but we have "meaningful" commits before and after the purging commits
  • Probably only the initial commit has anything adding lines with Voldemort (because we migrated from SVN to git and Voldemort was added ages ago)
  • Probably the only commits modifying any files with Voldemort are the ones that removed it (like I said, it's pretty old stuff)

Guesses for an approach

Seems like we'd want to do something like git log --patch | grep 'Voldemort' to find the commits that add Voldemort, then do an interactive rebase of everything, editing the commits where Voldemort was added so that they add something else or nothing at all.
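For the "find the commits" half of that guess, git's pickaxe search (git log -S) may be easier than grepping patch output: it lists the commits whose changes alter the number of occurrences of a string. Below is a minimal sketch on a throwaway repository; the file names and commit messages are made up for illustration.

```shell
#!/bin/bash
# Throwaway repository, purely for illustration (hypothetical content).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email "demo@example.com"
git config user.name "Demo"

echo "uses Voldemort" > a.txt
git add a.txt
git commit -q -m "initial import"

echo "clean" > a.txt
git commit -q -am "purge the word"

echo "more" > b.txt
git add b.txt
git commit -q -m "unrelated work"

# Subjects of the commits, on any ref, whose diff changes how many times
# the string occurs: here the commit that added it and the one that removed it.
touching=$(git log --all --format=%s -S'Voldemort')
echo "$touching"
```

In a real repository you would run only the git log line. Note that -S matches the exact string, so other spellings or capitalizations need their own search.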

Captain Man
  • The command you need here is in fact `git filter-branch` (see existing answer). It's a bit tricky to use because it has so many options and modes; it has so many options and modes because it's so very slow in its most fundamental mode. What filter-branch does, in a most basic sense, is simply to *copy* every "to be filtered" commit to a *new* commit, which may be the same as the original or may be different; build a map from old ID to new ID; and then rewrite references to use the new IDs. – torek Jun 30 '16 at 19:28

3 Answers

Use the BFG Repo Cleaner, which is both faster and easier to use than git filter-branch.

To replace all occurrences of Voldemort, in all files, with the text ***REMOVED*** (BFG's default replacement), you can simply:

% echo 'Voldemort' > badwords.txt
% bfg --replace-text badwords.txt myrepo.git
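If you want a case-insensitive match or a different replacement text, the expression file can say so. The sketch below goes by BFG's documented expression syntax as I understand it (a regex: prefix for regular expressions and ==> to introduce a custom replacement), so check it against bfg --help before relying on it. It assumes a bfg launcher on your PATH and a bare or mirror clone named myrepo.git, as in the command above; the cleanup at the end is what BFG's own instructions suggest for dropping the old objects.

```shell
#!/bin/bash
# Expression file for --replace-text: one expression per line.
# "(?i)" is the Java regex flag for case-insensitive matching (assumption:
# BFG hands the pattern to Java's regex engine unchanged).
cat > badwords.txt <<'EOF'
regex:(?i)voldemort==><word removed due to lawsuit>
EOF

# Only attempt the rewrite if both assumptions hold: a `bfg` command
# exists and myrepo.git is present in the current directory.
if command -v bfg >/dev/null 2>&1 && [ -d myrepo.git ]; then
    bfg --replace-text badwords.txt myrepo.git
    # Expire reflogs and garbage-collect so the old objects are dropped.
    (cd myrepo.git &&
        git reflog expire --expire=now --all &&
        git gc --prune=now --aggressive)
fi
```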
Edward Thomson
  • In hindsight I should have done the `bfg` cleaner in place of the `tree-filter` at least (even with my speed up it still took over an hour I think). Can `bfg` also modify messages and move tags? (My team has to remove a second set of bad words and this approach seems a lot simpler, I don't know why I was so hesitant to try it before.) – Captain Man Nov 08 '16 at 20:08
  • I can't stress enough how much easier BFG Repo Cleaner makes these kinds of issues than any other tool I've used. Also remember that your remote host may not purge history (GitHub and Azure DevOps don't, for example), so you may need to work with their respective support departments to get the history expunged. – jessehouwing Apr 15 '21 at 14:52

Check out git filter-branch here.

Captain Man
Ewan Mellor

I thank Ewan Mellor for pointing me in the right direction, but the answer is very small and I think this needs more detail.


Reminder

If you do a fresh clone of the repo before doing this, make sure you have local branches for all the remote branches (e.g., git checkout master; git checkout develop; git checkout feature/some-undone-feature etc.).


What we did

> git filter-branch --tree-filter "~/purge.sh" \
                    --msg-filter "sed -e 's/voldemort/<word removed due to lawsuit>/gI'" \
                    --tag-name-filter "cat" \
                    -- --all

The purge script (probably could be one line, but it's cleaner like this):

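#!/bin/bash

# Split the grep output on newlines only, so file names with spaces survive.
IFS=$'\n'

# Names of the files (searched recursively, ignoring case) that contain the word.
files=$(grep -rli 'voldemort' .)

for file in ${files}; do
    # Replace every occurrence in place, ignoring case (the I flag is GNU sed).
    sed -i -e 's/voldemort/<word removed due to lawsuit>/gI' "${file}"
done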
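For what it's worth, the one-line version hinted at above could look something like the sketch below. It assumes GNU grep, xargs and sed (-Z, -0 and -r are GNU options), and purge_tree is a hypothetical helper name used only so the pipeline can be tried on a scratch directory. NUL-delimited file names mean that spaces in paths are not a problem.

```shell
#!/bin/bash
# purge_tree DIR: hypothetical helper wrapping the one-line pipeline.
# grep -Z ends each file name with a NUL byte; xargs -0 reads them back,
# and -r skips running sed entirely when nothing matched.
purge_tree() {
    grep -rliZ 'voldemort' "${1:-.}" |
        xargs -0 -r sed -i -e 's/voldemort/<word removed due to lawsuit>/gI'
}

# Scratch directory for illustration only.
demo=$(mktemp -d)
echo "Voldemort was here" > "$demo/file with space.txt"
echo "nothing to see" > "$demo/clean.txt"

purge_tree "$demo"
cat "$demo/file with space.txt"
```

As a tree filter, the script body would just be the two-line pipeline run against the current directory.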

Next steps

Now that you're done, you will want to check these questions:

  1. Remove refs/original/heads/master from git repo after filter-branch --tree-filter? : This will show you how to remove the back up that git filter-branch makes.
  2. Listing and deleting Git commits that are under no branch (dangling?) : This will make sure you have no bad words lying around in your local repo. This is needed in our case because if the bad word is on our laptops the company may get sued and/or may perform a remote wipe if they find Voldemort software. You may want to run this on your remote repo too; if you cannot, consider making a new one (with a slightly different name or URL, so that no one pushes the old history to it or merges it in by mistake, undoing all your hard work!).
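To check the result, one option is to search every commit reachable from any ref, both file contents and commit messages. The sketch below does that with a hypothetical helper, word_in_history. Two caveats: git rev-list --all also walks the refs/original backups, so it will keep reporting the word until those are removed, and it does not look at unreachable objects, which only disappear once they are pruned. Checking one commit at a time is slow on a big repository but avoids overly long command lines.

```shell
#!/bin/bash
# word_in_history WORD: hypothetical helper; run it inside the repository.
# Exit status 0 if WORD (case-insensitive) still appears in any file of any
# commit reachable from any ref, or in any such commit's message.
word_in_history() {
    local word=$1 rev
    for rev in $(git rev-list --all); do
        # Search the tree of this one commit; -q stops at the first hit.
        if git grep -q -i -e "$word" "$rev"; then
            return 0
        fi
    done
    # Then the commit messages.
    [ -n "$(git log --all -i --grep="$word" --format=%H)" ]
}

if word_in_history 'voldemort'; then
    echo "still present"
else
    echo "clean"
fi
```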

Explanation

  • --tree-filter "~/purge.sh"
    • for each commit, run the ~/purge.sh script against the working tree (--tree-filter ...)
      • make a list of files containing voldemort (grep ... 'voldemort')
      • recursively from here, listing the name (not the content), and without regard to case (-rli)
      • for each file in the list (for file in ${files}; do)
        • replace each instance of the word voldemort with <word removed due to lawsuit> in that file (sed ... -e s/.../.../ ${file})
        • in place with no backup (-i)
  • --msg-filter "sed -e 's/voldemort/<word removed due to lawsuit>/gI'"
    • replace each instance of the word voldemort with <word removed due to lawsuit> (sed -e s/.../.../)
    • even if there are two on a line and without regard to case (/gI)
    • in commit messages (--msg-filter ...)
  • --tag-name-filter "cat"
    • for each tag, give it its old name on the new commit (if this isn't present, tags won't carry over)
  • -- --all
    • do this for every commit in the repository (yes, that is two dashes followed by a space then another two dashes)
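If you'd like to see all the pieces working together before touching the real repository, here is a minimal end-to-end sketch on a throwaway repo. Every file name, commit message and tag in it is made up for illustration, it assumes GNU grep and sed, and FILTER_BRANCH_SQUELCH_WARNING is only there to skip the warning and delay that newer versions of git print before filter-branch starts.

```shell
#!/bin/bash
# Throwaway repository; every name below is made up for illustration.
export FILTER_BRANCH_SQUELCH_WARNING=1
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git config user.email "demo@example.com"
git config user.name "Demo"

echo "powered by Voldemort" > a.txt
git add a.txt
git commit -q -m "import Voldemort code"
git tag v1
echo "clean" > a.txt
git commit -q -am "cleanup"

# Same idea as purge.sh above, kept outside the repository.
cat > "$work/purge.sh" <<'EOF'
#!/bin/bash
files=$(grep -rli 'voldemort' .)
for file in ${files}; do
    sed -i -e 's/voldemort/<word removed due to lawsuit>/gI' "${file}"
done
EOF
chmod +x "$work/purge.sh"

git filter-branch --tree-filter "$work/purge.sh" \
                  --msg-filter "sed -e 's/voldemort/<word removed due to lawsuit>/gI'" \
                  --tag-name-filter "cat" \
                  -- --all

# The tag was carried over to the rewritten commit; inspect what it holds now.
old_file=$(git show v1:a.txt)
old_subject=$(git log -1 --format=%s v1)
echo "$old_file"
echo "$old_subject"
```

Both lines printed at the end should show the replacement text instead of the word. The originals are still reachable through refs/original until that backup is removed.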

Note about performance

You may be wondering why we did not simply run sed -i -e 's/voldemort/<word removed due to lawsuit>/gI' on every file in --tree-filter. The reason is that this is a lot slower, I think because it rewrites each file in each commit even if the word that must not be named is not in the file. Getting a list of problem files with grep -rli 'voldemort' first sped up the process a lot (at least 10x, maybe 100x; I didn't want to wait for the first way to finish). (However, I have reason to believe antivirus software or something else on our laptops made git incredibly slow, so your mileage may vary.)

Captain Man