
I have a git repo with some very large binaries in it. I no longer need them, and I don't care about being able to check out the files from earlier commits. So, to reduce the repo size, I want to delete the binaries from the history altogether.

After a web search, I concluded that my best (only?) option is to use git-filter-branch:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch big_1.zip big_2.zip etc.zip' HEAD

Does this seem like a good approach so far?

Assuming the answer is yes, I have another problem to contend with. The git manual has this warning:

WARNING! The rewritten history will have different object names for all the objects and will not converge with the original branch. You will not be able to easily push and distribute the rewritten branch on top of the original branch. Please do not use this command if you do not know the full implications, and avoid using it anyway, if a simple single commit would suffice to fix your problem. (See the "RECOVERING FROM UPSTREAM REBASE" section in git-rebase(1) for further information about rewriting published history.)

We have a remote repo on our server. Each developer pushes to and pulls from it. Based on the warning above (and my understanding of how git-filter-branch works), I don't think I'll be able to run git-filter-branch on my local copy and then push the changes.

So, I'm tentatively planning to go through the following steps:

  1. Tell all my developers to commit, push, and stop working for a bit.
  2. Log into the server and run the filter on the central repo.
  3. Have everyone delete their old copies and clone again from the server.

Does this sound right? Is this the best solution?

Roberto Tyley
rlkw1024
  • It occurs to me now that the *easiest* thing to do might be to have your developers each run the identical `git-filter-branch` command. They should end up with histories identical to what you produced without having to re-clone or manually rebase. – Ben Jackson Dec 16 '10 at 17:29
  • @BenJackson the code files would be identical, but the commit objects will have different committer metadata added by the rebase. – Douglas Jan 16 '12 at 11:17
  • @Douglas I don't think that `git filter-branch` alters committer data unless you explicitly ask it to. (`git rebase` does, but not `git filter-branch`, as far as I can see.) – cdhowie Aug 21 '12 at 20:51
  • @cdhowie Actually, I think it does. The commits are rewritten with entirely new commit hashes, so the history you get at the end of the command is a new one. It's not the same commit tree that you had before; it's been rebuilt. – Joseph Oct 29 '12 at 13:29
  • @Joseph Yes, but it does not modify the "committer" field, which is what I was talking about. – cdhowie Oct 29 '12 at 20:04
  • `git filter-branch --index-filter 'git rm --cached --ignore-unmatch *.zip' HEAD` works for me; I don't remember the file names. – wukong May 22 '13 at 21:01
  • Related: [How to remove/delete a large file from commit history in Git repository?](http://stackoverflow.com/q/2100907/123109) – Greg Bacon Jan 13 '15 at 20:58
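For the situation raised in the comments, where the exact file names are no longer remembered, one option is to ask Git for the largest blobs reachable from any ref before writing the filter. This is only a sketch: the throwaway repository, the `demo` directory and the file names are made up for illustration, and the `awk` step assumes paths without spaces.

```shell
#!/bin/sh
# Sketch: list the largest blobs anywhere in history, so you know which
# paths to pass to `git rm --cached --ignore-unmatch`.
set -eu

# Throwaway repository with hypothetical file names, for demonstration only.
work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.email "dev@example.com"
git config user.name "Demo Dev"

head -c 200000 /dev/zero > big_1.zip      # stand-in for a large binary
echo "hello" > README
git add . && git commit -q -m "add files"
git rm -q big_1.zip && git commit -q -m "drop binary from the tip"

# Every object reachable from any ref, with its type, size and path,
# filtered down to blobs and sorted by size (largest first).
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $2, $3 }' |
  sort -rn | head -n 5)

echo "$largest"
```

The binary still shows up at the top of the list even though it was deleted in the latest commit, which is exactly why the repository does not shrink without a history rewrite.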

4 Answers


Yes, your solution will work. You also have another option: instead of doing this on the central repo, run the filter on your clone and then push it back with git push --force --all. This will force the server to accept the new branches from your repository. This replaces step 2 only; the other steps will be the same.
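A rough end-to-end sketch of that alternative, run entirely in a temporary directory: `central.git` stands in for the server, `mine` for your clone, and the file names are hypothetical. `FILTER_BRANCH_SQUELCH_WARNING` only silences the warning and delay that newer Git versions add to filter-branch.

```shell
#!/bin/sh
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning/delay in newer Git

work=$(mktemp -d)
cd "$work"

# Stand-in for the central server: a bare repository.
git init -q --bare central.git

# Your working clone (cloning an empty repo prints a harmless warning).
git clone -q central.git mine 2>/dev/null
cd mine
git config user.email "dev@example.com"
git config user.name "Demo Dev"

head -c 200000 /dev/zero > big_1.zip     # hypothetical large binary
echo "code" > main.c
git add . && git commit -q -m "initial import"
echo "more" >> main.c
git commit -q -am "real work"
branch=$(git symbolic-ref --short HEAD)
git push -q origin "$branch"

# Replacement for step 2: rewrite locally, then overwrite the server's branches.
git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch big_1.zip' HEAD >/dev/null
git push -q --force --all origin

# Check: nothing named big_1.zip is reachable from the server's branches.
leftover=$(git -C ../central.git rev-list --objects --branches |
  grep -cF 'big_1.zip' || true)
echo "paths named big_1.zip still reachable on the server: $leftover"
```

If the real server is configured to refuse non-fast-forward pushes (for example via `receive.denyNonFastForwards`), the forced push will be rejected until that setting is relaxed.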

If your developers are pretty Git-savvy, then they might not have to delete their old copies; for example, they could fetch the new remotes and rebase their topic branches as appropriate.

cdhowie
  • This doesn't consider all cases. If there are tags or other branches, you should add `--tag-name-filter cat` and `-- --all` instead of HEAD to the git filter-branch options. See my answer for more info. – Jason Axelson Jul 16 '13 at 21:45

Your plan is good, though it would be better to perform the filtering on a bare clone of your repository rather than on the central server. In preference to git-filter-branch, however, you should use my BFG Repo-Cleaner, a faster, simpler alternative designed specifically for removing large files from Git repos.

Download the Java jar (requires Java 6 or above) and run this command:

$ java -jar bfg.jar --strip-blobs-bigger-than 1M my-repo.git

Any blob over 1MB in size (that isn't in your latest commit) will be totally removed from your repository's history. You can then use git gc to clean away the dead data:

$ git gc --prune=now --aggressive
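To see why the cleanup step matters, here is a small sketch showing that a rewritten-away blob stays in the object store until old references are gone and `git gc` has pruned it. The BFG documentation pairs `git gc` with `git reflog expire --expire=now --all`, so the sketch does the same. As an assumption made purely so it runs without the BFG jar, the rewrite itself is done with git filter-branch, which also leaves backup refs under `refs/original/` that have to be deleted; the repository and file names are invented.

```shell
#!/bin/sh
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning/delay in newer Git

work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.email "dev@example.com"
git config user.name "Demo Dev"

head -c 200000 /dev/urandom > big_1.zip  # hypothetical large binary
echo "code" > main.c
git add . && git commit -q -m "initial import"
blob=$(git rev-parse HEAD:big_1.zip)     # object name of the binary

# Stand-in rewrite (filter-branch instead of the BFG, see the note above).
git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch big_1.zip' HEAD >/dev/null

# The old history is still held by backup refs and the reflog.
git cat-file -e "$blob" && echo "blob still stored right after the rewrite"

# Drop filter-branch's backup refs, expire the reflog, then prune.
git for-each-ref --format='%(refname)' refs/original/ |
  while read -r ref; do git update-ref -d "$ref"; done
git reflog expire --expire=now --all
git gc -q --prune=now --aggressive

if git cat-file -e "$blob" 2>/dev/null; then gone=no; else gone=yes; fi
echo "blob removed from object store: $gone"
```

Running `git count-objects -vH` before and after the cleanup is a simple way to confirm the size reduction on a real repository.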

The BFG is typically 10-50x faster than running git-filter-branch and the options are tailored around these two common use-cases:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data
artbristol
Roberto Tyley

If you don't make your developers re-clone, it's likely that they will manage to drag the large files back in. For example, suppose they carefully splice their work onto the new history you create and then happen to git merge from a local project branch that was not rebased. The parents of that merge commit will include the project branch, which ultimately points at the entire history you erased with git filter-branch.
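For developers who keep their clones anyway, a quick check before merging or pushing can flag local branches that still carry the old history. This is a sketch under assumptions: the repository, the `stale-topic` branch and `big_1.zip` are hypothetical, and the loop only looks at local branches, not stashes or remote-tracking refs.

```shell
#!/bin/sh
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning/delay in newer Git

work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.email "dev@example.com"
git config user.name "Demo Dev"

head -c 200000 /dev/zero > big_1.zip     # hypothetical large binary
echo "code" > main.c
git add . && git commit -q -m "initial import"
git branch stale-topic                   # project branch that is never rewritten

# Only the current branch is rewritten here, as in the original command.
git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch big_1.zip' HEAD >/dev/null

# Collect every local branch whose history still touches the file.
tainted=""
for ref in $(git for-each-ref --format='%(refname)' refs/heads/); do
  if [ -n "$(git rev-list -n 1 "$ref" -- big_1.zip)" ]; then
    tainted="$tainted $ref"
  fi
done
echo "branches still carrying big_1.zip:$tainted"
```

Any branch listed here would pull the erased history back in if it were merged into the rewritten branch and pushed.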

Ben Jackson
  • So in other words, my plan to have everyone re-clone will avoid a lot of potential gotchas? – rlkw1024 Dec 15 '10 at 22:00
  • For you and the repository. It will be annoying for anyone with a pre-existing collection of project branches and stashes. – Ben Jackson Dec 15 '10 at 23:49

Your solution is not complete. You should include --tag-name-filter cat as an argument to git filter-branch so that the tags that contain the large files are rewritten as well. You should also rewrite all refs instead of just HEAD, since the commits could be on multiple branches.

Here is some better code:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch big_1.zip big_2.zip etc.zip' --tag-name-filter cat -- --all
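As a rough illustration of the difference this makes, the sketch below builds a throwaway repository with a tag and a second branch that both contain the binary, runs the command above, and checks that no branch or tag can reach the file afterwards. The repository layout, the `release` branch, the `v1.0` tag and the file names are assumptions for the demo.

```shell
#!/bin/sh
set -eu
export FILTER_BRANCH_SQUELCH_WARNING=1   # skip the warning/delay in newer Git

work=$(mktemp -d)
git init -q "$work/demo"
cd "$work/demo"
git config user.email "dev@example.com"
git config user.name "Demo Dev"

head -c 200000 /dev/zero > big_1.zip     # hypothetical large binary
echo "code" > main.c
git add . && git commit -q -m "initial import"
git tag v1.0                             # tag on a commit containing the binary
git branch release                       # second branch sharing that commit
git rm -q big_1.zip && git commit -q -m "drop binary from the tip"

# Rewrite every ref, and re-point tags at the rewritten commits.
git filter-branch --index-filter \
  'git rm --cached --ignore-unmatch big_1.zip' \
  --tag-name-filter cat -- --all >/dev/null

# Look at branches and tags only; refs/original/* still holds backups.
remaining=$(git rev-list --objects --branches --tags |
  grep -cF 'big_1.zip' || true)
echo "paths named big_1.zip reachable from branches or tags: $remaining"
```

With plain `HEAD` instead of `-- --all`, the `release` branch and the `v1.0` tag would keep pointing at the original commit, and the binary would stay reachable.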

GitHub has a good guide: https://help.github.com/articles/remove-sensitive-data

Jason Axelson