
I recently cloned an SVN repository that used to contain a few binaries which are no longer needed. Unfortunately, I have already pushed it to Github with the binaries included. I now want to remove them using 'git filter-branch', but I am running into problems with tags and branches.

Basically, I have created a simple shell script to remove a list of files which have been determined by the following command:

git rev-list --objects --all | grep '\.jar$' > files.txt

The script for removal looks like the following:

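#!/bin/sh
# $1 is the list written by the rev-list command above: one "<hash> <path>" per line.
while read -r file_hash file_to_remove
do
    echo "Removing $file_to_remove"
    git filter-branch --index-filter "git rm --cached --ignore-unmatch '$file_to_remove'"
    # Drop the backup refs and reflog entries, then repack and prune.
    rm -rf .git/refs/original/
    git reflog expire --all --expire-unreachable=0
    git repack -A -d
    git prune
done < "$1"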

I have a few tags (all listed in .git/packed-refs) and one remote under .git/refs/remotes/origin (pointing to the Github repo). Removing the files with the above script does not have the desired effect ('du -cm' still reports the same size and 'git rev-list' still lists the files) until I manually remove all references from .git/packed-refs and delete the .git/refs/remotes/origin directory.

Naturally, I am losing all tags as well as the possibility to push my local changes back to Github with this approach. Is there anything I have missed or is there an alternative way for removing files from all branches/tags without destroying my history?

Many thanks in advance, Matthes

Roberto Tyley
matthes

1 Answer


I ended up using the BFG Repo Cleaner on a bare mirror clone of the repository (git clone --mirror repo-url). It goes through every branch and tag, leaves each of them working, and is much faster than filter-branch. I hope this helps other people with similar issues.
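For comparison, the filter-branch attempt in the question most likely left the tags and other branches untouched because, without a revision argument, filter-branch rewrites only the current branch, so the remaining refs keep the old objects alive. The following is a minimal sketch of a single pass over all refs, not a tested replacement for the BFG run; the function name purge_from_all_refs and the '*.jar' pathspec are illustrative assumptions, and the pattern must not contain a single quote.

```shell
#!/bin/sh
# Sketch: remove every path matching a pathspec from all branches and tags
# in one filter-branch pass. Run it inside the repository to be rewritten.
# purge_from_all_refs is a hypothetical helper name, not a git command.
purge_from_all_refs() {
    pattern=$1   # a git pathspec, e.g. '*.jar'

    # "-- --all" rewrites every ref instead of only the current branch;
    # "--tag-name-filter cat" re-points each tag at its rewritten commit.
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --force \
        --index-filter "git rm --cached --ignore-unmatch --quiet -- '$pattern'" \
        --tag-name-filter cat \
        -- --all

    # Delete the refs/original/ backups and the reflog entries that still
    # reference the old commits, then prune the unreachable objects.
    git for-each-ref --format='%(refname)' refs/original/ |
        while read -r ref; do git update-ref -d "$ref"; done
    git reflog expire --expire=now --all
    git gc --prune=now --quiet
}
```

Called as purge_from_all_refs '*.jar' from a clean working tree, this should leave the tag and branch names in place while pointing them at rewritten commits. The result still has to be force-pushed, since every affected commit id changes.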

Here is my wrapper script:

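#!/bin/bash
# usage: ./remove_files.sh file_list.txt bare-repo-dir
# Each line of file_list.txt is "<hash> <path>", as written by git rev-list --objects.
while read -r file_hash file_to_remove
do
    echo "Removing $file_to_remove"
    # --delete-files is given a file name, so keep only the last path component.
    lastFile=$(basename "$file_to_remove")
    java -jar bfg.jar --delete-files "$lastFile" "$2"
done < "$1"

# Prune the now-unreferenced objects; the subshell leaves the caller's directory unchanged.
(cd "$2" && git gc --prune=now --aggressive)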
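As the comments suggest, one big BFG run is preferable to one run per file. The sketch below collapses the file list into a single brace glob so the tool only has to be started once. The helper name build_glob is hypothetical, the list is assumed to be in the "<hash> <path>" format produced by git rev-list --objects, and it is an assumption that --delete-files accepts full file names inside braces the same way it accepts '*.{xml,exe}'. File names containing commas or braces are not handled.

```shell
#!/bin/bash
# Sketch: turn the "<hash> <path>" list into one glob such as {a.jar,b.jar}.
# build_glob is a hypothetical helper, not part of the BFG.
build_glob() {
    local names
    # Strip the hash, strip the directory part, de-duplicate, join with commas.
    names=$(sed -e 's/^[^ ]* //' -e 's|.*/||' "$1" | sort -u | paste -sd, -)
    printf '{%s}\n' "$names"
}

# Intended use (not executed here):
#   java -jar bfg.jar --delete-files "$(build_glob files.txt)" bare-repo-dir
```

If every file to remove shares an extension, a plain pattern such as '*.jar' is simpler and avoids building the list at all.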
matthes
  • Very glad you like the tool @matthes! Out of interest, how many different files did you need to remove? The "--delete-files" switch accepts glob expressions, and in general it's better to do just one big run of The BFG. For instance: '--delete-files *.{xml,exe}' – Roberto Tyley Apr 15 '13 at 14:55
  • @Roberto: Good hint. Indeed, in the end I only removed (a huge list of) .jar files from the repo, so I guess a single run with "--delete-files *.jar" would have been even faster (and safer as well?) – matthes Apr 16 '13 at 07:30
  • Yup, "--delete-files *.jar" would do the trick! (or alternatively something like "--strip-blobs-bigger-than 512K"). The BFG also updates all the commit ids it finds in your commit messages, so it's nice to do that only once. Whichever approach you take, the BFG makes sure it doesn't delete anything in your latest commit, so any jars you're still using won't be removed. – Roberto Tyley Apr 16 '13 at 08:08