
I accidentally left a database backup inside the tree, causing my Bitbucket repository to hit its size limit.

Bitbucket says "7.26 GB of 2 GB".

On the web server the entire folder is 6.2G, of which .git is 5.6G, leaving about 600M of actual current files.

I am following the instructions at https://support.atlassian.com/bitbucket-cloud/docs/maintain-a-git-repository/.

I'm using the Git shell on Windows.

$ du -sh .git
5.6G    .git

$ ./git_find_big.sh
All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file.
size     pack     SHA                                       location
3690053  3611690  0a0bfa9facc2aea79ebbfaf9ce6221a0b093a115  dbbak/DATABASE_shop.zip
1633941  206040   7599e51f805d2a5a58ef85cc3111ff97b96c7f8c  dbbak/DATABASE_shop.bak
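
For context, that script essentially lists the largest objects in the pack with git verify-pack and then resolves each SHA to a path with git rev-list. A rough sketch of the idea (not the exact Atlassian script):

# largest blobs in the pack (columns: SHA, type, size, size-in-pack, offset)
$ git verify-pack -v .git/objects/pack/pack-*.idx | grep blob | sort -k 3 -n -r | head -5

# resolve a SHA back to the path it was committed under
$ git rev-list --all --objects | grep 0a0bfa9facc2aea79ebbfaf9ce6221a0b093a115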

$ git filter-branch --index-filter 'git rm -r --cached --ignore-unmatch dbbak' HEAD
Rewrite 0d26f893e5159bafa22637efb67ad15441c363c2 (16/21) (8 seconds passed, remaining 2 predicted)    rm 'dbbak/DATABASE_shop.bak'
rm 'dbbak/DATABASE_shop.zip'
Rewrite de5bf4e33b2ed8a735d5a310f677134e116c6935 (16/21) (8 seconds passed, remaining 2 predicted)    rm 'dbbak/DATABASE_shop.zip'

Ref 'refs/heads/master' was rewritten
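
Note that this only rewrote HEAD, i.e. master. If other branches or tags also referenced dbbak, the filter would presumably have to run over every ref, something like (a sketch; --tag-name-filter cat makes existing tags point at the rewritten commits):

$ git filter-branch --index-filter 'git rm -r --cached --ignore-unmatch dbbak' --tag-name-filter cat -- --all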

$ du -sh .git
5.6G    .git  <-- still same amount used

$ git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
(no output)
$ git reflog expire --expire=now --all
(no output)
$ git gc --prune=now
Enumerating objects: 16861, done.
Counting objects: 100% (16861/16861), done.
Delta compression using up to 8 threads
Compressing objects: 100% (9896/9896), done.
Writing objects: 100% (16861/16861), done.
Total 16861 (delta 6373), reused 16843 (delta 6367), pack-reused 0

$ du -sh .git
5.6G    .git <-- Still same amount used 
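
A quick way to see where the space actually sits (loose objects vs. packs) is git count-objects; size-pack is the figure that should shrink once the prune really works:

$ git count-objects -v -H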

$ git push --all --force
$ git push --tags --force 
# Doesn't alter the space used; I didn't expect it to.

If I re-run ./git_find_big.sh, the big files are still there :-(

If I clone from Bitbucket into a new folder, the entire folder is 1.3G and .git is 571M.

git log shows the entire commit log.

I am tempted to just delete the entire repository at Bitbucket and re-upload the slim 1.3G/571M version (sketched below).
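
If I go that route, the idea would be to take a fresh mirror clone (which only contains the reachable, slim history), delete and recreate the Bitbucket repository, and push the mirror back. Roughly (a sketch, with placeholder URLs):

$ git clone --mirror https://bitbucket.org/<user>/<repo>.git slim.git
$ cd slim.git
# after deleting and recreating the (empty) repository on Bitbucket:
$ git push --mirror https://bitbucket.org/<user>/<repo>.git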

What am I missing?

ADDITION: Now I get:

$ ./git_find_big.sh
All sizes are in kB's. The pack column is the size of the object, compressed, inside the pack file.
size     pack     SHA                                       location
3690053  3611690
1633941  206040
1417160  1381048
165448   164633   8ba397bd7aabe9c09b365e0eb9f79ccdc9a7dce5  dymo/DLS8Setup.8.7.1.exe

I.e. the SHAs and filenames are gone, but the bits are still there. (I omitted some of the files before to avoid clutter.)

WTF...
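
To double-check whether that big blob is still reachable from anything, the SHA from the first scan can be queried directly (a sketch):

$ git cat-file -t 0a0bfa9facc2aea79ebbfaf9ce6221a0b093a115   # prints "blob" if the object still exists at all
$ git rev-list --all --objects | grep 0a0bfa9                # no output = not reachable from any ref, only kept alive in the pack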

Leif Neland
  • All the git functions create *new* branches. The old ones do not disappear immediately, until a pre-configured amount of time or other trigger has occurred to tell git to clean up orphans and stale branches. – Mad Physicist Jan 14 '21 at 03:37
  • As you said, a fast way to bypass that time limit is to clone the right version, delete the server version, and recreate it. Even if you pruned the dead branches properly on your machine, pushing it to the server would not automatically prune them there. – Mad Physicist Jan 14 '21 at 03:38
  • shouldn't git gc --prune=now trigger the prune? – Leif Neland Jan 14 '21 at 03:39
  • Sure, on your local clone. Although there may be some additional configuration you need to set. I'd wait for one of the experts here to take a look at this. I'm more than a casual user, but not too familiar with the detailed behaviors of stuff like filter-branch. – Mad Physicist Jan 14 '21 at 03:43
  • But it doesn't. Still 5.6G – Leif Neland Jan 14 '21 at 03:46
  • It's now 4:48 in the morning, I think I'll do a Scarlet O'Hara: "I can't think about that right now. If I do, I'll go crazy. I'll think about that tomorrow." – Leif Neland Jan 14 '21 at 03:50
  • There was a bug of sorts introduced in `git gc` and/or its helpers recently, such that `git gc --prune=now` doesn't work unless you run it twice in a row, or something like that. (Not having a handy test I can't check it locally.) Cloning the filtered clone works better in general. I noted that you ran `git filter-branch` on `HEAD` (master) and did not filter any other names, including tag names, but the last `./git_find_big.sh` scan suggests this was sufficient, so I'm guessing it's just this gc bug. – torek Jan 14 '21 at 04:23
  • Also, filter-branch keeps refs to the original content. If you're really really sure, `git filter-branch -f --setup exit` to lose those. You've already wiped the reflogs, so then you can `git repack -ad` to rebuild the object db from scratch. – jthill Jan 14 '21 at 04:32
  • Clone the local repository to another path and check if the new clone's size get smaller. Some objects may still be referred to by special refs like reflogs. – ElpieKay Jan 14 '21 at 07:04
  • `git_find_big.sh` searches for file names using `git rev-list --all --objects`. `--all` lists refs, but not the reflog. Try adding `--reflog` to that line in the script: other=`git rev-list --all --reflog --objects | grep $sha`. If this "fixes" your output, it means the big blob is referenced from the reflog. – LeGEC Jan 14 '21 at 08:52
  • When pruning the reflog, try `git reflog expire --expire=all --expire-unreachable=all --all` – LeGEC Jan 14 '21 at 09:01
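
Pulling the suggestions from these comments together, the full local cleanup would look roughly like this (a sketch; the aggressive reflog expiry is LeGEC's suggestion, the repack is jthill's):

$ git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d   # drop filter-branch's backup refs
$ git reflog expire --expire=all --expire-unreachable=all --all
$ git repack -ad   # rebuild the pack from reachable objects only
$ git gc --prune=now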

1 Answer


Instead of filter-branch, try git filter-repo:

Install it first. (python3 -m pip install --user git-filter-repo)

Then, for example:

git filter-repo --strip-blobs-bigger-than 10M
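
For this specific repository, you could presumably also target the offending folder directly and then force-push the rewritten history (a sketch; note that git filter-repo insists on being run in a fresh clone by default):

git filter-repo --path dbbak --invert-paths
git push --force --all
git push --force --tags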
VonC