26

So let me preface this question by saying that I am aware of the previous questions pertaining to this subject on Stack Overflow. In fact, I've tried all the solutions I could find, but there is a binary file in my repo that just refuses to be removed and continues to greatly inflate my repo size.

Methods I've tried,

Both of which were recommended by Darhuuk's answer to "Remove files from git repo completely".

However, after trying both of those solutions, the script to find large files in git still finds the offending binary, while the script from this answer no longer finds the commit for the binary. Both of these scripts were suggested by this answer.

The repo is still 44 MB after the attempts at removal, which is way too large for the relatively small size of the source. This suggests the large file script is doing its job properly. I've tried pushing up to Github (I made a fork just in case) and then doing a fresh clone to see if the repo size had decreased, but it is still the same size.

Can someone explain what I am doing wrong or suggest an alternative method?

I should note that I am not just interested in trimming the file from my local repo, I also want to be able to fix the remote repo on Github.

James McMahon
  • Is it possible those methods aren't working because I have multiple branches? – James McMahon Jun 29 '12 at 04:39
  • Yes...if any branches (including remote branches retrieved by fetch) have references to an object, it won't be pruned as unreachable. – Todd A. Jacobs Jun 29 '12 at 06:49
  • So I guess the question becomes, how do I remove the object from the repo that is pulled from Github and then push back up the repo sans binary file? – James McMahon Jun 29 '12 at 14:43
  • I haven't had any luck yet with the methods below, can anyone else suggest a solution? Is there a tool to recreate the repo from scratch, sans the binary file? – James McMahon Jun 29 '12 at 19:52
  • Another update, I have some egg on my face, my local rewrite of history wasn't succeeding because I wasn't using the full path to the file (I could have used a path wildcard as well). So I can get my local repo down in size (down to 1mb from 44mb), but after pushing to the remote Github repo, it is still the same large size as the repo with the binary. – James McMahon Jun 30 '12 at 02:16

4 Answers

27

2017 Edit: You should probably look into BFG Repo-Cleaner if you are reading this.


So embarrassingly the reason why my local repos were not shrinking in size is because I was using the wrong path to the file in filter-branch. So while I thank J-16 SDiZ and CodeGnome for their answers my problem was between the chair and the keyboard.

In an effort to make this question less of a monument to my stupidity and actually useful to people I've taken the time to write up the steps one would have to go through after trimming the repo in order to get the repo back up on Github. Hope this helps someone out down the line.


Removing offending files

To remove the offending files, run the shell script below, which is based on the Github "remove sensitive data" howto.

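#!/usr/bin/env bash
# Usage: pass the full repo-relative path (or a git pathspec wildcard) of the file to purge as $1.
# The double quotes let $1 expand here; the inner single quotes protect the path when filter-branch evaluates the command.
git filter-branch --index-filter "git rm -r -q --cached --ignore-unmatch -- '$1'" --prune-empty --tag-name-filter cat -- --all

# Drop the backup refs and reflog entries that still point at the old history, then repack.
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
git gc --aggressive --prune=now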

I ran this on every branch of my local repository, but that is not needed: the trailing -- --all makes filter-branch rewrite every ref in a single run. You do, however, need every branch available locally for the next step, so keep that in mind. Once you are done you should see the size of your local repo decrease. You should also be able to run the blob script in CodeGnome's answer and see that the offending blob has been removed. If not, double check that the file name and path are correct.
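To put a number on the size decrease, a minimal sketch along these lines may help. It assumes git count-objects is available and sums its size and size-pack fields (both reported in KiB); repo_size_kib is a hypothetical helper name, not a Git command.

```shell
# repo_size_kib (hypothetical helper): print the on-disk size of a repo's
# object store in KiB, i.e. loose objects ("size") plus packs ("size-pack").
# Takes an optional repository path; defaults to the current directory.
repo_size_kib() {
    git -C "${1:-.}" count-objects -v |
        awk '/^size(-pack)?:/ { sum += $2 } END { print sum }'
}
```

Running repo_size_kib before and after the cleanup script should show the difference; a plain du -sh .git gives a similar picture.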

What git filter-branch is actually doing here is running the command listed in quotes on each commit in the repo.

The rest of the script just cleans any cached version of the old data.
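One way to confirm that the rewrite really caught the file is to ask git log whether any ref still has a commit touching that path. This is a sketch, not part of the original script; path_in_history is a hypothetical helper name. Note that --all also covers the refs/original backups, so it only reports a clean history once those have been deleted.

```shell
# path_in_history (hypothetical helper): succeeds (exit 0) if any commit
# reachable from any ref still touches the given path, fails otherwise.
# Run it from inside the repository.
path_in_history() {
    [ -n "$(git log --all --oneline -- "$1")" ]
}
```

For example, path_in_history path/to/big.bin && echo "still there" prints the message only while some ref still carries the file. The path here is a placeholder for your own file.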

Pushing the trimmed repo

Now that the local repo is in the state you need it to be, the trick is to get it back up on Github. Unfortunately, as far as I can tell there is no way to completely remove the binary data from the Github repo. Here is the quote from the Github sensitive data howto:

Be warned that force-pushing does not erase commits on the remote repo, it simply introduces new ones and moves the branch pointer to point to them. If you are worried about users accessing the bad commits directly via SHA1, you will have to delete the repo and recreate it.

It sucks that you need to recreate the Github repo, but the good news is that recreating the repo is actually pretty easy. The pain is that you also have to recreate the data in issues and the wiki, which I'll go into below.

What I recommend is creating a new repo on Github and then swapping it in for your old repo when you are ready. This can be done by renaming the old repo to something like "repo name old" and then changing the name of the newly created repo to "repo name". When you create the new repo, make sure to uncheck "Initialize with README"; otherwise you're not going to be dealing with a clean slate.

If you completed the last step you should have your repo cleaned and ready to go. The remotes now need to be changed to match the new Github repo location. I do this by editing the .git/config file directly, though I am sure someone is going to tell me that is not the right way to do it.
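If you would rather not edit .git/config by hand, git remote set-url does the same job. A minimal sketch, assuming the remote is named origin; retarget_origin is a hypothetical helper name.

```shell
# retarget_origin (hypothetical helper): point the existing "origin" remote
# at a new URL, then print the stored URL so you can confirm the change.
# Run it from inside the repository.
retarget_origin() {
    git remote set-url origin "$1" &&
        git config --get remote.origin.url
}
```

Usage would look like retarget_origin git@github.com:USER/NEW-REPO.git, where the URL is a placeholder for your new repo's address.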

Before doing the push, make sure you have all the branches and tags you want to push in your local repo. Once you are ready, push all branches and tags using the following:

git push --all
git push --tags

Now you should have a remote repo that matches your trimmed local repo. Double check that all the data made it over, just in case.
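One possible way to do that double check is to compare the local branch and tag tips against what the remote advertises. This is only a sketch under the assumption that every branch you care about exists locally; refs_match_remote is a hypothetical helper name.

```shell
# refs_match_remote (hypothetical helper): succeeds if the local branches and
# tags point at the same objects as the ones advertised by the given remote
# (default: origin). Peeled tag entries ("^{}") from ls-remote are ignored.
refs_match_remote() {
    remote=${1:-origin}
    local_refs=$(git show-ref --heads --tags | tr ' ' '\t' | sort)
    remote_refs=$(git ls-remote --heads --tags "$remote" | grep -v '\^{}$' | sort)
    [ "$local_refs" = "$remote_refs" ]
}
```

If refs_match_remote origin fails, diffing the two listings by hand should show which branch or tag is missing or different.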

Now if you don't have to worry about issues or the wiki, you are done. If you do, read on.

Moving over wikis

The Github wiki is just another repo associated with your main repo. So to get started, clone your old wiki repo somewhere. The next part is kind of tricky: as far as I can tell you need to click on the wiki tab of your new repo in order to create the wiki, but that seeds the newly created wiki with an initial file. So what I did, and I am not sure if there is a better way, is change the remote to the newly created wiki repo and push to the new location using

git push --all --force

The force is needed here because otherwise git will complain about the tip of the current branch not matching. I think this may leave the initial page in a detached state in the git repo, but the effect of that on the size of the repo should be negligible.
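Put together, the wiki move might look roughly like the sketch below. move_wiki is a hypothetical helper name and both URLs are supplied by the caller; I am assuming the wiki has already been created on the new repo by clicking its wiki tab.

```shell
# move_wiki (hypothetical helper): clone the old wiki into a scratch
# directory, repoint its origin at the new wiki, and force-push all branches
# over whatever initial page the new wiki was seeded with.
move_wiki() {
    old_url=$1
    new_url=$2
    workdir=$(mktemp -d) || return 1
    git clone "$old_url" "$workdir/wiki" &&
        git -C "$workdir/wiki" remote set-url origin "$new_url" &&
        git -C "$workdir/wiki" push --all --force
}
```

A call would look like move_wiki OLD-WIKI-URL NEW-WIKI-URL, with both arguments being placeholders for the clone URLs of the two wiki repos.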

Moving over issues

There is advice on this in this answer. But looking at the script linked in the answer, it looks fairly incomplete: there is a TODO for comment importing, and I couldn't tell whether it would bring over the state of issues or not.

So given that I had a fairly small open issues queue and that I didn't mind losing closed issues, I elected to bring things over by hand. Note that it is impossible to do this with proper attribution to other people on comments. So I think for a larger, more established project you would need to write a more robust script to bring everything over, but that wasn't needed for my particular case.

James McMahon
23

Assuming that you've already removed the blob from your history with git-filter-branch(1) and friends, Git often keeps things around in the reflogs, packfiles, and loose repository objects. The incantation to remove these unreferenced objects is:

git prune --expire=now
git reflog expire --expire-unreachable=now --rewrite --all
git repack -a -d
git prune-packed

If you've done this and you still have a bigger repository than you think you should, then you still have references to your blob somewhere in the repository. You'll have to go back to step one and remove them. This may help:

# List all blobs by size in bytes.
git rev-list --all --objects   |
    awk '{print $1}'           |
    git cat-file --batch-check |
    fgrep blob                 |
    sort -k3nr
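The listing above prints only hashes and sizes. Assuming a Git recent enough that cat-file --batch-check accepts a format string, a variant like the following sketch can also print the path of each blob, which makes it easier to see what to feed back into filter-branch. largest_blobs is a hypothetical helper name.

```shell
# largest_blobs (hypothetical helper): print "sha size path" for the N
# largest blobs reachable from any ref (N defaults to 10), largest first.
# Run it from inside the repository.
largest_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        awk '/^blob / { print substr($0, 6) }' |
        sort -k2,2nr |
        head -n "${1:-10}"
}
```

If the offending binary still shows up here after the rewrite, some ref still reaches it.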
Todd A. Jacobs
  • I'm not sure if I have an older version of Git, but `rev-list` only outputs the hashes for me, so the `awk` pipe is unnecessary. – vergenzt Jun 29 '12 at 12:39
  • The prune and reflog stuff is already in Underhill's script. No luck even with the extra options. – James McMahon Jun 29 '12 at 14:37
  • I ran your commands and I still have a big file in my repo. I found the blob with you last command, but I'm not sure what to do now. – northben Apr 19 '13 at 14:04
  • This suggestion worked for me for *my local repository*, but I'm still not sure how to propagate this to my remote. `git push ` will just tell me that everything is up to date. Consecutive cloning from the remote are still large. – worldsayshi Oct 10 '13 at 16:16
  • @worldsayshi If you've done a forced push, unreachable objects shouldn't be cloned. However, to actually remove packed or reachable objects you have to perform the commands directly on the remote. You can't perform repository surgery with client/server commands; this is a feature. – Todd A. Jacobs Oct 10 '13 at 17:00
  • @CodeGnome So you are saying that once the references are removed on the remote, cloning the repository should not carry the non-referenced objects with them? This seems intuitive, and when I tried doing this on the remote part today it actually worked. When I tried to force push on a local (bare) repository yesterday this didn't seem to work. The clone came out large. Anyway. I'm happy now. – worldsayshi Oct 11 '13 at 11:25
6

The script in "script to find large files in git" checks the .pack file -- that is, the raw object repository. The second script shows that the large object is no longer referenced. If you really want to clean that up, you may do a gc and repack:

git gc --aggressive --prune=now
git repack -A -d

If this still doesn't help, you may have a reference to the object in a remote branch. You may try:

  1. Find out which commit has this object (see Which commit has this blob?) and run git branch -a --contains <commit-ish>
  2. Remove the remote branch using git branch -r -D branchname
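The two steps above can be sketched as a single helper. This is an assumption-laden sketch, not the method from the linked question: it relies on git log --find-object, which as far as I know needs Git 2.16 or newer, and refs_reaching_blob is a hypothetical name.

```shell
# refs_reaching_blob (hypothetical helper): given a blob SHA, find the commits
# that add or remove that blob, then list every ref (local branches,
# remote-tracking branches, tags) whose history contains one of them.
# Run it from inside the repository.
refs_reaching_blob() {
    git log --all --format=%H --find-object="$1" |
        while read -r commit; do
            git for-each-ref --format='%(refname)' --contains "$commit"
        done |
        sort -u
}
```

Any refs/remotes/... entry in the output is a remote branch that is still keeping the object alive.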

Update -- What is a "remote branch"?

  • A remote branch is what git fetches into when you do a git fetch / git pull. (git pull is the same as git fetch refspec + git merge remote-branch.)

  • If you cloned from a remote repository, deleting the remote branch should have no ill effect -- you can always fetch/pull from the remote again using something like git fetch origin refs/heads/master:refs/remotes/origin/master (this pulls the master branch from the remote into the remote branch remotes/origin/master).

  • If this branch was created by you, deleting it should be okay too -- because you should have a "normal" (tracking) branch for that. But you should double-check this.

J-16 SDiZ
  • Nope, I still see the file after those two commands, the gc command was already in Underhill's script :( – James McMahon Jun 29 '12 at 04:16
  • You are definitely right about the .pack file being the issue though. The vast majority of the repo size is in that one file. – James McMahon Jun 29 '12 at 04:24
  • @JamesMcMahon ok, this means the object is in remote branch (or other refs not in normal branch). see the updated answer – J-16 SDiZ Jun 29 '12 at 05:33
  • +1 for remote branches. I didn't mention them in http://stackoverflow.com/a/2882485/6309, http://stackoverflow.com/a/685422/6309 or http://stackoverflow.com/a/2116892/6309 – VonC Jun 29 '12 at 06:29
  • I took a look at the script you're referencing. It seems to only check packfiles; it doesn't seem to do anything about loose objects. That doesn't invalidate your answer, but it's worth pointing out. – Todd A. Jacobs Jun 29 '12 at 06:52
  • @CodeGnome that script was referenced by the question, I just copied it. – J-16 SDiZ Jun 29 '12 at 07:55
  • Could you expand upon your answer, for instance, what are the ramifications of removing a remote branch? How do I get rid of the object and then push the changes to github? – James McMahon Jun 29 '12 at 14:38
  • please see the update for what is a remote branch. To get rid of it, just delete it like a normal branch. `git branch -D remotes/blar/blar` – J-16 SDiZ Jun 29 '12 at 15:20
  • @JamesMcMahon for deleting it from github -- humm, if you push using standard procedure, remote branch should never get pushed. If you did really wield stuff, you need to do a fresh clone, post the output of `git ls-remote --heads origin` and tell me which one you want to remove. – J-16 SDiZ Jun 29 '12 at 15:23
  • The commit in question is in every branch, both local and remote. – James McMahon Jun 29 '12 at 19:35
  • Interestingly the commit still shows up as being in the local branches even after running Underhill's script. – James McMahon Jun 29 '12 at 19:42
  • I can remove every remote besides remotes/origin/HEAD. When I try to run Underhill's script I get fatal error, missing object and tons of zeros. – James McMahon Jun 29 '12 at 19:51
  • missing object? it should never happend. post the exact error, do a `git fsck`. – J-16 SDiZ Jun 30 '12 at 06:56
  • none of these work. github fucking sucks. – Harlin Jul 27 '23 at 12:56
1

Can someone explain what I am doing wrong or suggest an alternative method?

Have you tried applying DMAIC? Define, Measure, Analyze, Improve, Control.

D - My repo is still large after deleting a file from git history.
M - Determine size of fresh repo using git init to establish baseline.
A - Identify, validate and select root cause. Experiment with git-repo-analysis.
I - Identify, test and implement solution. Maybe BFG Repo-Cleaner will help. Maybe it won't.
C - Sustain the gains. Look at something like Git LFS or other appropriate control method.

I also want to be able to fix the remote repo on Github.

This will depend on how you choose to resolve the problem. For example, BFG rewrites history and changes commit SHAs when it trims files, so there's going to be some give and take here depending on your specific needs and desired outcomes.

vhs