163

I have a GitHub repository that had two branches - master and release.

The release branch contained binary distribution files that were contributing to a very large repository size (more than 250 MB), so I decided to clean things up.

First I deleted the remote release branch, via git push origin :release.

Then I deleted the local release branch. First I tried git branch -d release, but Git said "error: The branch 'release' is not an ancestor of your current HEAD." which is true, so then I did git branch -D release to force it to be deleted.

But my repository size, both locally and on GitHub, was still huge. So then I ran through the usual list of Git commands, like git gc --prune=today --aggressive, without any luck.

By following Charles Bailey's instructions at SO 1029969 I was able to get a list of SHA-1 hashes for the biggest blobs. I then used the script from SO 460331 to find the blobs...and the five biggest don't exist, though smaller blobs are found, so I know the script is working.

I think these blobs are the binaries from the release branch, and that they were somehow left behind when that branch was deleted. What's the right way to get rid of them?
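One way to test that suspicion is to ask `git fsck` which objects are unreachable once reflogs are ignored. The sketch below reproduces the situation in a throwaway repository. The file names and contents are made up for illustration, and it assumes a reasonably recent Git; it is not output from the repository in the question.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

# One commit on the default branch, one binary-like file on "release".
echo "source" > app.txt
git add app.txt && git commit -q -m "initial commit"
main=$(git symbolic-ref --short HEAD)   # master or main, depending on config
git checkout -q -b release
echo "pretend this is a large binary" > big.bin
git add big.bin && git commit -q -m "add binary"
blob=$(git rev-parse HEAD:big.bin)

# Delete the branch the same way as in the question.
git checkout -q "$main"
git branch -D release > /dev/null

# By default fsck treats reflog entries as references, so the blob is not reported.
seen_with_reflog=$(git fsck --unreachable 2>/dev/null | grep -c "$blob" || true)
# Ignoring reflogs shows whether only the reflog still holds on to it.
seen_without_reflog=$(git fsck --unreachable --no-reflogs 2>/dev/null | grep -c "$blob" || true)

echo "reported with reflogs:    $seen_with_reflog"
echo "reported without reflogs: $seen_without_reflog"
```

If the blob shows up only with `--no-reflogs`, the HEAD reflog is what keeps it in the repository.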

Peter Mortensen
kkrugler
  • What version of Git are you using? And did you try http://stackoverflow.com/questions/1106529/how-to-skip-loose-object-popup-when-running-git-gui/1108084#1108084 ? – VonC Dec 15 '09 at 04:56
  • git version 1.6.2.3 I'd tried gc and prune w/various arguments. I hadn't tried repack -a -d -l, just ran it, no change. – kkrugler Dec 15 '09 at 14:32
  • 2
    New info - a fresh clone from GitHub no longer has the unreferenced blobs, and is down to "only" 84MB from 250MB. – kkrugler Dec 15 '09 at 14:33

11 Answers

274

I present to you this useful command, "git-gc-all", guaranteed to remove all your Git garbage, at least until they come up with extra configuration variables:

git -c gc.reflogExpire=0 -c gc.reflogExpireUnreachable=0 -c gc.rerereresolved=0 \
    -c gc.rerereunresolved=0 -c gc.pruneExpire=now gc

You might also need to run something like these first:

git remote rm origin
rm -rf .git/refs/original/ .git/refs/remotes/ .git/*_HEAD .git/logs/
git for-each-ref --format="%(refname)" refs/original/ |
    xargs -n1 --no-run-if-empty git update-ref -d

You might also need to remove some tags:

git tag | xargs git tag -d
Sam Watkins
  • 1
    Interesting. A good alternative to my more general answer. +1 – VonC Feb 06 '13 at 16:30
  • 10
    This deserves more up votes. It finally got rid of a lot of git objects other methods would keep. Thanks! – Jean-Philippe Pellet Oct 29 '13 at 17:33
  • 1
    Upvoted. Wow, I don't know what I just did but it seems to clean up a lot. Can you elaborate on what it does? I have the feeling it cleared out all my `objects`. What are those and why are they (apparently) irrelevant? – Redsandro Jan 16 '14 at 21:52
  • 2
    @Redsandro, as I understand, those "git remote rm origin", "rm" and "git update-ref -d" commands remove references to old commits for remotes and such, which might be preventing garbage collection. The options to "git gc" tell it not to hold on to various old commits, else it will hold on to them for a while. E.g. gc.rerereresolved is for "records of conflicted merge you resolved earlier", by default kept for 60 days. Those options are in the git-gc manpage. I'm not an expert on git and don't know exactly what all these things do. I found them from manpages, and grepping .git for commit refs. – Sam Watkins Jan 20 '14 at 05:23
  • 1
    A git object is a compressed file or tree or commit in your git repo, including old stuff from the history. git gc clears out unneeded objects. It keeps objects which are still needed for your current repo, and its history. – Sam Watkins Jan 20 '14 at 05:27
  • How bad is it to run without --no-run-if-empty? (This argument is not supported on OS X.) – Charles L. Dec 10 '14 at 17:32
  • @CharlesL, it doesn't matter, it will just give a harmless error message if there are no refs to delete. – Sam Watkins Dec 11 '14 at 23:36
  • 1
    I wrote a script which uses this one, to delete all history from a git repo: http://sam.nipl.net/b/git-kill-history If run on two consistent repos it produces the same commits / hashes, so there's no need to "re-clone" after destroying history, just run `git-kill-history` on both sides. – Sam Watkins Jan 29 '15 at 03:51
  • 1
    Nothing else I tried worked, but when I ran this it immediately worked. Thanks! – Erik Oct 31 '15 at 18:21
  • What is the 'in' branch referenced in the linked script ? – Zitrax May 23 '16 at 11:16
  • 1
    Should also mention that any tag that exists in the repository can keep history alive. Since this script does not deal with them, you need to remove them manually first if you want to clean the history leading up to tags. For example, to remove all local tags you can run: `git tag | xargs git tag -d`. – Zitrax May 23 '16 at 11:33
  • @Zitrax, that 'in' branch is a temporary branch that I use so I can safely merge stuff that is pushed to that branch. Personal setup. – Sam Watkins May 26 '16 at 15:10
  • 1
    Don't use `git gc --aggressive`. https://metalinguist.wordpress.com/2007/12/06/the-woes-of-git-gc-aggressive-and-how-git-deltas-work/ – jpmc26 Jun 03 '16 at 18:59
  • 2
    So this method didn't work for me. I found that references still existed inside `.git/info/refs` and `.git/packed-refs`. Removing these references with vim and then running the command succeeded. Although I'm not entirely sure the evil commits weren't still inside a pack. So I unpacked the packs as http://stackoverflow.com/questions/16972031/how-to-unpack-all-objects-of-a-git-repository for good measure. I would probably advise people to just do a clone and then blow away the original repository. – Att Righ Jan 23 '17 at 17:36
  • 1
    The method I was using to ensure that objects were deleted was `git rev-list --objects --all | awk '{ print $1 }' | xargs -n 1 git cat-file -p | less`. – Att Righ Jan 23 '17 at 17:38
  • You, Sir, are the freaking bomb! I humbly bow in reverence to your enlightening persona! – Fernando Espinosa Jul 26 '17 at 01:04
  • this command does nothing `git for-each-ref --format="%(refname)" refs/original/ | xargs -n1 --no-run-if-empty git update-ref -d` – acgbox May 25 '20 at 22:11
  • @SamWatkins This reduces the git size of the local repo. How do we reflect this change on the remote repo also? Because it will all come back after recloning the repo. Also, `git status` doesn't show any changes so there is no way to push this `cleaning` to repo – Pankaj Singhal Mar 01 '22 at 08:30
  • @PankajSinghal, you could run the cleanup on the remote repo also, or remove the remote repo and create it again from the local repo. – Sam Watkins Mar 02 '22 at 01:26
123

You can (as detailed in this answer) permanently remove everything that is referenced only in the reflog.

WARNING: This will remove many objects you might want to keep:

  • All of your stashes.
  • Old history not in any current branches.

Read the documentation to be sure this is what you want.

To expire the reflog, and then prune all objects not in branches:

git reflog expire --expire-unreachable=now --all
git gc --prune=now

git reflog expire --expire-unreachable=now --all removes every reflog entry that points to an unreachable commit.

git gc --prune=now removes the commits themselves.

Attention: Only using git gc --prune=now will not work since those commits are still referenced in the reflog. Therefore, clearing the reflog is mandatory. Also note that if you use rerere it has additional references not cleared by these commands. See git help rerere for more details. In addition, any commits referenced by local or remote branches or tags will not be removed because those are considered as valuable data by git.
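As a rough illustration of why both steps are needed, the sketch below builds a throwaway repository (all names and contents are made up), deletes a branch, and checks whether its blob survives each step. It assumes a reasonably recent Git and is a demonstration under those assumptions, not a guarantee for every setup.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

echo "source" > app.txt
git add app.txt && git commit -q -m "initial commit"
main=$(git symbolic-ref --short HEAD)
git checkout -q -b release
echo "pretend this is a large binary" > big.bin
git add big.bin && git commit -q -m "add binary"
blob=$(git rev-parse HEAD:big.bin)
git checkout -q "$main"
git branch -D release > /dev/null

# Step 1: gc alone. The HEAD reflog still references the deleted commit.
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then after_gc=present; else after_gc=gone; fi

# Step 2: expire unreachable reflog entries first, then gc again.
git reflog expire --expire-unreachable=now --all
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then after_expire=present; else after_expire=gone; fi

echo "after gc only:            $after_gc"
echo "after reflog expire + gc: $after_expire"
```

Running it on a copy of a real repository first is a cheap way to see what would be lost before doing the same on the original.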

ideasman42
leoly
  • 24
    It worked, but somehow I lost my saved stashes in the process (nothing major in my case, just a caution for others) – Amro Jan 14 '17 at 10:51
  • 1
    why not --aggressive ? – JoelFan Feb 10 '17 at 16:35
  • 7
    I think this answer needs a clear warning, preferably at the top. My edit suggestion was rejected, because I guess I should suggest it to the author in a comment? Please either accept this edit https://stackoverflow.com/review/suggested-edits/26023988 or add a warning your own way. Also, this **drops all your stashes**. That should be mentioned in the warning too! – Inigo May 04 '20 at 20:13
  • I tested with git version 2.17 and stashed commits will not be removed by the above commands. Are you sure you didn't run any additional commands? – Mikko Rantalainen May 06 '20 at 17:37
  • 1
    `git fetch --prune` further reduces the size, because it drops stale remote-tracking refs so the local blobs they kept alive can be deleted. – hectorpal Jul 03 '20 at 19:39
  • How to push these repo cleanup changes in all branches? – Harshal Patil Jul 20 '21 at 12:36
  • Can this command be changed to include stashes? I would consider stashes to be part of my repository, so including them would be useful. – ideasman42 Jun 02 '23 at 04:23
35

As mentioned in this SO answer, git gc can actually increase the size of the repo!

See also this thread

Now git has a safety mechanism to not delete unreferenced objects right away when running 'git gc'.
By default unreferenced objects are kept around for a period of 2 weeks. This is to make it easy for you to recover accidentally deleted branches or commits, or to avoid a race where a just-created object in the process of being but not yet referenced could be deleted by a 'git gc' process running in parallel.

So to give that grace period to packed but unreferenced objects, the repack process pushes those unreferenced objects out of the pack into their loose form so they can be aged and eventually pruned.
Objects becoming unreferenced are usually not that many though. Having 404855 unreferenced objects is quite a lot, and being sent those objects in the first place via a clone is stupid and a complete waste of network bandwidth.

Anyway... To solve your problem, you simply need to run 'git gc' with the --prune=now argument to disable that grace period and get rid of those unreferenced objects right away (safe only if no other git activities are taking place at the same time which should be easy to ensure on a workstation).

And BTW, using 'git gc --aggressive' with a later git version (or 'git repack -a -f -d --window=250 --depth=250')

The same thread mentions:

 git config pack.deltaCacheSize 1

That limits the delta cache size to one byte (effectively disabling it) instead of the default of 0 which means unlimited. With that I'm able to repack that repository using the above git repack command on an x86-64 system with 4GB of RAM and using 4 threads (this is a quad core). Resident memory usage grows to nearly 3.3GB though.

If your machine is SMP and you don't have sufficient RAM then you can reduce the number of threads to only one:

git config pack.threads 1

Additionally, you can further limit memory usage with the --window-memory argument to 'git repack'.
For example, using --window-memory=128M should keep a reasonable upper bound on the delta search memory usage although this can result in less optimal delta match if the repo contains lots of large files.
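The settings quoted above can also be applied to a single run with `git -c` instead of being written to the repository configuration. The sketch below does that in a throwaway repository with made-up contents; the values are the ones mentioned in the thread, not tuned recommendations.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

# A few commits so there is something to repack.
for i in 1 2 3; do
    echo "revision $i" >> notes.txt
    git add notes.txt
    git commit -q -m "revision $i"
done

# One thread, delta cache effectively disabled, delta search window capped at 128M.
git -c pack.threads=1 -c pack.deltaCacheSize=1 \
    repack -a -d -f -q --window-memory=128M

packs=$(ls .git/objects/pack/*.pack | wc -l | tr -d ' ')
echo "pack files after repack: $packs"
```

Passing the options with `-c` keeps the experiment from changing the repository's stored configuration.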


On the filter-branch front, you can consider (with caution) this script:

#!/bin/bash
set -o errexit

# Author: David Underhill
# Script to permanently delete files/folders from your git repository.  To use 
# it, cd to your repository's root and then run the script with a list of paths
# you want to delete, e.g., git-delete-history path1 path2

if [ $# -eq 0 ]; then
    exit 0
fi

# make sure we're at the root of git repo
if [ ! -d .git ]; then
    echo "Error: must run this script from the root of a git repository"
    exit 1
fi

# remove all paths passed as arguments from the history of the repo
files=$@
git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch $files" HEAD

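# remove the backup refs and reflog entries that git-filter-branch leaves behind,
# then prune right away (without --expire=now / --prune=now the default grace periods still apply)
rm -rf .git/refs/original/ && git reflog expire --expire=now --all && git gc --aggressive --prune=now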
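To see what a run of this kind does, here is a small sketch against a throwaway repository. The file names are made up, `FILTER_BRANCH_SQUELCH_WARNING` is set only to skip the warning delay that newer Git versions print, and the cleanup uses explicit `now` expiry values. Treat it as an illustration of the technique rather than a replacement for the script.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

echo "source" > app.txt
echo "pretend this is a large binary" > big.bin
git add . && git commit -q -m "add app and binary"
echo "more source" >> app.txt
git commit -q -am "update app"
blob=$(git rev-parse HEAD:big.bin)

# Rewrite history so that big.bin never existed on this branch.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
    --index-filter 'git rm -rf --cached --ignore-unmatch -q big.bin' HEAD > /dev/null 2>&1

# Drop the backup refs filter-branch created, then expire reflogs and prune.
git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do git update-ref -d "$ref"; done
git reflog expire --expire=now --all
git gc -q --prune=now

if git cat-file -e "$blob" 2>/dev/null; then blob_status=present; else blob_status=gone; fi
remaining=$(git log --all --oneline -- big.bin)
echo "blob is $blob_status"
```

Every commit hash changes after such a rewrite, so anyone else working on the repository has to re-clone or rebase onto the new history.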
Community
VonC
  • http://stackoverflow.com/questions/359424/detach-subdirectory-into-separate-git-repository is also a good start for the `filter-branch` command usage. – VonC Dec 15 '09 at 16:28
  • Hi VonC - I'd tried git gc prune=now with no luck. It really looks like a git bug, in that I wound up with unreferenced blobs locally following a branch deletion, but these aren't there with a fresh clone of the GitHub repo...so it's just a local repo problem. But I have additional files that I want to clear out, so the script you referenced above is great - thanks! – kkrugler Dec 16 '09 at 17:01
22

git gc --prune=now, or low level git prune --expire now.

Jakub Narębski
14

Each time your HEAD moves, Git tracks this in the reflog. If you removed commits, you still have "dangling commits" because they are still referenced by the reflog for about 30 days. This is the safety net when you delete commits by accident.

You can use the git reflog command to remove specific commits, repack, etc., or just the high level command:

git gc --prune=now
Peter Mortensen
vdboor
3

Before running git filter-branch and git gc, review the tags in your repository. Any real system with automatic tagging, for things like continuous integration and deployments, will leave unwanted objects referenced by those tags. gc cannot remove them, and you will keep wondering why the repository is still so big.

The best way to get rid of all the unwanted stuff is to run git filter-branch and git gc, and then push master to a new bare repository. The new bare repository will have the cleaned-up tree.
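The effect of a leftover tag can be checked in a throwaway repository. In the sketch below, `build-42` is a hypothetical tag of the kind a CI system might create, and all file contents are made up; it assumes a reasonably recent Git.

```shell
set -e
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

echo "source" > app.txt
git add app.txt && git commit -q -m "initial commit"
main=$(git symbolic-ref --short HEAD)
git checkout -q -b release
echo "pretend this is a large binary" > big.bin
git add big.bin && git commit -q -m "add binary"
blob=$(git rev-parse HEAD:big.bin)
git tag build-42                       # hypothetical automatic tag
git checkout -q "$main"
git branch -D release > /dev/null

# Even with the reflog emptied, the tag keeps the commit and its blob alive.
git reflog expire --expire=now --all
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then with_tag=present; else with_tag=gone; fi

# Once the tag is deleted, the next gc can drop the blob.
git tag -d build-42 > /dev/null
git gc -q --prune=now
if git cat-file -e "$blob" 2>/dev/null; then without_tag=present; else without_tag=gone; fi

echo "with tag: $with_tag, without tag: $without_tag"
```

`git tag --contains <commit>` is a quick way to find which tags still point at history you expected to be gone.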

Peter Mortensen
v_abhi_v
2

You can use git forget-blob.

The usage is pretty simple:

git forget-blob file-to-forget

You can get more information in Completely remove a file from a Git repository with 'git forget-blob'.

It will disappear from all the commits in your history, reflog, tags, and so on.

I run into the same problem every now and then, and every time I have to come back to this post and others. That's why I automated the process.

Credits go to contributors such as Sam Watkins.

Peter Mortensen
nachoparker
  • 1
    This damaged my git repository after I ran it. Now I get: fatal: 'origin' does not appear to be a git repository, when I run git push origin branchname Full error: fatal: 'origin' does not appear to be a git repository fatal: Could not read from remote repository. Please make sure you have the correct access rights and the repository exists. git version 2.22.0 – gbenroscience Feb 24 '21 at 23:32
1

To add another tip, don't forget to use git remote prune to delete the obsolete branches of your remotes before using git gc.

You can see them with git branch -a

It's often useful when you fetch from GitHub and forked repositories...
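A minimal sketch of that tip, using a local bare repository as a stand-in for GitHub (the directory names `origin.git` and `clone` are made up for the demonstration):

```shell
set -e
work=$(mktemp -d) && cd "$work"
git init -q --bare origin.git
git clone -q origin.git clone 2>/dev/null
cd clone
git config user.name "Demo" && git config user.email "demo@example.com"

echo "source" > app.txt
git add app.txt && git commit -q -m "initial commit"
main=$(git symbolic-ref --short HEAD)
git push -q origin "$main"
git checkout -q -b release
echo "pretend this is a large binary" > big.bin
git add big.bin && git commit -q -m "add binary"
git push -q origin release
git checkout -q "$main"
git branch -D release > /dev/null

# Simulate someone else deleting the branch on the remote.
git --git-dir=../origin.git update-ref -d refs/heads/release

# The stale remote-tracking branch is still listed locally...
before=$(git branch -r | grep -c 'origin/release' || true)
# ...until it is pruned.
git remote prune origin > /dev/null
after=$(git branch -r | grep -c 'origin/release' || true)

echo "origin/release listed before prune: $before, after prune: $after"
```

Only after the stale `origin/release` ref is gone can a later `git gc` consider the objects behind it unreachable.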

Peter Mortensen
Tanguy
1

Try git-filter-branch. It does not remove big blobs directly, but it can remove the big files you specify from the whole repository history. For me it reduced the repository size from hundreds of MB to 12 MB.

Peter Mortensen
Sergey Miryanov
  • 6
    Now _that_ is a scary command :) I'll have to give it a try when my git-fu feels stronger. – kkrugler Dec 15 '09 at 14:36
  • you can say that again. I'm always wary of any commands that manipulate a repository's history. Things tend to go very wrong when multiple people are pushing and pulling from that repository and suddenly a bunch of objects git is expecting aren't there. – Jonathan Dumaine Aug 12 '11 at 19:54
1

Sometimes, the reason that "gc" doesn't do much good is that there is an unfinished rebase or stash based on an old commit.

StellarVortex
  • Or the old commit is referenced by HEAD, ORIG_HEAD, FETCH_HEAD, the reflog or some other thing that git keeps automatically to make sure it never loses anything valuable. If you really want to lose all those, you have to go the extra mile to do so. – Mikko Rantalainen May 07 '20 at 06:48
0

Try the approach from this gist:

git gc --prune="0 days"
Henry Ecker