I am cleaning a local git repo with a lot of large tarballs in the history. I did the following steps:

  1. List all the tarball files in the repo
FILE_LIST=`git rev-list master | while read rev; do git ls-tree -lr $rev  | cut -c54- | sed 's/^ +//g;'; done | grep <tarball name> | awk '{print $2}' | sort | uniq | tr '\n' ' '`
  2. Mark them for deletion
git filter-branch --tag-name-filter cat --index-filter "git rm -r --cached --ignore-unmatch $FILE_LIST" --prune-empty -f -- --all
  3. Garbage collection
rm -rf .git/refs/original/ && git reflog expire --expire=now --all && git gc --aggressive --prune=now
  4. Push
git push origin --force --all && git push origin --force --tags

This reduced the size of the local repo significantly. However, when I made a fresh clone from origin after the steps above, the clone was no smaller, even though the large tarballs are gone from its history, which I verified with:

FILE_LIST=`git rev-list master | while read rev; do git ls-tree -lr $rev  | cut -c54- | sed 's/^ +//g;'; done | grep <tarball name> | awk '{print $2}' | sort | uniq | tr '\n' ' '`

I ran the garbage collection step again in the cloned repo, but its size did not go down.

Does anyone know how I can reduce the repo size on the origin server? Thanks in advance.

Keith Thompson
pyang
  • I've edited your question so the commands are formatted as code. You might consider editing the commands with line continuations using backslashes so they can be read without scrolling to the side. – Keith Thompson Oct 23 '19 at 22:07
  • https://stackoverflow.com/questions/27867775/how-to-cleanup-garbage-in-remote-git-repo/37253227 – Saurabh P Bhandari Oct 23 '19 at 22:49
  • Hi Saurabh, I can understand that the size of the remote repo is not reduced in the thread you posted because the remote server needs to do the garbage collection as well. But my case is different, I did a fresh clone from the remote after the cleanup and push. I do not expect that the size of this clone is reduced without garbage collection again. But after another garbage collection, the size of this clone was not reduced by a single byte, and strangely I could not find those large tarballs in the history either. I wonder which takes up so much space. – pyang Oct 24 '19 at 21:06
  • BTW, I use du -h command to measure the repo directory size. – pyang Oct 24 '19 at 21:08
  • You might want to look into this https://stackoverflow.com/questions/8185276/find-size-of-git-repo for repo size – Saurabh P Bhandari Oct 26 '19 at 13:53
  • It turned out that if one clones the repo with the --mirror option, does the cleaning, and then pushes with push --mirror, the remote repo size is reduced too. – pyang Oct 28 '19 at 23:08
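
The mirror workflow from the last comment could look roughly like the sketch below. It is only an illustration: the function name clean_via_mirror and its arguments (remote URL, scratch directory, paths to drop) are made up here, paths containing spaces are not handled, and whether the server-side size actually shrinks is what the comment reports, not something this script checks. One detail that matters with push --mirror is deleting the refs/original/ backup refs first, otherwise they get pushed as well.

```shell
#!/usr/bin/env bash
# Sketch only: clean_via_mirror and its parameters are hypothetical names.
# Usage: clean_via_mirror <remote-url> <scratch-dir> <path>...
clean_via_mirror() {
    local url=$1 dir=$2
    shift 2

    # A mirror clone is bare and carries every ref of the remote.
    git clone --quiet --mirror "$url" "$dir" || return 1
    (
        cd "$dir" || exit 1

        # Newer git versions print a warning and pause before filter-branch runs;
        # this variable turns that off.
        export FILTER_BRANCH_SQUELCH_WARNING=1

        # Drop the given paths from every commit on every ref.
        git filter-branch -f --tag-name-filter cat \
            --index-filter "git rm -r --cached --ignore-unmatch --quiet -- $*" \
            --prune-empty -- --all &&

        # Remove filter-branch's backup refs so push --mirror does not publish them.
        git for-each-ref --format='delete %(refname)' refs/original |
            git update-ref --stdin &&

        # Same garbage collection as in the question, run inside the mirror.
        git reflog expire --expire=now --all &&
        git gc --aggressive --prune=now --quiet &&

        # Overwrite every ref on the remote with the rewritten history.
        git push --quiet --mirror origin
    )
}
```

It would be called along the lines of clean_via_mirror <remote url> /tmp/repo-mirror.git $FILE_LIST, with FILE_LIST built as in the question.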

1 Answer

git rev-list --all --objects |                                      # catalog of everything
git cat-file --batch-check='%(objectname) %(objectsize) %(rest)' |  # sha, size, name
awk '$2>limit{print $1}' limit=$((1*1024*1024))                     # just the oversize ones

will list the ids of all objects larger than 1 MiB in your repo. Finding the commits that introduced them is a matter of hunting through

git log --all --raw --no-abbrev --pretty=format:%H \
| awk 'NF==1 { commit=$1 } NF!=1 { print commit,$4 }'

for matches to your big objects: write the big ids to a file, then run grep -Ff with that file over the raw log output to see which commits introduced which big object. Figuring out the rest, I'll leave to you.
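
Put together, the two pipelines and the grep -Ff step might be wrapped up as in the sketch below. The function name find_big_object_commits and its optional size argument are made up for illustration, and the awk filter uses NF > 1 instead of NF != 1 so the blank lines in the log output are skipped; everything else follows the commands above.

```shell
#!/usr/bin/env bash
# Sketch only: find_big_object_commits is a hypothetical helper name.
# Prints "<commit> <object id>" for every raw-log entry whose new object id
# belongs to an object larger than the limit (bytes, default 1 MiB).
find_big_object_commits() {
    local limit=${1:-$((1024 * 1024))}
    local ids
    ids=$(mktemp) || return 1

    # Step 1: ids of all objects above the limit, one per line.
    git rev-list --all --objects |
    git cat-file --batch-check='%(objectname) %(objectsize) %(rest)' |
    awk -v limit="$limit" '$2 > limit { print $1 }' > "$ids"

    # Step 2: "commit new-object-id" pairs from the raw log, filtered by the id list.
    if [ -s "$ids" ]; then
        git log --all --raw --no-abbrev --pretty=format:%H |
        awk 'NF == 1 { commit = $1 } NF > 1 { print commit, $4 }' |
        grep -Ff "$ids"
    fi

    rm -f "$ids"
}
```

Run inside the repo, find_big_object_commits prints one line per commit that added or changed one of the oversize objects; find_big_object_commits $((10*1024*1024)) raises the threshold to 10 MiB.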

jthill