My situation is that many bulky JPGs have made it into our repo, adding 100s of MBs, much more than the src code itself.

I have since optimized these JPGs to less than 1/20 of their original file size, with no perceivable change otherwise, then committed and pushed the result.

However, local copies still have this disk space used up in the .git archives (internally containing all previous versions of all files). Anyone new pulling also gets this wasted space.

Our origin master is on Bitbucket.

I have spent considerable time trying to work this out from guides such as http://otomaton.wordpress.com/2012/12/17/saving-disk-space-by-garbage-collecting-in-git-repositories/, which uses

 git gc

http://linux.yyz.us/git-howto.html, and "How to remove local (untracked) files from the current Git working tree?", which suggests

git clean -n

Is there a way to purge only these huge JPG files, from only one particular commit, out of the local archives and also out of the online Bitbucket repo, so that no one has to pull them again? Of course we want:

  • The current versions of all files to be kept
  • As much as possible, revision histories before & after preserved, at least meta knowledge that there has been a commit (because other non-jpg files had been affected then too)
  • There are 200+ JPG files. Can this operation be done in one fell swoop? Using wildcards like *.jpg in some parameter, or a for loop?

The large JPG versions we want to get rid of have no earlier versions in the repo.

Among things I tried:

  • Before anything, how much disk space is .git using?
du
72195   ./.git
  • Find heavyweight blobs:
git verify-pack -v .git/objects/pack/pack-*.idx |sort -k 3 -n |tail -39
...
03bcb7d79c1e0a4328420bf00647319465d5d3df blob   2446210 2430913 46915147
52ea2d848645463e01d3dd143dd8d7fd24019335 blob   2467254 2443333 27573576
12d63348c0e87f9602d395e694df6a94601c12f7 blob   2506409 2485495 49346060
645fe7bfaf6ecd0140d144b4c40c19e78f103bd6 blob   2581349 2554398 10567725
72672204aa3c7aec431cba02b32ac012e52e601d blob   3084793 3041294 13122123
  • What did that last big blob contain?
 git rev-list --objects --all |grep 72672204
72672204aa3c7aec431cba02b32ac012e52e601d images/2.jpg
  • Which commits affected this particular file images/2.jpg (one of the many whose unneeded copy I hope to kill)?
git log --pretty=oneline --branches -- images/2.jpg
98dc75de48a63c2ab9661eb62895ac39ef331aaa MAPSDH-10 #time 30m #comment Grab live copy of Simon's source and push it onto Bitbucket repo; master@gordito,2014-04-10_13-55-02
3e7f36f0b1a913feaf43547bca4ad3a5a08957a6 MAPSDH-10 #time 30m #comment Grab live copy of Simon's source and push it onto Bitbucket repo; master@gordito,2014-04-10_13-31-49
  • Okay then, so try to remove only the copy of images/2.jpg prior to commit # 3e7f36f0, inclusive:
 git filter-branch --index-filter 'git rm --cached --ignore-unmatch images/2.jpg'  -- 3e7f36f0^..
Cannot rewrite branches: You have unstaged changes.
  • Since it's refusing, just remove it altogether from the cache:
 git rm --cached --ignore-unmatch images/2.jpg
rm 'images/2.jpg'
  • However, I hope this CURRENT version of images/2.jpg will still be in the repo!

  • Count the file space usage of local git archives:

git count-objects -v
count: 0
size: 0
in-pack: 284
packs: 1
size-pack: 72101
prune-packable: 0
garbage: 0
size-garbage: 0
  • size-pack is still 72101 (about 72 MB, as in the original du output). It did not free up the 3084793 bytes (about 3 MB) I expected.
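
For reference, the blob hunt above (verify-pack, then rev-list and grep for each hash) can probably be done in one pipeline. This is only a minimal sketch: the function name list_big_blobs is made up for illustration, and it reports uncompressed object sizes rather than the in-pack sizes shown by verify-pack.

```shell
#!/bin/sh
# list_big_blobs [N] (hypothetical helper name, not a git command):
# print the N largest blobs reachable from any ref, smallest first,
# one per line as "<uncompressed-size-in-bytes> <sha> <path>".
list_big_blobs() {
  # rev-list prints "<sha> <path>" for every tree and blob;
  # cat-file --batch-check echoes the path back through %(rest).
  git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
    sed -n 's/^blob //p' |   # keep blobs only, drop the type column
    sort -n |
    tail -n "${1:-20}"       # default to the 20 largest
}
```

Running list_big_blobs 39 inside the repo should give roughly the same picture as the verify-pack listing, with the path already attached to each blob.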
Marcos
  • 'git rm --cached --ignore-unmatch images/2.jpg' appears to remove 'images/2.jpg' from the repo altogether, even the current copy after commit 72672204, so that isn't good. – Marcos Apr 11 '14 at 09:31

1 Answer

Well, you've got these images in history, so you need to rewrite history to delete them permanently.

I've written a script which removes a file forever from git (history included), here it is:

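#!/bin/bash
# Usage: git-rm-forever <path>. Rewrites ALL refs, so try it on a backup clone first.
# -d puts the scratch directory on tmpfs (/dev/shm, Linux only) for speed.
git filter-branch -f --prune-empty -d /dev/shm/scratch \
  --index-filter "git rm --cached -f --ignore-unmatch -- '$1'" \
  --tag-name-filter cat -- --all
# Drop the backup refs and reflog entries that still point at the old objects,
# then repack so the now-unreachable blobs are actually deleted.
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --aggressive --prune=now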

You can delete all your files with it and then commit the new versions afterwards.

More information: http://git-scm.com/book/ch6-4.html

P.S. If you want to use wildcards, use some bash magic like: for i in *.jpg; do git-rm-forever "$i"; done
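
Looping like that runs one full history rewrite per file, which could be slow for 200+ images. Since git rm accepts glob pathspecs itself, a single pass might be enough. The following is a rough sketch under that assumption; purge_pattern and PURGE_PATHSPEC are names invented here, and it deliberately skips the -d scratch directory and the cleanup steps from the script above.

```shell
#!/bin/sh
# purge_pattern PATHSPEC (hypothetical helper): remove every path matching
# PATHSPEC from all commits on all refs in ONE filter-branch pass.
# This rewrites history and also drops the current copies, so keep a backup
# of the files you want to re-add afterwards.
purge_pattern() {
  # Hand the pattern to the filter through the environment so that the
  # shell never expands it; git rm does the glob matching itself.
  PURGE_PATHSPEC=$1
  export PURGE_PATHSPEC
  # FILTER_BRANCH_SQUELCH_WARNING only silences the notice printed by
  # newer git versions; older versions ignore it.
  FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --prune-empty \
    --index-filter 'git rm -r -q --cached --ignore-unmatch -- "$PURGE_PATHSPEC"' \
    --tag-name-filter cat -- --all
}
```

Usage would be something like purge_pattern 'images/*.jpg', with the pattern in single quotes. filter-branch refuses to start while there are unstaged changes, so commit or stash first.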

Arenim
  • Thanks; I've tried that site too. Is there a way to ONLY clean out files prior to and including a commit # 3e7f36f0, and ONLY those images/*.jpg files? This is what I'm looking for. Trying to modify this sort of code to do that. Not clean out all versions, permanently. – Marcos Apr 11 '14 at 09:51
  • Sure. No, this is not the ONLY way. You CAN modify `git filter-branch ....` to rewrite only the refs you need to clean rather than all of them. But that magic spell requires too high a level of git wizardry; I can't cast it :) – Arenim Apr 11 '14 at 09:52
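
One possible shape for that spell, offered as an untested-in-production sketch rather than a definitive recipe: keep rewriting all refs, but make the index filter conditional on $GIT_COMMIT (which filter-branch sets to the original commit id), so files are removed only from commits up to and including a cutoff. The names purge_until, PURGE_CUTOFF and PURGE_PATHSPEC are invented for this example.

```shell
#!/bin/sh
# purge_until CUTOFF PATHSPEC (hypothetical helper): remove paths matching
# PATHSPEC only from commits that are ancestors of CUTOFF, CUTOFF included.
# Later commits keep their trees as they are, and no commit is dropped.
purge_until() {
  PURGE_CUTOFF=$(git rev-parse --verify "$1^{commit}") || return 1
  PURGE_PATHSPEC=$2
  export PURGE_CUTOFF PURGE_PATHSPEC
  # merge-base --is-ancestor exits 0 when $GIT_COMMIT is CUTOFF itself or
  # one of its ancestors; for any other commit the filter does nothing.
  FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
    --index-filter '
      if git merge-base --is-ancestor "$GIT_COMMIT" "$PURGE_CUTOFF"; then
        git rm -r -q --cached --ignore-unmatch -- "$PURGE_PATHSPEC"
      fi' \
    --tag-name-filter cat -- --all
}
```

For the case in the question this would be called as purge_until 3e7f36f0 'images/*.jpg'. It only frees space if every commit after the cutoff already carries the optimized files; a later commit that still contains a large blob keeps that blob alive. As with the script in the answer, the space comes back only after the refs/original backup and the reflog are expired and git gc has run, and the rewritten branches then have to be force-pushed.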