
I have a project called geoplot that does geospatial plotting in Python. The code for it is distributed via git on GitHub. You can check it out here.

As part of the development process for this package, I uploaded and stored in the geoplot repo a folder called data/, which contained a large number of data files in various formats. These data files were used to populate the examples in the accompanying example gallery.

However, these files bloat the repository to roughly 150 MiB (issue). That is clearly far too much, and it's time for me to get rid of them.

The problem is that I need not just to remove these files from the current HEAD; I also need to scrub them out of the entire git history. I tried a manual approach using git rebase, which didn't work. Then I tried the BFG Repo-Cleaner tool, as recommended in the canonical SO question on the matter.

BFG got rid of the files all right: they no longer exist anywhere in the history. However, the size of the repo (as seen when running git clone https://github.com/ResidentMario/geoplot.git) did not go down at all!

Here is what I tried (minus printouts):

java -jar ../bfg-1.12.15.jar --delete-folders "data" .
git reflog expire --expire=now --all && git gc --prune=now --aggressive
git push --set-upstream https://github.com/ResidentMario/geoplot.git master --force

The full printout is in an issue on GitHub.

What, if anything, did I do wrong? How do I diagnose the source of and expunge this wasted space?

anthony sottile
Aleksey Bilogur

2 Answers


I did mention reflog and gc back in 2010, but also removing old objects.
(Note: gc should be followed by a repack)

First, check whether a fresh clone of your repo still has the same size.
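A minimal sketch of that check, assuming a POSIX shell with git on the PATH. The helper name fresh_clone_size is hypothetical (it is not part of git or BFG); it clones into a scratch directory, so objects that linger only in your local working copy do not distort the number.

```shell
# Hypothetical helper: clone a repository into a scratch directory and
# report how much disk space its object database occupies.
fresh_clone_size() {
    repo_url="$1"
    scratch="$(mktemp -d)" || return 1
    # A bare clone fetches all branches and tags but skips the checkout.
    if git clone --quiet --bare "$repo_url" "$scratch/clone.git"; then
        # -v prints detailed counts, -H uses human-readable sizes.
        git -C "$scratch/clone.git" count-objects -vH
        status=$?
    else
        status=1
    fi
    rm -rf "$scratch"
    return "$status"
}

# Usage, with the URL from the question:
# fresh_clone_size https://github.com/ResidentMario/geoplot.git
```

The size-pack line of the output is the figure to compare before and after the cleanup.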

As the OP Aleksey Bilogur mentions in the comments:

  • you need to make sure your tags are not referencing the old data, and then you need to force-push all the tags and branches as well (not just master)

    git push --tags origin --force
    
  • generated data must be removed from the repo history.
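Both points can be checked from the command line. The following is a sketch, assuming a POSIX shell; the helper names refs_with_path and largest_blobs are hypothetical, and the path data is the folder from the question. The first helper lists branches and tags whose history still touches a path, the second lists the largest blobs reachable from any ref, which is one way to spot regenerated files that keep the repo large.

```shell
# Hypothetical helper: print every branch or tag whose history still
# contains a commit touching the given path.
refs_with_path() {
    repo="$1"
    path="$2"
    git -C "$repo" for-each-ref --format='%(refname)' refs/heads refs/tags |
    while read -r ref; do
        # rev-list -1 prints the newest commit reachable from the ref
        # that touched the path, or nothing if there is none.
        if [ -n "$(git -C "$repo" rev-list -1 "$ref" -- "$path")" ]; then
            echo "$ref"
        fi
    done
}

# Hypothetical helper: print "<size in bytes> <path>" for the N largest
# blobs reachable from any ref (default N = 10).
largest_blobs() {
    repo="$1"
    count="${2:-10}"
    git -C "$repo" rev-list --objects --all |
        git -C "$repo" cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        sed -n 's/^blob //p' |
        sort -rn |
        head -n "$count"
}

# Usage, run against a local clone:
# refs_with_path . data
# largest_blobs . 20
# After rewriting, force-push every branch as well as every tag:
# git push origin --force --all
# git push origin --force --tags
```

Any ref that refs_with_path still prints keeps the old objects alive on the remote until it is rewritten or deleted and pushed again.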

VonC
  • I'm following the behavior described in the BFG docs (which is to say, I have little idea what these commands do, exactly). The repo still unpacks to the same total size; see [here](https://github.com/ResidentMario/geoplot/issues/37#issuecomment-330067839). – Aleksey Bilogur Sep 17 '17 at 18:46
  • How about the clone: does a fresh clone have the same size? – VonC Sep 17 '17 at 18:56
  • A fresh `git` clone is ~100 MiB. Anthony [suggested recreating the tags](https://github.com/ResidentMario/geoplot/issues/37), which helped, bringing the size of a tag down to 24 MiB. – Aleksey Bilogur Sep 17 '17 at 22:56
  • However, now that I've handled the tags, I believe I know what the issue is: there are many example images in the repo that have been regenerated a few times and that also need to be cleaned out. – Aleksey Bilogur Sep 17 '17 at 22:58
  • @AlekseyBilogur OK. I have included your comments in the answer for more visibility. That is good advice that will help others. – VonC Sep 18 '17 at 04:37

This sounds like an issue that could be solved without external tools, by leveraging filter-branch.

If you want to remove all history of the data directory, you can run the following from the root of your repo.

git filter-branch --index-filter 'git rm --cached --ignore-unmatch -r path/to/data' HEAD

That will change every commit in the ancestry of your current HEAD pointer. You would then have to update all other branches and tags to these newly created commits to completely remove the baggage from your repo.
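As a sketch of doing that in one pass rather than ref by ref: filter-branch accepts a rev-list specification after `--`, so `-- --all` rewrites every branch, and `--tag-name-filter cat` moves each tag to its rewritten commit under the same name. The helper names below are hypothetical, and the quoting assumes the path contains no single quotes. Note that recent Git versions print a warning recommending git filter-repo when filter-branch runs.

```shell
# Hypothetical helper: remove a path from the history of every branch
# and tag in the repository, not just the ancestry of HEAD.
purge_path_everywhere() {
    repo="$1"
    path="$2"
    (
        cd "$repo" &&
        git filter-branch --force \
            --index-filter "git rm --cached --ignore-unmatch -r -q -- '$path'" \
            --tag-name-filter cat \
            -- --all
    )
}

# Hypothetical helper: filter-branch keeps the pre-rewrite refs under
# refs/original/; they pin the old objects until they are deleted.
drop_filter_branch_backups() {
    repo="$1"
    git -C "$repo" for-each-ref --format='%(refname)' refs/original |
    while read -r ref; do
        git -C "$repo" update-ref -d "$ref"
    done
}

# Usage, from a clean working tree:
# purge_path_everywhere . data
# drop_filter_branch_backups .
# git reflog expire --expire=now --all && git gc --prune=now --aggressive
```

The rewritten branches and tags still have to be force-pushed, and existing clones need to be re-cloned, before the old objects stop circulating.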

Zach Olivare