I have a web application and use git to not only manage source control but also deploy changes. I push the changes to the remote repo on github and my webserver has a webhook, which then updates according to these changes.

Now I noticed that my local git repository is around 9GB. I made a fresh clone from github and noticed that even then the repo is roughly 1.5GB.

I am pretty sure most of this is unnecessary bloat from the initial development phase. I would like to get rid of it to free up disk space. I have googled a bit, but only find relatively complicated solutions. My scenario is one branch, one developer, lots of tiny commits.

Is there a simple way to get rid of changes that are older than, e.g., 12 months, so that disk space is freed up both locally and remotely?

Thanks

Joseph
  • …have you run `git gc` lately? Are you using LFS? – Dai Oct 08 '22 at 05:46
  • One cannot remove commits that `master` depends on. – Mateen Ulhaq Oct 08 '22 at 05:47
  • Possible duplicate of this question: https://stackoverflow.com/questions/24495239/git-pull-ignores-depth-how-not-to-pull-the-entire-history – Ashley Oct 08 '22 at 05:47
  • It's also certainly worth looking at the answer to this question here: https://stackoverflow.com/questions/23986685/pull-updates-with-git-after-cloned-with-depth-1 Which explains how to pull with a given depth, so that you don't pull all of the history. – Ashley Oct 08 '22 at 05:51
  • @Dai didn't know that existed - freed up 3GB! – Joseph Oct 08 '22 at 05:53
  • @Ashley but pulling a specific depth, implies that the storage is still being used on the remote – Joseph Oct 08 '22 at 05:54
  • @MateenUlhaq would the only solution be to just start the repo from scratch? Can I then compress all old commits into one? – Joseph Oct 08 '22 at 05:54
  • @Joseph Why should remote repo size be a huge concern? …unless you have an awful git host that charges storage per-byte… – Dai Oct 08 '22 at 05:55
  • I don't think it's a very good idea (and I'm not sure if it's even possible) to delete git history from the remote, since it's very likely to break things. Also as Dai said the remote size shouldn't be an issue really, especially since you are using github. – Ashley Oct 08 '22 at 05:59
  • @Dai because I like to keep things tidy. I find the attitude to not care about disk usage wasteful. Same reason I prune my photo library on my phone. Why save things I no longer need? – Joseph Oct 08 '22 at 06:00
  • @Joseph “Why save things I no-longer need” - **weeelllll** consider that the reason we all use source-control today is because _we can’t predict when we need to go back-in-time_ - so it’s better just to keep everything from day 1. Plus `git blame` wouldn’t be much use if history was truncated. – Dai Oct 08 '22 at 06:02
  • Git relies on a git history to function properly, if you don't want to be preserving old versions of your code then perhaps it might be worth looking into other version control solutions. Having said that, I think all the big version control solutions out there all try very hard to never lose any data. – Ashley Oct 08 '22 at 06:02
  • If you aren't expecting anyone that had previously cloned your repo to pull or push, then you can certainly secretly squash and purge your history. Otherwise, it is impossible unless you can generate a small commit with the exact SHA hash required. – Mateen Ulhaq Oct 08 '22 at 06:03
  • If it's really important to you that you aren't taking up too much space on github's servers, however, you could try squashing all of your old commits into a single commit. Here's a guide on it that looks pretty good: https://www.cloudbees.com/blog/git-squash-how-to-condense-your-commit-history but unfortunately it is a little complicated. – Ashley Oct 08 '22 at 06:05
  • I am and will always be the only developer. So other repo clones is not a problem. And as far as "just keep old data around - you never know". This repo is not that mission critical - it is totally iterative - I have never gone back and retrieved old source code in this project. I for my part want to reduce my environmental impact - no matter how small - things add up. https://www.bbc.com/future/article/20200305-why-your-internet-habits-are-not-as-clean-as-you-think – Joseph Oct 08 '22 at 06:09
  • @Ashley thanks will look into squashing everything into one commit - tower makes that fairly easy. – Joseph Oct 08 '22 at 06:09
  • If you're literally never going back to old versions, why use version control at all? Why not just put your code in a google drive or something? – Ashley Oct 08 '22 at 06:11

2 Answers

To remove history older than X, you need to rewrite the history of your repo, and perhaps the most efficient way to rewrite a large repo is git-filter-repo. Note that git-filter-repo is a Python script, so you'll also need Python if you don't have it installed already.

Once you have git-filter-repo ready to go, the steps to answer your question are rather simple, and a similar scenario is even described in the Git manual for git-replace.

Basically, the steps are:

  1. Make a new parentless commit (a.k.a. root commit) that is equivalent to the state of the first commit you wish to keep.
  2. Replace that commit in your current history with the new root commit.
  3. Make it permanent using git-filter-repo.
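
For the "older than 12 months" case from the question, the commit for step 1 can probably be picked by date rather than by eye. The following is only a sketch, assuming a linear single-branch history whose commit dates increase over time. The throwaway repo, the `commit_at` helper, and the commit messages are made up for the demo; in a real repo only the `git rev-list` line is needed.

```shell
#!/bin/sh
set -e

# Throwaway demo repository (hypothetical; not the repo from the question).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

now=$(date +%s)
day=86400

# Hypothetical helper: create an empty commit dated <days-ago> days in the past.
commit_at() {
    ts="$((now - $1 * day)) +0000"
    GIT_AUTHOR_DATE="$ts" GIT_COMMITTER_DATE="$ts" \
        git commit -q --allow-empty -m "$2"
}

commit_at 1000 "three years back"
commit_at 700  "two years back"
commit_at 180  "six months back"
commit_at 30   "one month back"

# rev-list prints newest first, so the last line is the oldest commit
# that is still younger than 12 months: the first commit to keep.
X=$(git rev-list --since="12 months ago" HEAD | tail -n 1)
subject=$(git log -1 --format=%s "$X")
echo "first commit to keep: $X ($subject)"
```

Note that `--since` filters on the committer date, so a history with rebased or out-of-order dates may need a manual check of the chosen commit.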

For example, suppose the first commit you wish to keep has a commit hash of X:

  1. echo 'Truncate history to single commit' | git commit-tree X^{tree}

    The output of the above command will be a new commit hash, let's call it Y.

  2. git replace X Y

  3. git filter-repo --force

Note: git-filter-repo only touches your local repo. If you're happy with your new re-written repo you can re-add your remote and push it out.
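
Put together, the three steps can be tried on a scratch repo before touching the real one. This is a sketch under assumptions: the demo repo, the file `app.txt`, and the choice of the third commit as X are invented for illustration, and the filter-repo step is skipped when `git-filter-repo` is not found on the PATH.

```shell
#!/bin/sh
set -e

# Scratch repository with five small commits (illustrative only).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"
for i in 1 2 3 4 5; do
    echo "version $i" > app.txt
    git add app.txt
    git commit -q -m "commit $i"
done

# Assume the third commit is the first one we want to keep.
X=$(git rev-parse HEAD~2)

# Step 1: new parentless commit with the same tree as X.
Y=$(echo 'Truncate history to single commit' | git commit-tree "$X^{tree}")

# Step 2: tell git to use Y wherever it would read X.
git replace "$X" "$Y"

before=$(git --no-replace-objects rev-list --count HEAD)  # real history
after=$(git rev-list --count HEAD)                        # history as seen through the replacement

# Step 3: make it permanent (only if git-filter-repo is available).
if command -v git-filter-repo >/dev/null 2>&1; then
    git filter-repo --force
fi

final=$(git rev-list --count HEAD)
content=$(cat app.txt)
echo "commits before: $before, with replacement: $after, final: $final"
```

The working tree contents should be untouched by all of this; only the commits behind X disappear. Until step 3 has run, the old commits still exist in the object store and are merely hidden by the replacement.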

TTT

If you can pick a commit to start from (and forget everything behind that single commit), I can offer you a script that can, let's call it, "regrow" all commits past that commit, and it does so quite fast. Of course, you would be rewriting history; I just want to make that clear.

https://github.com/eantoranz/git/blob/replay/contrib/replay.

The way to use it would be:

  • Pick the commit you would like your rewritten history to start from (everything behind it will be forgotten) and create a branch on it:
git branch oldbase <some-commit-id>

Then create an orphan branch from that commit, so that all previous history is left behind:

git checkout --orphan newbase oldbase
git commit -C oldbase # create a commit reusing the message and author of oldbase

This is where the script comes into play:

the-replay-script --new-base newbase --old-base oldbase --tip master

That will replay all commits in the oldbase..master range on top of the newbase commit and print a single commit ID at the end. Take a look at that commit (check it out, log it, etc.). When you are certain that it is what you would like to have as your new master:

git branch -f master <the-commit-written-by-the-script>
git checkout master

And feel free to force-push it wherever you would like.
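
If you would rather not use an external script, stock `git rebase --onto` can replay the same oldbase..master range, most likely more slowly on a large history. To be clear, this swaps the replay script for a plain rebase and is not what the script itself does. A sketch on a scratch repo, where the repo, `app.txt`, and the choice of the third commit are made up for the demo:

```shell
#!/bin/sh
set -e

# Scratch repository with five small commits (illustrative only).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"
for i in 1 2 3 4 5; do
    echo "version $i" > app.txt
    git add app.txt
    git commit -q -m "commit $i"
done

# Name of the branch being rewritten (master or main, depending on git config).
tip=$(git symbolic-ref --short HEAD)

# Mark the commit the new history should start from (here: the third commit).
git branch oldbase HEAD~2

# Orphan branch with the same content and message as oldbase, but no parents.
git checkout -q --orphan newbase oldbase
git commit -q -C oldbase

# Replay oldbase..tip on top of newbase. Unlike the script, rebase moves
# the branch itself, so no separate "git branch -f" is needed afterwards.
git rebase -q --onto newbase oldbase "$tip"

count=$(git rev-list --count HEAD)
content=$(cat app.txt)
echo "branch $tip now has $count commits"
```

Because newbase has exactly the same tree as oldbase, the replayed commits should apply without conflicts on a linear history.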

eftshift0