
My Git repo has hundreds of gigabytes of data (database backups), so I'm trying to remove old, outdated backups, because they make everything larger and slower. So I naturally need something that's fast; the faster, the better.

How do I squash (or just plain remove) all commits except for the most recent ones, and do so without having to manually squash each one in an interactive rebase? Specifically, I don't want to have to use

git rebase -i --root

For example, I have these commits:

A .. B .. C ... ... H .. I .. J .. K .. L

What I want is this (squashing everything in between A and H into A):

A .. H .. I .. J .. K .. L

Or even this would work fine:

H .. I .. J .. K .. L

There is an answer on how to squash all commits, but I want to keep some of the more recent commits, and I don't want to squash those either. (In particular, I need to keep the two most recent commits.)

(Edit, several years later. The right answer to this question is to use the right tool for the job. Git is not a very good tool to store backups, no matter how convenient it is. There are better tools.)

sanmai
    Hundreds of GB in a git repo? This sounds like a bad idea... – nneonneo Jun 11 '14 at 02:09
  • Can you give an example of what you'd do by hand? – nneonneo Jun 11 '14 at 02:09
  • "squash" and "remove" are rather different operations; squashing keeps the changes and removing would discard the changes (i.e. rebase your recent changes onto some older point). – M.M Jun 11 '14 at 02:11
  • @MattMcNabb right, so be it `kill` instead; what I mean is that I don't care what happens to them, I only need the data; e.g. if we take a snapshot of commit 10004, remove all commits before it, and make commit 10004 a root commit, I'll be just fine – sanmai Jun 11 '14 at 02:16
  • @nneonneo usual interactive rebase stuff – sanmai Jun 11 '14 at 02:23
  • @sanmai: I mean, what would you edit the rebase script to be? – nneonneo Jun 11 '14 at 02:24
  • Having a lot of commits won't necessarily bloat the size of your Git repo. Git is very efficient at compressing text-based files. Are you sure that the number of commits is the actual problem that leads to your large repo size? A more likely candidate is that you have too many binary assets versioned, which Git doesn't compress as well (or at all) compared to plain text files. –  Jun 11 '14 at 03:09
  • There's got to be a better canonical question than this: [Remove old binary revisions from git and reduce size of git repository](http://stackoverflow.com/q/14284370/456814). Unfortunately this is a common problem, and so there's a lot of duplicates lying around. –  Jun 11 '14 at 03:41
  • @sanmai you'll have to more clearly define what you mean by "automatically". Do you mean you want to remove the commits using a single command? However, that is a rather trivial concern given that you haven't really explained in what way the size of your repo is bloated. What is it bloated with? Binary files? As I've already stated, Git is pretty good at compressing plain text files. Throwing away perfectly good history because you versioned binary files might not always be the best decision. –  Jun 11 '14 at 04:31
  • @Cupcake, sure, unless you have hundreds of gigabytes of them; `git gc` took hours before I began ripping old commits – sanmai Jun 11 '14 at 04:33
  • @sanmai a typical (but probably not the only way) to remove binary files is using `git filter-branch` with an efficient `--index-filter` command. If you use a good index filter, the operation should be able to run pretty fast. –  Jun 11 '14 at 04:34
  • @Cupcake I need my files, else why would I add them to git in the first place? – sanmai Jun 11 '14 at 04:34
  • @sanmai in general, Git is ill-suited for versioning binary files, because of the eventual size bloat problem. I've heard that some people use [`git annex`](https://git-annex.branchable.com/). You could also possibly look into [Managing large binary files with git](http://stackoverflow.com/q/540535/456814). Or you could just keep nuking your old history when your repo gets too big. Up to you. –  Jun 11 '14 at 04:38
  • Related: [Completely remove files from Git repo and remote on GitHub](http://stackoverflow.com/q/5563564/456814). –  Jun 11 '14 at 04:50
  • @Cupcake I need to keep the first two commits, so it isn't a duplicate even near – sanmai Jul 23 '14 at 08:44
  • @Cupcake I've added an example. Also I need this done *automatically*. The question you're referring to does that *by hand*, whereas in my question I explicitly say that *I don't want to do this by hand*. – sanmai Jul 23 '14 at 12:23
  • @Cupcake the other question is about `--root` option for git-rebase, so this question *is not the [same question](http://stackoverflow.com/help/duplicates)*, hence it is not a duplicate. – sanmai Jul 23 '14 at 12:36
  • @Cupcake thank you for your help and generosity. I'll see if your answer fits soon enough – sanmai Jul 23 '14 at 14:09
  • If anyone is looking for the **interactive** way to squash the first X commits of their commit history, then please see [Combine the first two commits of a Git repository?](http://stackoverflow.com/q/435646/456814). –  Jul 23 '14 at 14:21

3 Answers


The original poster comments:

if we take a snapshot of a commit 10004, remove all commits before it, and make commit 10004 a root commit, I'll be just fine

Here is one way to do this, assuming your current branch is called branchname. I like to use a temporary tag whenever I do a large rebase, both to double-check that there were no changes and to mark a point I can reset back to if something goes wrong (not sure if this is standard procedure, but it works for me):

git tag temp

git checkout 10004
git checkout --orphan new_root
git commit -m "set new root 10004"

git rebase --onto new_root 10004 branchname

git diff temp   # verification that it worked with no changes
git tag -d temp
git branch -D new_root

To get rid of the old branch you'll need to delete all tags and branches pointing into it; then

git prune
git gc

will clean it from your repo.
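In practice, `git prune` and `git gc` may reclaim nothing at first, because reflog entries still reference the old commits; expiring the reflogs first lets gc actually collect the objects. A minimal sketch in a throwaway repo (the repo name, file name, and identity here are illustrative, not from the question):

```shell
set -e
# Build a two-commit demo repo, then abandon the tip commit.
rm -rf demo2 && git init -q demo2
git -C demo2 config user.email you@example.com
git -C demo2 config user.name you
echo a > demo2/f && git -C demo2 add f && git -C demo2 commit -qm first
echo b > demo2/f && git -C demo2 commit -qam second
git -C demo2 reset -q --hard HEAD~1           # abandon the tip commit
git -C demo2 reflog expire --expire=now --all # drop reflog references to it
git -C demo2 prune                            # delete the now-unreachable objects
git -C demo2 gc --quiet
```

After this, only the surviving commit remains reachable and the abandoned objects are gone from the object store.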

Note that you'll temporarily have two copies of everything, until you have gc'd, but that is unavoidable; even if you do a standard squash and rebase you still have two copies of everything until the rebase finishes.

M.M
  • I have three comments. First, you can also use a simple branch to save your previous state instead of a light-weight tag (I think a light-weight tag is just another reference, like a branch). You can also use `@{1}` directly after the rebase to refer to the 1st previous position of `` as well. Second, another way to do this, instead of using an orphan branch, is to just use a hard reset, followed by a soft reset to the root, commit, then rebase the other commits on top again. –  Jun 11 '14 at 03:13
  • Finally, but most importantly, if the goal is to reduce the size of the repo, the total number of commits is unlikely to be the source of the bloat, [as I explained above](http://stackoverflow.com/questions/24153548/automatically-squash-or-remove-all-commits-except-for-a-number-of-newest#comment37275724_24153548). –  Jun 11 '14 at 03:13

Counting implementation time, the fastest approach is almost certainly going to be grafts plus a filter-branch, though you might get faster execution from a handrolled commit-tree sequence working off rev-list output.

Rebase is built to apply changes on different content. What you're doing here is preserving contents and intentionally losing the change history that produced them, so pretty much all of rebase's most tedious and slow work is wasted.

The payload here is, working from your picture,

echo `git rev-parse H; git rev-parse A` > .git/info/grafts  
git filter-branch -- --all

Documentation for git rev-parse and git filter-branch.
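For reference, the handrolled commit-tree variant mentioned above can be sketched like this: create a new root carrying the kept commit's tree, then replay each later commit's tree with commit-tree, never running rebase or filter-branch at all. The demo repo and commit letters here are illustrative, not taken from your repo:

```shell
set -e
# Throwaway demo repo with five commits A..E; we keep C (as the new root), D, E.
rm -rf demo && git init -q demo
git -C demo config user.email you@example.com
git -C demo config user.name you
for n in A B C D E; do
  echo "$n" > demo/file
  git -C demo add file
  git -C demo commit -qm "$n"
done
H=$(git -C demo rev-parse HEAD~2)    # commit C becomes the new root
TIP=$(git -C demo rev-parse HEAD)    # commit E, the current tip
# New root commit carrying C's tree, then replay D and E via commit-tree:
new=$(git -C demo commit-tree -m "new root: snapshot of C" "$H^{tree}")
for c in $(git -C demo rev-list --reverse "$H..$TIP"); do
  msg=$(git -C demo log -1 --format=%B "$c")
  new=$(git -C demo commit-tree -m "$msg" -p "$new" "$c^{tree}")
done
git -C demo update-ref "$(git -C demo symbolic-ref HEAD)" "$new"
git -C demo rev-list --count HEAD    # 3
```

Because commit-tree only writes commit objects pointing at existing trees, no diffs are computed anywhere, which is why this can beat both rebase and filter-branch on large histories.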

Filter-branch is very careful to be recoverable after a failure at any point, which is certainly the safest behavior, but it's only really helpful when recovery by simply redoing the operation wouldn't be faster and easier. Failures being rare and restarts usually being cheap, the thing to do is an un"safe" but very fast operation that is all but certain to work. For that, the best option here is to do it on a tmpfs (the closest equivalent I know of on Windows would be a ramdisk like ImDisk), which will be blazing fast and won't touch your main repo until you're sure you've got the results you want.

So on Windows, say T:\wip is on a ramdisk; note that the clone here copies nothing. As well as reading the docs on git clone's --shared option, do examine the clone's innards to see the real effect; it's very straightforward.

# switch to a lightweight wip clone on a tmpfs
git clone --shared --no-checkout . /t/wip/filterwork
cd !$

# graft out the unwanted commits
echo `git rev-parse $L; git rev-parse $A` >.git/info/grafts
git filter-branch -- --all

# check that the repo history looks right
git log --graph --decorate --oneline --all

# all done with the splicing, filter-branch has integrated it
rm .git/info/grafts

# push the rewritten histories back
git push origin --all --force

There are enough possible variations on what you might be wanting to do and what might be in your repo that almost any of the options on these commands might be useful. The above is tested and will do what it says it does, but that might not be exactly what you want.

bsvingen
jthill
  • I took the links out from your code because it looked like they were just syntax highlighted, and it wasn't obvious that they were links. –  Jul 23 '14 at 19:14

An XY Problem

Note that the original poster has an XY problem, where he's trying to figure out how to squash his older commits (the Y problem), when his real problem is actually trying to reduce the size of his Git repository (the X problem), as I've mentioned in the comments:

Having a lot of commits won't necessarily bloat the size of your Git repo. Git is very efficient at compressing text-based files. Are you sure that the number of commits is the actual problem that leads to your large repo size? A more likely candidate is that you have too many binary assets versioned, which Git doesn't compress as well (or at all) compared to plain text files.

Despite this, for the sake of completeness, I will also add an alternative solution to the Y problem, in addition to Matt McNabb's answer.

Squashing Hundreds or Thousands of Old Commits

As the original poster has already noted, using an interactive rebase with the --root flag can be impractical when there are many commits (numbering in the hundreds or thousands), particularly since the interactive rebase won't run efficiently on such a large number of them.

As Matt McNabb pointed out in his answer, one solution is to use an orphan branch as a new (squashed) root, then to rebase on top of that. Another solution is to use a couple of resets of the branch to achieve the same effect:

# Save the current state of the branch in a couple of other branches
git branch beforeReset
git branch verification

# Also mark where we want to start squashing commits
git branch oldBase <most_recent_commit_to_squash>

# Temporarily remove the most recent commits from the current branch,
# because we don't want to squash those:
git reset --hard oldBase

# Using a soft reset to the root commit will keep all of the changes
# staged in the index, so you just need to amend those changes to the
# root commit:
git reset --soft <root_commit>
git commit --amend

# Rebase onto the new amended root,
# starting from oldBase and going up to beforeReset
git rebase --onto master oldBase beforeReset

# Switch back to master and (fast-forward) merge it with beforeReset
git checkout master
git merge beforeReset

# Verify that master still contains the same state as before all of the resets
git diff verification

# Cleanup
git branch -D beforeReset oldBase verification

# As part of cleanup, since the original poster mentioned that
# he has a lot of commits that he wants to remove to reduce
# the size of his repo, garbage collect the old, dangling commits too
git gc --prune=all

The --prune=all option to git gc ensures that all dangling commits are garbage collected, not just the ones older than two weeks, which is git gc's default.
