
As a continuation of this question, I would like to:

  • keep the last n commits
  • remove the (n + 1)-th and all earlier commits from the history
  • make commit n the root, as in a newly init-ed repository

For example, consider the following commit tree:

1-2-3-4-5

After the next commit, I want it to be:

2-3-4-5-6

However, what happened in 1 should no longer be tracked, i.e. 2 should now be the root. This is needed because there will be a lot of binary files involved, and I don't want git to store what should already be gone. Using git rebase + squash would still keep the files in the history; only the commits are squashed.

The idea is to use git as a periodic backup system, keeping up to the last n commits. In reality, there will be a cron job committing whatever changed every day.

LeleDumbo
  • Git is not a backup tool. Are you sure it's the right tool for this problem? Why not use rsync with rotation? – knittl Sep 11 '14 at 08:56
  • Because we base our idea on VCS. I didn't know that rsync can do backups that don't sacrifice much space, I'll investigate. Thanks. – LeleDumbo Sep 11 '14 at 09:13

2 Answers


I think you'd like to delete the blob objects that are not reachable from any commit object. These objects are called unreachable or dangling objects. Git already provides a good cleanup mechanism for this, called auto gc. You can also run git gc manually. It will compress file versions and remove unreachable objects that are a few months old. The compression probably works better than you think: according to my test, git stores the differences between file versions even for binary files.

There are a few more related commands, including git fsck, git prune, git repack and git prune-packed, if you want more manual and customized behavior.

But my suggestion is simply to set gc.reflogExpireUnreachable and gc.reflogExpire to a shorter period, such as 1 day, run git gc periodically, and let git do the work for you. I'm not sure how practical this is, though, since I haven't tested it.
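A minimal sketch of that setup, assuming the 1-day expiry mentioned above:

```shell
# Expire reflog entries for unreachable commits after 1 day
git config gc.reflogExpireUnreachable "1 day"
git config gc.reflogExpire "1 day"

# Run gc (e.g. from cron), pruning unreachable objects older than 1 day
git gc --prune=1.day.ago
```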

Some references:

http://git-scm.com/book/en/Git-Internals-Maintenance-and-Data-Recovery#Maintenance

Occasionally, Git automatically runs a command called “auto gc”. Most of the time, this command does nothing. However, if there are too many loose objects (objects not in a packfile) or too many packfiles, Git launches a full-fledged git gc command. The gc stands for garbage collect, and the command does a number of things: it gathers up all the loose objects and places them in packfiles, it consolidates packfiles into one big packfile, and it removes objects that aren’t reachable from any commit and are a few months old.

http://git-scm.com/docs/git-gc

git gc runs a number of housekeeping tasks within the current repository, such as compressing file revisions (to reduce disk space and increase performance) and removing unreachable objects which may have been created from prior invocations of git add.

Landys
  • I'm still confused by the definition of "unreachable from any commit". Squashed commits should still reference those deleted files, only merged as one commit. Thus, from my understanding, those files are still reachable. – LeleDumbo Sep 11 '14 at 09:17
  • After squashing into one commit, only the file versions referred to by the remaining commit object are reachable. I.e., if we add a file `test.txt` in commit `a`, delete it in commit `b`, and squash `a` and `b` into `c`, then all blobs for `test.txt` from commits `a` and `b` are "unreachable". Here "unreachable" means referred to only by reflogs. After those expire, gc will remove the blobs. You can use `git fsck --full --unreachable --no-reflogs` to have a look. – Landys Sep 11 '14 at 11:26
  • I think this answer is not the solution, LeleDumbo is trying to delete binary blobs from commit 1, so, he needs to make commit 1 unreachable, I will write an answer to achieve that – dseminara Sep 11 '14 at 17:48
  • @dseminara I think making commit 1 unreachable was already answered in the OP's previous question that he mentioned. That's why I focused on deleting blobs. Anyway, it's not git's common purpose to be used this way, but I just want to say it can be. – Landys Sep 11 '14 at 23:45

I think the best way to achieve this is by creating an orphan branch from 2 and then doing a rebase, like this:

git checkout 2
git checkout --orphan newmaster # creates a new orphan branch with no parents
git commit -C 2 # commits all the contents from 2 using same commit message of 2
git rebase --onto HEAD 2 master # rebase all contents from master to this new branch
git push -f origin master:refs/heads/master # push the new master branch

Note that we are using -f (force) in the last command. This should be the exception, not the rule, and all of it should be done on a "frozen" repo. Commit 1 will now be unreachable, and any blob or content associated with that commit will be removed by git gc if no other references remain (you can run git gc by hand, or it runs automatically depending on the setup of your git server).
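The steps above can also be wrapped in a small script that keeps the last N commits; N, the branch names and the final cleanup are my assumptions, not part of the answer:

```shell
#!/bin/sh
# Sketch: trim "master" down to its last N commits using the
# orphan-branch + rebase approach described above.
N=5
ROOT=$(git rev-parse "master~$((N-1))")   # oldest commit to keep

git checkout "$ROOT"
git checkout --orphan newmaster           # new branch with no parents
git commit -C "$ROOT"                     # re-create that commit as a root
git rebase --onto HEAD "$ROOT" master     # replay the newer commits on top
git branch -D newmaster                   # temporary branch no longer needed
```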

Another option: using git-filter-branch

If your problem is heavy files taking up too much repository space, you don't need to rewrite the history by hand to delete those files from the repository; git-filter-branch is the tool designed for this kind of situation. Here is a basic example:

git filter-branch --tree-filter 'rm path/to/heavyfile; true'

It reconstructs the whole history of the current branch (e.g. master), executing your shell command for each commit; in this case it removes path/to/heavyfile from every commit. Of course you can improve the script, for example by removing entire directories, renaming files or even calling your own external commands.
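If checking out every commit is too slow for a large history, the documented --index-filter variant does the same rewrite against the index only; build/ here is a placeholder path:

```shell
# Remove a whole directory from every commit without touching the
# working tree (faster than --tree-filter for big repositories).
git filter-branch --index-filter \
    'git rm -r --cached --ignore-unmatch build/' HEAD
```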

The best thing about this is that the action can easily be undone if you make a mistake; undoing a filter-branch is as easy as:

git reset --hard HEAD@{1}

More on git-filter-branch: http://git-scm.com/docs/git-filter-branch

More on rewriting history with git: http://git-scm.com/book/en/Git-Tools-Rewriting-History

dseminara