20

I have a large 33 MB file and want to permanently delete its oldest revisions, so that only the latest X revisions are kept. How can I do this?

My bare repository has grown huge because of it.

I have tried the following, but it removes the file entirely:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch big_manual.txt' HEAD

To identify the large files in my repository I use git-large-blob by Aristotle Pagaltzis.
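
For readers without that script, a rough alternative using only built-in Git commands is sketched below. It is not git-large-blob itself, and the function name `largest_blobs` is made up for this example. It lists the biggest blobs reachable from any ref as size in bytes, object id, and path.

```shell
# List the N largest blobs reachable from any ref, biggest first.
# Output columns: size in bytes, object id, path.
# largest_blobs is a hypothetical helper name, not a Git command.
largest_blobs() {
    n="${1:-10}"
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(objectname) %(rest)' |
        sed -n 's/^blob //p' |      # keep blobs only, drop the type column
        sort -k1,1nr |              # numeric sort on size, descending
        head -n "$n"
}
```

Run inside the repository, `largest_blobs 5` prints the five largest blobs, which makes it easy to see whether one file dominates the history.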

neoneye
  • I think it would help if you gave some more information about this file and what you are trying to do. Is this going to be a one off event or do you plan to purge the file and rewrite the repository history regularly? Why are you tracking the file in git if you don't need to keep its history? How big is your bare repository and is it really a problem if it is big? – CB Bailey May 30 '09 at 22:31
  • It's a manual for my program, which I'm writing in Apple Pages (a word processor), and it includes a lot of images. I store it in Git mostly so I can share it between my desktop computer and my laptop, and so I can undo in case something goes wrong. The repository is currently 450 MB. I hesitate to work on the file because I know the repository size increases. Instead of rethinking my backup solution, I thought it would be better to get rid of the oldest revisions. I take a full snapshot of the repository daily and upload it, but the disk quota is 3 GB. – neoneye May 30 '09 at 22:43
  • yes, I am hoping that it's possible to do this from time to time. – neoneye May 30 '09 at 22:49
  • Assuming that the rest of your repository is 'normal' code, I think you should reconsider tracking this file together with the rest of your code. It will cause your repository to increase in size, and if this makes you reluctant to change the file or forces you to rebase your recent branches all the time, then it's probably forcing you into a very painful workflow. – CB Bailey May 30 '09 at 23:10
  • It's not just this file. I also track some images for the website that belongs to the project, and it would be nice to wipe some of their history too. My project has more than 1000 files (h/cpp/mm, png, xml, rb, php) and having everything in one place would be nice. I have already split it up into 4 repositories, but they all refer to each other by version numbers. Splitting it up even further and making a backup solution for untracked files is something I'm not interested in. – neoneye May 30 '09 at 23:19

3 Answers

16

I think you are on the right track with the git filter-branch command you tried. The problem is you haven't told it to keep the file in any commits, so it is removed from all of them. Now, I don't think there is a way to directly tell git-filter-branch to skip any commits. However, since the commands are run in a shell context, it shouldn't be too difficult to use the shell to remove all but the last X number of revisions. Something like this:

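# filter-branch normally evaluates the filter with a POSIX sh, so use the
# portable [ ] test rather than bash's [[ ]]. The counter I carries over from
# one commit to the next because each filter run is eval'ed in the same shell.
KEEP=10 I=0 NUM_COMMITS=$(git rev-list master | wc -l) \
git filter-branch --index-filter \
'if [ "${I}" -lt $((NUM_COMMITS - KEEP)) ]; then
     git rm --cached --ignore-unmatch big_manual.txt;
 fi;
 I=$((I + 1))'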

That would keep big_manual.txt in the last 10 commits.
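
To check the result of such a rewrite, one option is to count how many commits still contain the file. The helper below is only a sketch and `commits_with_path` is a hypothetical name. It relies on `git cat-file -e`, which exits successfully only when the named object exists.

```shell
# Print how many commits reachable from a ref still contain a given path.
# commits_with_path is a hypothetical helper name, not a Git command.
commits_with_path() {
    path="$1"
    ref="${2:-HEAD}"
    count=0
    for commit in $(git rev-list "$ref"); do
        # <commit>:<path> names the blob at that path in that commit's tree
        if git cat-file -e "$commit:$path" 2>/dev/null; then
            count=$((count + 1))
        fi
    done
    echo "$count"
}
```

Running `commits_with_path big_manual.txt` before and after the rewrite shows whether the number dropped as intended. Note that the script above counts commits, not revisions of the file, so commits that never touched the file still use up part of the KEEP budget.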

That being said, as Charles mentioned, I'm not sure this is the best approach, since deleting old versions in effect undoes the whole point of a VCS.

Have you already tried optimizing the git repository with git-gc and/or git-repack? If not, those might be worth a try.
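
One caveat when combining the two: after git filter-branch, the old commits are still referenced by the backup refs under refs/original/ and by the reflogs, so git-gc on its own will not free the space. The sketch below follows the cleanup described in the git-filter-branch documentation; the function name `purge_old_history` is made up, and the operation is destructive.

```shell
# Drop the backup refs that filter-branch leaves under refs/original/,
# expire all reflog entries, and prune the now-unreachable objects.
# Destructive: the pre-rewrite history cannot be recovered afterwards.
# purge_old_history is a hypothetical helper name.
purge_old_history() {
    git for-each-ref --format='%(refname)' refs/original/ |
        while read -r ref; do
            git update-ref -d "$ref"
        done
    git reflog expire --expire=now --all
    git gc --prune=now
}
```

It is safer to run this on a copy of the repository first and compare the size of the object store before and after.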

Dan Moulding
  • This is the solution! It walked through all 312 revisions and discarded the oldest ones, perfect. This was very educational: for loops, rev-list, and calling filter-branch without any commit id, which seems unintuitive (I will have to investigate how that magic works), but it worked. Thank you for that. Sometimes I use git-gc and fsck, but it's not yet something I have automated. Let's not talk about my opinion about VCS :-) – neoneye May 31 '09 at 08:16
  • >>Let's not talk about my opinion about VCS :-) Fair enough :) I'm glad this worked for you. As for the magic of not specifying a revision, git-filter-branch internally calls git-rev-list to get the list of commits to rewrite. It will pass "HEAD" to git-rev-list as a default ref if you don't specify one. So not specifying anything is the same as specifying "HEAD" (as you did in your example). – Dan Moulding May 31 '09 at 16:57
  • Thanks for the script. I made it into a bash script file and found I needed to adjust it slightly ` #! /bin/bash KEEP=10 I=0 NUM_COMMITS=$(git rev-list master | wc -l) \ git filter-branch --index-filter \ 'if [ ${I} -lt $((NUM_COMMITS - KEEP)) ]; then git rm --cached --ignore-unmatch file-to-delete.tar; fi; I=$((I + 1))' ` – David Thomas Feb 14 '12 at 08:11
15

Note: this answer is about shortening the history of a whole project, rather than removing a single file from older history, which is what the question was about!


The simplest way to shorten the history of a whole project with git filter-branch is to use the grafts mechanism (see the repository layout documentation):

$ echo "$commit_id" >> .git/info/grafts

where $commit_id is the commit that you want to be the root (first commit) of the new repository. Check with "git log" or a graphical history viewer such as gitk that the history looks the way you want, then run "git filter-branch --all"; the use of grafts is described in the git-filter-branch documentation.

Or you can use a shallow clone via the --depth <depth> option of git clone.
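
A minimal sketch of the shallow-clone variant follows. The wrapper name `shallow_clone` and its argument order are assumptions made for this example; the only Git feature it uses is `git clone --depth`.

```shell
# Clone only the most recent <depth> commits of a repository.
# For a repository on the local disk, pass a file:// URL: Git ignores
# --depth when cloning from a plain local path.
# shallow_clone is a hypothetical wrapper name.
shallow_clone() {
    depth="$1"
    src="$2"
    dest="$3"
    git clone --quiet --depth "$depth" "$src" "$dest"
}
```

For example, `shallow_clone 10 file:///path/to/repo.git repo-shallow` would produce a clone whose history is cut off ten commits back on a linear history (the path here is a placeholder).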



You can also use grafts to remove part of the history of a single file (what was originally requested) using the steps described below. This solution consists of more steps than the one proposed by Dan Moulding, but each step is simpler, and you can check the intermediate results using "git log" or a graphical history viewer.

  1. First, select the point before which you want the file removed, and mark it by creating a branch there. For example, if you want the file to appear for the first time in commit f020285b and to be removed from all its ancestors, mark its parent (assuming this is an ordinary, non-merge commit) using

    $ git branch cleanup f020285b^
    
  2. Second, remove the file from the history reachable from cleanup (i.e. f020285b^) using git-filter-branch, as shown in the "Examples" section of the git-filter-branch manpage:

    $ git filter-branch --index-filter 'git rm --cached --ignore-unmatch big_manual.txt' cleanup
    

    If you also want to remove all commits that changed only the removed file, you can additionally pass the --prune-empty option to git-filter-branch.

  3. Next, join the rewritten part of the history with the rest of the history using the grafts mechanism:

    $ echo $(git rev-parse f020285b) $(git rev-parse cleanup) >> .git/info/grafts
    

    Then you can examine the history to check that it is joined correctly.

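  4. Last, make the grafts permanent (this makes all grafts permanent, but let's assume here that you don't use grafts otherwise) using git-filter-branch. The -f option is needed because the run in step 2 left a backup under refs/original/, and git-filter-branch refuses to start while one exists:

    $ git filter-branch -f cleanup..HEAD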
    

    and remove the grafts file (it is no longer needed) and the cleanup branch:

    $ rm .git/info/grafts
    $ git branch -d cleanup
    

Final note: if you remove part of the history of some file, you had better make sure that the project still makes sense without this file (for example, that it still compiles correctly).
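
The four steps above can be sketched as a single function. One deliberate substitution: newer Git versions deprecate the .git/info/grafts file in favour of `git replace --graft`, so this sketch uses a replacement ref where the steps use a graft. The function name `trim_file_history` and its arguments are hypothetical, and it assumes a linear history, a clean working tree, and that no branch named cleanup exists yet.

```shell
# Sketch of steps 1-4 with `git replace --graft` instead of .git/info/grafts.
#   $1  oldest commit in which the file should still appear (non-merge)
#   $2  path to strip from every commit before $1
# trim_file_history is a hypothetical helper name.
trim_file_history() {
    first_keep=$(git rev-parse --verify "$1^{commit}") || return 1
    path="$2"
    # 1. mark the parent of the first commit that keeps the file
    git branch cleanup "$first_keep^" &&
    # 2. remove the file from everything reachable from cleanup
    git filter-branch --index-filter \
        "git rm --cached --ignore-unmatch -q -- '$path'" cleanup &&
    # 3. re-parent first_keep onto the rewritten cleanup branch
    git replace --graft "$first_keep" cleanup &&
    # 4. make the replacement permanent (-f: a backup already exists
    #    under refs/original/ from the first run), then tidy up
    git filter-branch -f cleanup..HEAD &&
    git replace -d "$first_keep" &&
    git branch -d cleanup
}
```

A call might look like `trim_file_history f020285b big_manual.txt`. The backup refs under refs/original/ are left in place, so the old history stays recoverable until they are deleted.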

Jakub Narębski
  • yeah, the grafts mechanism indeed seems to be the intended way to do it. Thank you for making me aware of this. Unfortunately I don't have time to experiment with it today. – neoneye May 31 '09 at 08:31
  • The grafts method could work in some cases, but it will get rid of the history for all files. In this case, neoneye wants to remove history for only *some* files. So I'm not sure grafts would be a suitable solution. And shallow clone is out of the question because shallow repositories are crippled (see git-clone docs for a description of their limitations). – Dan Moulding May 31 '09 at 23:41
  • Dan, yes, good point: I need a solution that only removes history for a single file. OK, so I won't do any experimenting with grafts. – neoneye Jun 01 '09 at 00:49
3

You might want to consider using git submodules. That way you can keep the images and other big files in another git repository, and the repository that holds the source code can refer to a particular revision of that other repository.

That will help you keep the repository revisions in sync, because the parent repository contains a link to a particular revision of the sub repository. It will also let you remove or rebase old revisions in the sub repository without affecting the parent repository where your source code is. Removing old revisions in a sub repository will not mess up the history of the parent repository, because you only update which sub repository revision the link in the parent repository points to.

Esko Luontola