19

I have a repository for storing large binary files (tifs, jpgs, pdfs) that is growing pretty large. A fair number of files are also created, removed, and renamed, and I don't care about the individual commit history. This question is somewhat simplified because I'm dealing with a repository that has no branches and no tags.

I'm curious if there's an easy way to remove some of the history from the system to save space.

I found an old thread on the git mailing list but it doesn't really specify how to use this (i.e. what the $drop is):

git filter-branch --parent-filter "sed -e 's/-p $drop//'" \
        --tag-name-filter cat -- \
        --all ^$drop 
greggles
  • Curious: from your 10 GB project, how much space were you able to save? 2 MB? 25 MB? Or something like 200 MB?! – mfaani Sep 12 '17 at 19:01
  • In my case, 90% of the files in the repository were still needed, so it only saved ~10% of space. – greggles Sep 12 '17 at 19:20
  • You mean you saved 1 GB?! Or 10% of the metadata related to git? Which was how much? – mfaani Sep 12 '17 at 20:09
  • 1
    Yes, from 10GB it saved 1GB. But the amount saved will depend greatly on how many files have been modified or deleted in your repo. In some repos it might remove 99%; in others, 0%. – greggles Sep 12 '17 at 23:13

5 Answers

12

You could always just delete .git and do a fresh git init with one initial commit. This will, of course, remove all commit history.
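
A minimal sketch of that approach, run in a throwaway directory so it is safe to try. The file name `scan.tif`, the `backup` directory, and the demo identity are made up for illustration; in a real repository you would only need the `mv`, `git init`, `git add`, and `git commit` steps, and pushing the result to a remote would be a separate, forced push.

```shell
set -e
# Demo identity so the commits work on a machine without git config (illustrative).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# Build a small repository with some history to throw away.
work=$(mktemp -d)
cd "$work"
mkdir repo backup
cd repo
git init -q
for i in 1 2 3; do
    echo "version $i" > scan.tif
    git add scan.tif
    git commit -q -m "revision $i"
done

# Archive the old history instead of deleting it outright.
mv .git ../backup/old.git

# Start over: one commit that holds only the current files.
git init -q
git add .
git commit -q -m "initial commit"

commits=$(git rev-list --count HEAD)
echo "commits after reset: $commits"
```

The working files are untouched; only the history is replaced, and the archived `old.git` can be deleted once you are sure you no longer need it.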

ezod
  • 2
    Yep, definitely considering this as an easy but drastic option. I would archive off the .git repo and then do this. I'm hoping for something a little less drastic :) – greggles Oct 12 '12 at 23:15
  • 1
    `git init`. Why `--init` ? – E Ciotti Sep 23 '14 at 15:19
  • 4
    basically: `move .git /somewhere/else; git init; git add .; git commit -m "initial commit"; git add origin [repoUrl]; git push origin --force` – E Ciotti Sep 23 '14 at 15:21
  • I did this, but the remote repo on Github somehow still had the commits around. It said `1 commit`, which was great, but links to the old commits still worked. What's worse, when I cloned the repo fresh from github, it was still as big as ever, even though the one where I `rm -rf .git` is now small. I totally don't get it. There's one commit. Pruning doesn't help. `git gc` doesn't help. – mlissner Apr 20 '22 at 23:40
  • 1
    OK, for future visitors, the answer to my problem is: Wait. Github just takes a while sometimes, apparently. I cloned again after trying to fix this for an hour, and the clone is now nice and lean. Bleh, so much for that hour of my life. – mlissner Apr 20 '22 at 23:58
  • For future visitors following the basic steps in E Ciotti's excellent comment: I found the way to add the remote was `git remote add origin `, based on [here](https://stackoverflow.com/a/47984500/8508004). The step after that also needed variations, partly because `git init` seems to create a `master` branch by default while my GitHub repo had `main` (and no `master` branch): `git branch -m master main`; `git symbolic-ref refs/remotes/origin/HEAD refs/remotes/origin/main`; `git push --set-upstream origin main --force`. Based on ... – Wayne Feb 18 '23 at 19:53
  • [here](https://pythonforundergradengineers.com/how-to-change-a-github-repo-from-master-to-main.html) and what git suggested when I tried to push at the end. – Wayne Feb 18 '23 at 19:54
12

I think you can shrink your history by following this answer:

How to delete a specific revision of a github gist?

Start an interactive rebase (git rebase -i) and, in the list of commits it shows, decide which points in history you want to keep.

pick <hash1> <commit message>
pick <hash2> <commit message>
pick <hash3> <commit message>   <- keep
pick <hash4> <commit message>
pick <hash5> <commit message>
pick <hash6> <commit message>   <- keep
pick <hash7> <commit message>
pick <hash8> <commit message>
pick <hash9> <commit message>
pick <hash10> <commit message>  <- keep

Then leave the very first commit, and the first commit after each "keep", as "pick", and mark the others as "squash".

pick   <hash1> <commit message>
squash <hash2> <commit message>
squash <hash3> <commit message>   <- keep
pick   <hash4> <commit message>
squash <hash5> <commit message>
squash <hash6> <commit message>   <- keep
pick   <hash7> <commit message>
squash <hash8> <commit message>
squash <hash9> <commit message>
squash <hash10> <commit message>  <- keep

Then, run the rebase by saving and quitting the editor. At each "keep" point, the message editor will pop up for a combined commit message ranging from the previous "pick" up to the "keep" commit. You can then either just keep the last message or in fact combine those to document the original history without keeping all intermediate states.

After that rebase, the intermediate file data will still be in the repository, but it is now unreferenced. git gc will then get rid of that data (once the reflog entries that still point at it have expired).
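
If editing the list by hand is too tedious, the same squash pattern can be scripted. This is only a sketch under some assumptions: it runs in a throwaway repository with six made-up commits, it relies on GNU sed for `-i`, and it uses `GIT_SEQUENCE_EDITOR` to edit the todo list and `GIT_EDITOR=true` to accept the combined commit messages unchanged. The line ranges are specific to this demo.

```shell
set -e
# Demo identity so the commits work on a machine without git config (illustrative).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q
for i in 1 2 3 4 5 6; do
    echo "data $i" > "file$i.txt"
    git add "file$i.txt"
    git commit -q -m "commit $i"
done

# Todo lines 1 and 4 stay "pick"; lines 2-3 and 5-6 become "squash",
# so the states at commit 3 and commit 6 are the ones that survive.
GIT_SEQUENCE_EDITOR="sed -i -e '2,3s/^pick/squash/' -e '5,6s/^pick/squash/'" \
GIT_EDITOR=true \
git rebase -q -i --root

# Expire the reflog so the old commits are unreferenced, then prune them now.
git reflog expire --expire=now --all
git gc --quiet --prune=now

kept=$(git rev-list --count HEAD)
echo "commits kept: $kept"
```

In this demo no file is ever deleted, so nothing large is freed; the point is only the mechanics. Space is saved when the squashed range contains file versions that no surviving commit still references.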

Community
Tilman Vogel
  • 2
    This seems like it might be helpful if I just squash every single commit (or every commit before X date) but that seems tedious. Is there a more automated way to do it? – greggles Oct 12 '12 at 23:12
  • 1
    Also, my whole goal is to save disk space so I wonder if you have some stats on how much space this might save in a large repo (~10GB of relatively large files). If I just remove meta-data but not information about removed objects then I think this won't help much. – greggles Oct 12 '12 at 23:14
  • 2
    By removing a commit, you are removing the metadata and references to the tree data. If that means the last reference is dropped (no other commit reference the specific contents), the actual payload is removed on next `gc`. E.g., if you are squashing all commits from the addition of a given file up to the commit in which it is removed again, the file data will actually be dropped at `gc`. – Tilman Vogel Oct 13 '12 at 09:14
4

$drop is a shell variable that you set yourself, to the commit you are looking for (the point in history you want to drop).
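
Here is how I read the command from the question, as a sketch rather than a definitive explanation. The assumption is that `$drop` holds the full hash of the newest commit you want to discard: the parent filter strips that hash from the parent list of its child, which turns the child into a new root commit, and `--all ^$drop` limits the rewrite to commits that come after it. The five demo commits and file names are made up.

```shell
set -e
# Demo identity so the commits work on a machine without git config (illustrative).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Skip the warning and delay that newer git versions print for filter-branch.
export FILTER_BRANCH_SQUELCH_WARNING=1

cd "$(mktemp -d)"
git init -q
for i in 1 2 3 4 5; do
    echo "data $i" > "file$i.txt"
    git add "file$i.txt"
    git commit -q -m "commit $i"
done

# Assumption: $drop is the full hash of the newest commit to discard.
# Here that is commit 3, so commits 1-3 go away and commits 4-5 stay.
drop=$(git rev-parse HEAD~2)

git filter-branch --parent-filter "sed -e 's/-p $drop//'" \
        --tag-name-filter cat -- \
        --all ^$drop > /dev/null

remaining=$(git rev-list --count HEAD)
echo "commits remaining: $remaining"
```

Because every commit stores a full snapshot, the new root commit still contains all files that existed at that point; only the older history is cut off. The old commits stay on disk until the backup refs under `refs/original/` and the reflog entries are removed and `git gc` runs.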

If you want to clean up unnecessary files and optimize the local repository, take a look at the git gc command.

git prune is another option: it removes objects that are no longer reachable from any branch or other reference.
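
To see what that means in practice, here is a small sketch in a throwaway repository. It writes a blob that nothing references (using `git hash-object -w`, purely for illustration) and shows that `git gc --prune=now` deletes it while committed content stays. Without `--prune=now`, git keeps recently created unreachable objects for a grace period.

```shell
set -e
# Demo identity so the commit works on a machine without git config (illustrative).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q
echo "kept" > kept.txt
git add kept.txt
git commit -q -m "initial commit"

# Write a blob that no commit, tree, or ref points to.
orphan=$(echo "no longer needed" | git hash-object -w --stdin)
git cat-file -e "$orphan" && echo "before gc: orphan blob exists"

# Expire reflog entries, then collect garbage without the default grace period.
git reflog expire --expire=now --all
git gc --quiet --prune=now

if git cat-file -e "$orphan" 2>/dev/null; then
    orphan_state=present
else
    orphan_state=pruned
fi
echo "after gc: orphan blob is $orphan_state"

# Summary of loose and packed objects after the cleanup.
git count-objects -v
```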

I hope this helps.

Iver
  • This does not apply to any objects that still are in the history and that's what I think the question refers to. – Tilman Vogel Oct 12 '12 at 20:59
  • These seem helpful, but I'm still confused on how to use that command (e.g. what arguments to tweak to keep more or less history). – greggles Oct 12 '12 at 23:15
  • "git gc" calls "git prune". See https://git-scm.com/docs/git-prune#_notes – Hackless Jun 03 '17 at 20:58
3

If you want to find and remove large files from your Git history, Pro Git has a section called Removing Objects, which guides you through this process. It's a bit complicated, but it would allow you to remove files from your history that you have deleted anyway, while keeping the rest of your history intact.
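
The first step of that process is finding out which files take up the space. The sketch below is one way to do it and is not the exact set of commands from the book: it lists every blob reachable from any ref together with its size and picks the largest. The repository and the file names `notes.txt` and `big.bin` are made up, and the simple `awk` step assumes paths without spaces.

```shell
set -e
# Demo identity so the commits work on a machine without git config (illustrative).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

cd "$(mktemp -d)"
git init -q
echo "small" > notes.txt
head -c 200000 /dev/zero > big.bin
git add notes.txt big.bin
git commit -q -m "add files"
git rm -q big.bin
git commit -q -m "remove big.bin"

# List every blob reachable from any ref with its size and path, largest last.
largest=$(git rev-list --objects --all |
    git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
    awk '$1 == "blob" { print $2, $3 }' |
    sort -n |
    tail -n 1)
echo "largest blob in history: $largest"
```

Even though `big.bin` was deleted in the latest commit, it still shows up here because the first commit references it. Those are the files worth removing from history.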

kaezarrex
3

It is a bit complicated to have git forget about a file.

git rm will only remove the file on this branch from now on, but it remains in history and git will remember it.

The right way to do it is with git filter-branch, as others have mentioned here. It will rewrite every commit in the history of the branch to delete that file.

But even after doing that, git can still remember the file, because references to it can remain in the reflog, remotes, tags, and so on.
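
For reference, the manual steps can be sketched roughly as follows. This is an illustration in a throwaway repository with a made-up file `big.bin`, not what my utility does line by line, and it only covers local branches, the backup refs that filter-branch creates, and the reflog. Remotes and tags would need the same treatment.

```shell
set -e
# Demo identity so the commits work on a machine without git config (illustrative).
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
# Skip the warning and delay that newer git versions print for filter-branch.
export FILTER_BRANCH_SQUELCH_WARNING=1

cd "$(mktemp -d)"
git init -q
echo "small" > notes.txt
head -c 200000 /dev/zero > big.bin
git add notes.txt big.bin
git commit -q -m "add files"
git rm -q big.bin
git commit -q -m "remove big.bin"
big_blob=$(git rev-parse HEAD~1:big.bin)

# Rewrite every commit so that big.bin never appears in it.
git filter-branch --index-filter \
    'git rm --cached --ignore-unmatch -q big.bin' -- --all > /dev/null

# Drop the backup refs that filter-branch leaves under refs/original/.
git for-each-ref --format='%(refname)' refs/original/ |
    while read -r ref; do git update-ref -d "$ref"; done

# Expire the reflog and prune so the old objects are actually deleted.
git reflog expire --expire=now --all
git gc --quiet --prune=now

if git cat-file -e "$big_blob" 2>/dev/null; then
    blob_state=present
else
    blob_state=gone
fi
echo "big.bin blob is $blob_state"
```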

I wrote a little utility called git forget-blob

https://ownyourbits.com/2017/01/18/completely-remove-a-file-from-a-git-repository-with-git-forget-blob/

It is easy, just do git forget-blob file1.txt.

This will remove every reference, do git filter-branch, and finally run the git garbage collector git gc to completely get rid of this file in your repo.

nachoparker