
(solved, see bottom of the question body)
I've been looking for this for a long time now. What I have till now: the approaches I've found (the ones linked as possible duplicates in the comments) use pretty much the same method, but both of them leave objects in the pack files... Stuck.
What I tried:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch file_name'
rm -Rf .git/refs/original
rm -Rf .git/logs/
git gc

Still have files in the pack, and this is how I know it:

git verify-pack -v .git/objects/pack/pack-3f8c0...bb.idx | sort -k 3 -n | tail -3

And this:

git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch file_name" HEAD
rm -rf .git/refs/original/ && git reflog expire --all &&  git gc --aggressive --prune

The same...

Tried the git clone trick; it removed some of the files (~3000 of them), but the largest files are still there...

I have some large legacy files in the repository, ~200M, and I really don't want them there... And I don't want to reset the repository and start from scratch :(

SOLUTION: This is the shortest way to get rid of the files (a consolidated script follows the list):

  1. check .git/packed-refs - my problem was that it still contained a refs/remotes/origin/master line for a remote repository; delete that line, otherwise git won't remove those objects
  2. (optional) git verify-pack -v .git/objects/pack/#{pack-name}.idx | sort -k 3 -n | tail -5 - to check for the largest files
  3. (optional) git rev-list --objects --all | grep a0d770a97ff0fac0be1d777b32cc67fe69eb9a98 - to see which files those SHAs correspond to
  4. git filter-branch --index-filter 'git rm --cached --ignore-unmatch file_names' - to remove a file from all revisions
  5. rm -rf .git/refs/original/ - to remove git's backup
  6. git reflog expire --all --expire='0 days' - to expire all reflog entries, so nothing still references the old commits
  7. git fsck --full --unreachable - to check that the removed objects are now reported as unreachable
  8. git repack -A -d - repacking
  9. git prune - to finally remove those objects
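
For convenience, here is the same procedure as a single shell session (a sketch only; the pack file name, blob SHA, and file_names are placeholders you need to fill in, and git update-ref -d needs a reasonably recent Git):

# step 1: drop stale refs; either edit .git/packed-refs by hand or, on newer Git:
git update-ref -d refs/remotes/origin/master   # ref name is just an example
# steps 2-3: find the largest blobs and map their SHAs back to file names
git verify-pack -v .git/objects/pack/pack-*.idx | sort -k 3 -n | tail -5
git rev-list --objects --all | grep <blob-sha>
# step 4: rewrite history without the files
git filter-branch --index-filter 'git rm --cached --ignore-unmatch file_names'
# steps 5-6: remove the backup refs and reflog entries that still point at old commits
rm -rf .git/refs/original/
git reflog expire --all --expire='0 days'
# step 7: the large blobs should now be listed as unreachable
git fsck --full --unreachable
# steps 8-9: repack without them, then delete the loose copies
git repack -A -d
git prune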
Boris Churzin
  • Possible duplicates: http://stackoverflow.com/questions/2100907/how-to-purge-a-huge-file-from-commits-history-in-git/2158271 http://stackoverflow.com/questions/872565/how-do-i-remove-sensitive-files-from-gits-history – Greg Bacon Jan 29 '10 at 20:58
  • zneak - my question is in the title. gbacon - tried those, the files still remain in the pack file... – Boris Churzin Jan 29 '10 at 22:52
  • If you look at the article referenced in the duplicates, it shows how to compact your object store after the offending file has been removed. – Kyle Butt Jan 30 '10 at 01:30
  • You mean `git gc --aggressive --prune` Didn't work, it repacked everything, and the file is still there... – Boris Churzin Jan 30 '10 at 01:51
  • Does the blob in question show up in the output from `git fsck --full --unreachable`? – Dan Moulding Feb 01 '10 at 04:44
  • nope, git fsck --full doesn't return anything at all – Boris Churzin Feb 01 '10 at 09:35
  • This was a lifesaver. Mental note: always add potentially huge *.log files to .gitignore. Went from a 800mb repo to 6mb after this. – JackCA Aug 18 '10 at 21:20
  • step 2 and 3 in one: `for i in \`git verify-pack -v .git/objects/pack/#{pack-name}.idx | sort -k 3 -n | tail -5\` ; do git rev-list --objects --all | grep $(echo $i | sed 's/ .*//g') ; done` – geermc4 Jan 08 '13 at 01:31

8 Answers


I can't say for sure without access to your repository data, but I believe there are probably one or more packed refs still referencing old commits from before you ran git filter-branch. This would explain why git fsck --full --unreachable doesn't call the large blob an unreachable object, even though you've expired your reflog and removed the original (unpacked) refs.

Here's what I'd do (after git filter-branch and git gc have been done):

1) Make sure original refs are gone:

rm -rf .git/refs/original

2) Expire all reflog entries:

git reflog expire --all --expire='0 days'

3) Check for old packed refs:

This could potentially be tricky, depending on how many packed refs you have. I don't know of any Git commands that automate this, so I think you'll have to do this manually. Make a backup of .git/packed-refs. Now edit .git/packed-refs. Check for old refs (in particular, see if it packed any of the refs from .git/refs/original). If you find any old ones that don't need to be there, delete them (remove the line for that ref).
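
(Not mentioned in the original answer, but if your Git is recent enough you can avoid hand-editing: git update-ref -d deletes a single ref whether it is loose or packed. A sketch, with an illustrative ref name:)

# list all refs, packed or loose, to spot stale ones
git for-each-ref
# delete a stale ref; this also removes its line from .git/packed-refs
git update-ref -d refs/remotes/origin/master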

After you finish cleaning up the packed-refs file, see if git fsck notices the unreachable objects:

git fsck --full --unreachable

If that worked, and git fsck now reports your large blob as unreachable, you can move on to the next step.

4) Repack your packed archive(s):

git repack -A -d

This will ensure that the unreachable objects get unpacked and stay unpacked.

5) Prune loose (unreachable) objects:

git prune

And that should do it. Git really should have a better way to manage packed refs. Maybe there is a better way that I don't know about. In the absence of a better way, manual editing of the packed-refs file might be the only way to go.
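
A cheap way to verify the whole cleanup (my addition, not part of the answer above) is to compare git count-objects -v output before and after; size-pack is the total packed size in KiB:

# size-pack should drop sharply once the blob is repacked away and pruned
git count-objects -v
# the big blob's SHA should also disappear from the top of this listing
git verify-pack -v .git/objects/pack/pack-*.idx | sort -k 3 -n | tail -5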

Dan Moulding
  • Yey!!! I love you ! The problem was in packed-refs file, there was refs/remotes/origin/master from times I was backing it up on some server... once I removed it it all began to disappear... Thank you! (updating the question body with the full solution) – Boris Churzin Feb 02 '10 at 00:43

I'd recommend using the BFG Repo-Cleaner, a simpler, faster alternative to git-filter-branch specifically designed for removing files from Git history. One way in which it makes your life easier here is that it handles all references by default (all tags, branches, stuff like refs/remotes/origin/master, etc.), and it's also 10-50x faster.

You should carefully follow these steps here: http://rtyley.github.com/bfg-repo-cleaner/#usage - but the core bit is just this: download the BFG's jar (requires Java 6 or above) and run this command:

$ java -jar bfg.jar  --delete-files file_name  my-repo.git

Any file named file_name (that isn't in your latest commit) will be totally removed from your repository's history. You can then use git gc to clean away the dead data:

$ git gc --prune=now --aggressive
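
Putting the pieces together, a typical end-to-end session (following the BFG usage page; the clone URL is illustrative) looks like this:

$ git clone --mirror git://example.com/my-repo.git
$ java -jar bfg.jar --delete-files file_name my-repo.git
$ cd my-repo.git
$ git reflog expire --expire=now --all && git gc --prune=now --aggressive
$ git push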

The BFG is generally much simpler to use than git-filter-branch - the options are tailored around these two common use-cases:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Roberto Tyley
  • Does this also clean private data from remote repos after pushing? – Thomas Lauria Jul 23 '13 at 06:20
  • @ThomasLauria yup, the same cleaned refs are pushed to remote repos on pushing - the instructions at http://rtyley.github.io/bfg-repo-cleaner/#usage should cover it. If you have control over the remote repo, you can also run "git gc --prune=now --aggressive" on it after pushing to ensure dead objects are immediately removed from that also. – Roberto Tyley Jul 23 '13 at 08:11
  • @RobertoTyley This can result in two commits that appear after each other in the history and that have the same tree (if one of these commits only added the deleted file(s)). Do you know an easy way to remove such commits from the commit history, as they seem artificial? – user44400 Apr 19 '18 at 09:54
  • @RobertoTyley I think that concerns another issue. Only one repository is involved in the case I described. But `git filter-branch --prune-empty` seems to be the solution to my question (though using another tool, please let me know if the BFG Repo-Cleaner can do the same). – user44400 Apr 19 '18 at 12:32

I found this quite helpful for removing a whole folder, as the answers above didn't really help me: https://help.github.com/articles/remove-sensitive-data.

I used:

git filter-branch --force \
--index-filter 'git rm -rf --cached --ignore-unmatch folder/sub-folder' \
--prune-empty --tag-name-filter cat -- --all

rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now
git gc --aggressive --prune=now
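
As a sanity check (my addition, not part of the original recipe), the rewritten history should no longer mention the folder at all:

# should print nothing once the rewrite and cleanup succeeded
git log --all --oneline -- folder/sub-folder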
Mike Averto

I was trying to get rid of a big file in the history, and the above answers worked, up to a point. The catch is that they don't work if you have tags: if the commit containing the big file is reachable from a tag, you need to adjust the filter-branch command like this:

git filter-branch --tag-name-filter cat \
--index-filter 'git rm --cached --ignore-unmatch huge_file_name' -- \
--all --tags
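
To confirm the rewrite reached the tagged commits too, you can list every object reachable from any ref (a sketch using the example file name; run it after removing refs/original and expiring the reflog, or the backup refs will still match):

# empty output means no remaining revision carries the file
git rev-list --objects --all | grep huge_file_name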
BHMulder

This should be covered by the git obliterate command in Git Extras (https://github.com/visionmedia/git-extras).

git obliterate <filename>
Spain Train

See: How do I remove sensitive files from git’s history

The above will fail if the file does not exist in a rev. In that case, the '--ignore-unmatch' switch will fix it:

git filter-branch -f --index-filter 'git rm --cached --ignore-unmatch <filename>' HEAD

Then, to get all loose objects out of the repository:

git gc --prune='0 days ago'
Wayne Conrad
  • Yep, tried this one, still have the files in the pack, and the size didn't change too much... – Boris Churzin Jan 29 '10 at 22:53
  • I just made a git sandbox and tried it. No good here, either. Let's see what I can figure out. – Wayne Conrad Jan 30 '10 at 01:07
  • The one in the answer? :) It's the same as I posted, and it still leaves the file in the pack... try a git sandbox, doing git gc so it will pack the file, and then running this... – Boris Churzin Jan 30 '10 at 12:46
  • Oh, the loose objects? See above. I'd be inclined to just let them be garbage collected in two weeks (the default for gc); killing _all_ loose objects is like emptying the trash--I lose any opportunities to get back anything I accidentally deleted. – Wayne Conrad Jan 30 '10 at 15:18
  • :) tried this one too... got rid of some of the files, but the biggest are still there... – Boris Churzin Jan 30 '10 at 22:36
  • Drats. I thought that would do it. Do the files exist in any other branches? – Wayne Conrad Jan 30 '10 at 23:06
  • Have no other branches :) But I think that it might be that I moved the file from one dir to another once... I run the filter-branch on both paths, but that doesn't help... – Boris Churzin Jan 31 '10 at 13:06

There are various reasons why a git repo can stay large even after git gc, since gc does not remove all loose objects.

I detail those reasons in "reduce the git repository size"

But one trick to test in your case would be to clone your "cleaned" Git repo and see if the clone has the appropriate size.

(' "cleaned" repo ' being the one where you did apply the filter-branch, and then gc and prune)

VonC
  • Yep, tested it already, and tested it again now, it reduced repository by 2k :) and the files are still there... – Boris Churzin Feb 01 '10 at 10:06
  • What's weird is `git count-objects -v -> count: 0, size: 0, in-pack: 10021, packs: 1, size-pack: 244547, prune-packable: 0, garbage: 0` but: `git clone test1 test2 -> Checking out files: 100% (8509/8509), done` – Boris Churzin Feb 01 '10 at 10:11

I had the same problem, and I found a great tutorial on github that explains, step by step, how to get rid of files you accidentally committed.

Here is a little summary of the procedure as Cupcake suggested.

If you have a file named file_to_remove to remove from the history:

cd path_to_parent_dir

git filter-branch --force --index-filter \
  'git rm --cached --ignore-unmatch file_to_remove' \
  --prune-empty --tag-name-filter cat -- --all
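
The same tutorial then has you publish the rewritten history; shown here as a sketch, since force-pushing overwrites the remote branches and tags for everyone:

git push origin --force --all
git push origin --force --tags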
Cyril Leroux
  • Link only answers are highly discouraged on Stack Overflow, because if the link breaks in the future, then the answer becomes useless. Please consider summarizing the relevant information contained in the link in your answer. –  Apr 04 '14 at 00:05