
I have a rather large repository (11 GB, 900,000+ files) and am having trouble iterating over it within a reasonable time. After a bit of profiling, the real bottleneck seems to be git update-index:

$ time git update-index --replace $path > /dev/null

real    0m5.766s
user    0m1.984s
sys     0m0.391s

At nearly 6 seconds per path, working through the list of files would take an unbearable number of days (roughly 60 if all 900,000 files had to be updated). Is there any way to speed up the update-index operation?

For what it's worth, I'm running cygwin on Windows 7.

EDIT: To add more context to the question:

The large repository comes from an SVN import, and contains a number of binaries that shouldn't be in the repository. However, I want to keep the commit history and commit logs. In order to do that, I'm trying to replace the contents of the binaries with file hashes, which should compact the repository and allow me to retain history.

Roberto Tyley
webmage
  • That's a known problem; indeed, Git's big problem is its speed. One solution is to split your project (900,000 files is a huge number). I think this is a duplicate: http://stackoverflow.com/questions/3313908/git-is-really-slow-for-100-000-objects-any-fixes – Vince Aug 29 '12 at 13:34
  • @Vince Well, speed at massive repo scale. I'd still much rather issue `git branch` than `svn copy` any day on nearly any repo. – Christopher Aug 29 '12 at 13:44
  • @Christopher yeah yeah I'm not saying it's a showstopper. I tried it on a repo with 10,000 files and it worked pretty well. Anyway there are features like submodule if it's a problem. In this precise case, with 900,000 files, I believe Git is not the only one to be confused ... – Vince Aug 29 '12 at 13:46
  • @Vince sub-repositories are a solid idea. webmage, do the files change often or are huge portions of them static most of the time? – Christopher Aug 29 '12 at 13:49
  • @webmage are you sure it's a good idea to have binaries at all? I mean isn't it possible to generate those from the sources? and in this case, maybe you should just not index them. – Vince Aug 29 '12 at 14:06
  • @Christopher the files in question are mostly static and carried around for legacy reasons. I've added some additional context to the question to clarify the situation. Not sure if a batch script might be more effective - passing in multiple paths to update-index is slower than passing individual paths, but perhaps there's a better solution I'm missing. – webmage Aug 29 '12 at 14:06
  • @webmage did you try `git update-index --assume-unchanged` as stated in the linked page I gave? You can also unindex some folders, or use the gitignore feature: http://www.kernel.org/pub/software/scm/git/docs/gitignore.html – Vince Aug 29 '12 at 14:13
  • @Vince as mentioned, the binaries need to go. They shouldn't be in the repository. But I need the history on when which binaries were committed with which message. – webmage Aug 29 '12 at 14:18
  • @Vince I'm rewriting the file contents with a hash value before running update-index. Correct me if I'm wrong, but that would make --assume-unchanged redundant. – webmage Aug 29 '12 at 14:23

1 Answer


You want to use the BFG Repo-Cleaner, a faster, simpler alternative to git-filter-branch specifically designed for removing large files from Git repos.

Download the BFG jar (requires Java 6 or above) and run this command:

$ java -jar bfg.jar  --strip-blobs-bigger-than 1MB  my-repo.git

Any files over 1MB in size (that aren't in your latest commit) will be removed from your Git repository's history and replaced with a .git-id file that contains the old Git hash-id of the original file. This matches the question's requirement to replace the contents of the binaries with file hashes.
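Before rewriting anything, it may help to preview which blobs in the history exceed the threshold. The following is a minimal sketch, not part of the BFG: the function names `filter_big_blobs` and `list_big_blobs` are hypothetical, and the default of 1048576 bytes is an assumed reading of "1MB". It also lists blobs that are in the latest commit, which the BFG would leave alone.

```shell
#!/bin/sh
# Sketch only: function names and the byte threshold are illustrative assumptions.

# filter_big_blobs MIN_BYTES
# Reads "type sha size path" lines on stdin and prints "size<TAB>path"
# for every blob strictly larger than MIN_BYTES.
filter_big_blobs() {
    awk -v min="$1" '
        $1 == "blob" && $3 > min {
            size = $3
            # Drop the first three fields so paths containing spaces survive.
            sub(/^[^ ]+ [^ ]+ [^ ]+ /, "")
            print size "\t" $0
        }'
}

# list_big_blobs [MIN_BYTES]
# Walks every object reachable from any ref in the current repository
# and lists the large blobs, biggest first.
list_big_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        filter_big_blobs "${1:-1048576}" |
        sort -rn
}
```

Run `list_big_blobs` from inside the repository, optionally passing a different byte count. Because a blob that appears under several paths is listed once per path, the output is a rough guide rather than an exact count of what would be stripped.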

You can then use git gc to clean away the dead data:

$ git gc --prune=now --aggressive
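The two steps above can be strung together in a small script. This is a sketch under stated assumptions rather than a definitive procedure: the helper names `run` and `strip_big_blobs`, the `DRY_RUN` variable, and the repository location are hypothetical; the size is written as `1M`, the form shown in the BFG's own usage text; and the mirror clone and `git reflog expire` step follow the BFG's published usage instructions rather than the commands shown here.

```shell
#!/bin/sh
# Sketch only: helper names, DRY_RUN, and default values are illustrative assumptions.

# Print each command; execute it only when DRY_RUN is not set to 1.
run() {
    echo "+ $*"
    if [ "${DRY_RUN:-0}" != "1" ]; then
        "$@"
    fi
}

# strip_big_blobs REPO_URL [SIZE] [BFG_JAR]
strip_big_blobs() {
    repo_url=$1
    size=${2:-1M}
    bfg_jar=${3:-bfg.jar}
    mirror=$(basename "$repo_url" .git).git

    # Work on a fresh bare mirror so the existing clone stays untouched.
    run git clone --mirror "$repo_url" "$mirror" || return 1
    # Rewrite history, stripping blobs over $size that are not in the latest commit.
    run java -jar "$bfg_jar" --strip-blobs-bigger-than "$size" "$mirror" || return 1
    # Expire reflogs so the old blobs become unreachable, then repack and prune.
    run git -C "$mirror" reflog expire --expire=now --all || return 1
    run git -C "$mirror" gc --prune=now --aggressive
}
```

Calling `DRY_RUN=1 strip_big_blobs /path/to/my-repo.git` prints the four commands without running them, which is a cheap way to check the paths before committing to a rewrite of an 11 GB repository.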

The BFG is typically 10-50x faster than running git-filter-branch and the options are tailored around these two common use-cases:

  • Removing Crazy Big Files
  • Removing Passwords, Credentials & other Private data

Full disclosure: I'm the author of the BFG Repo-Cleaner.

Roberto Tyley