I created a repository that contains some binary files (yes, Git doesn't handle binary files that well, but in this repository the binaries are mandatory). Since then, performing a commit has become very resource-intensive.

During a commit, Git's memory usage reaches 2.7 GiB. Sometimes the operating system even kills the process because it uses up all remaining system resources.

This is probably due to the internally used diff algorithm, which has to take both the original and the new file into account and needs to hold at least one of them in memory (the second can be handled as a stream).

Is it possible to mark a file as binary and specify that the repository doesn't need to calculate the difference, but only needs to check for a new version? This could be done by handling both files as streams, and thus in constant memory. After all, storing the difference is probably as inefficient as copying the new version.
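To illustrate what "handling both files as streams" means here, a minimal Python sketch follows. It is not how Git works internally; the function name `same_content` and the 1 MiB chunk size are assumptions made purely for illustration.

```python
def same_content(path_a, path_b, chunk_size=1 << 20):
    """Compare two files chunk by chunk, holding at most two chunks in memory."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            chunk_a = a.read(chunk_size)
            chunk_b = b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # contents (or lengths) differ
            if not chunk_a:
                return True   # both streams ended at the same point
```

Memory use is bounded by the chunk size rather than the file size, which is the constant-memory property asked about.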

The Git repositories are maintained on the machine automatically. It would thus be nice if this could be automated as well, for instance by using the MIME type of the files to mark all binary files automatically.

Willem Van Onsem
  • Not really an answer, but would `git gc` help here? – user14717 Jan 05 '15 at 13:37
  • Also not really an answer: How about [git-annex](http://git-annex.branchable.com/)? – musiKk Jan 05 '15 at 13:57
  • Are you currently committing a change to the binary file? If not, I can't see why Git would need to do a diff. Even if you are, Git only records snapshots, so the only diff it would do would be to save disk space for similar blobs, and your binaries are probably not similar enough to even bother with. – Joseph K. Strauss Jan 05 '15 at 15:56

1 Answer

As mentioned in "Exclude a directory from git diff", you can exclude files/folders from diff, with a .gitattributes directive '-diff':

lib/* -diff
dist/js/**/*.js -diff
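Since the question asks for automation, here is a rough sketch of how such `-diff` lines could be generated. It does not use MIME types. Instead it assumes a simple heuristic (a NUL byte within the first 8000 bytes means "binary"), and the function names `looks_binary` and `binary_attribute_lines` are hypothetical. Paths containing whitespace or glob characters would need quoting or escaping, which this sketch does not handle.

```python
import os

def looks_binary(path, sniff_bytes=8000):
    """Heuristic (an assumption): a NUL byte near the start suggests binary data."""
    with open(path, "rb") as handle:
        return b"\0" in handle.read(sniff_bytes)

def binary_attribute_lines(root):
    """Return sorted .gitattributes lines that disable diff for binary-looking files."""
    lines = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Do not descend into Git's own metadata directory.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            full = os.path.join(dirpath, name)
            if looks_binary(full):
                rel = os.path.relpath(full, root).replace(os.sep, "/")
                # The leading slash anchors the pattern at the repository root.
                lines.append(f"/{rel} -diff")
    return sorted(lines)
```

The returned lines could then be appended to the `.gitattributes` file at the top of the repository, for example from a scheduled maintenance job.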

To avoid any out-of-memory issue caused by git diff, you also have, since Git v2.2.0 (late 2014), the configuration setting core.bigfilethreshold.
(And the default size for a pack file has been raised).
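As a sketch of how that setting could be applied from an automated maintenance script: the helper names below are hypothetical, and the `50m` value is only an example (the documented default is 512 MiB). Files above the threshold are stored without delta compression.

```python
import subprocess

def big_file_threshold_command(size="50m", scope="--local"):
    """Build the git-config call; `size` accepts git's k/m/g suffixes."""
    return ["git", "config", scope, "core.bigFileThreshold", size]

def set_big_file_threshold(repo_dir, size="50m"):
    """Apply the setting inside repo_dir (requires git on PATH)."""
    # check=True raises CalledProcessError if git exits with an error.
    subprocess.run(big_file_threshold_command(size), cwd=repo_dir, check=True)
```

For example, `set_big_file_threshold("/path/to/repo", "20m")` would lower the threshold to 20 MiB for that repository only; the path is a placeholder.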

Finally, additional features like GVFS (Git Virtual File System, 2017) will help with that kind of issue, and already allow Microsoft to manage the largest Git repository on the planet (the Windows codebase: approximately 3.5M files, about 300GB, with 1,760 daily "lab builds" across 440 branches, in addition to thousands of pull request validation builds). This capability is yet to be fully integrated into Git, but it illustrates what is possible.

VonC