
I've read on various internet resources that Git doesn't handle large files very well, and that Git also seems to have problems with large overall repository sizes. This seems to have given rise to projects like git-annex, git-media, git-fat, git-bigfiles, and probably even more...

However, after reading Git Internals, it looks to me like Git's pack file concept should solve all the problems with large files.

Q1: What's the fuss about large files in Git?

Q2: What's the fuss about Git and large repositories?

Q3: Suppose we have a project with two binary dependencies (e.g. around 25 DLL files, each around 500 KB to 1 MB) which are updated on a monthly basis. Is this really going to be a problem for Git? Is only the initial clone going to take a long time, or is working with the repository (e.g. changing branches, committing, pulling, pushing, etc.) going to be an everyday problem?

D.R.
  • Possible duplicate: http://stackoverflow.com/questions/540535/managing-large-binary-files-with-git?rq=1 – Aaron Digulla Jun 24 '14 at 09:27
  • Not a duplicate; I'm interested in the background, not in the tools I've already mentioned in my question. – D.R. Jun 24 '14 at 09:32

1 Answer


In a nutshell, today's computers handle large files badly. Moving megabytes around is pretty fast, but gigabytes take time. Only specialized tools are built to handle gigabytes of data, and Git isn't one of them.

More specific to Git: Git compares files all the time. If the files are small (a few KB), these operations are fast. If they are huge, Git has to compare many, many bytes, and that takes time, memory and nerves.
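For a concrete picture, here is a rough sketch of everyday commands that force Git to read, hash or delta-compress entire file contents (big.dll is just a placeholder name):

    git add big.dll              # hashes the whole file into a new blob
    git diff HEAD~1 -- big.dll   # loads both versions to compare them
    git repack -a -d             # searches for deltas between the stored blobs
    git gc                       # triggers the same repack/delta search

With files of a few KB this is negligible; with multi-GB files each of these steps becomes noticeably slow.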

The projects which you list add special handling for large files, such as saving them as individual blobs without trying to compare them to previous versions. That makes everyday operations faster, but at the cost of repository size. And Git needs free disk space in the order of the repository size for some operations, or you'll get errors (and maybe a corrupted repo, since this code path is the one least likely to be tested).
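As a sketch of what that special handling looks like in practice, here is the basic git-annex workflow (one of the tools listed in the question); the file name is illustrative, and the drop/get steps assume at least one other remote already holds a copy of the content:

    git annex init                   # prepare the repository for git-annex
    git annex add ThirdParty.dll     # move the content into .git/annex and
                                     # stage only a small pointer (symlink)
    git commit -m "Add third-party DLL"
    git annex drop ThirdParty.dll    # remove the local copy when not needed
    git annex get ThirdParty.dll     # fetch it back from another remote

The large content never enters the normal object database, so Git never tries to delta-compress it against earlier versions.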

Lastly, the initial clone will take a long time.

Regarding Q3: Git isn't a backup tool. You probably don't want to be able to get the DLL from ten years ago, ever.

Put the sources for those libraries under Git and then use a backup/release process to handle the binaries (like keeping the last 12 months' worth on some network drive).
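A minimal sketch of that approach, with hypothetical paths and file names:

    # keep the binaries out of version control
    echo '*.dll' >> .gitignore
    echo '*.pdb' >> .gitignore
    git add .gitignore
    git commit -m "Ignore third-party binaries"

    # pull the current binaries from the release share instead
    cp /mnt/releases/third-party/current/*.dll lib/
    cp /mnt/releases/third-party/current/*.pdb lib/

The repository stays small, and binaries older than your retention window can simply be deleted from the share.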

Aaron Digulla
  • (1) In which cases does Git compare files with each other (except when staging/committing files)? I'm not going to diff those DLLs... so for everyday work everything is fine except the overall repository size (Git compresses...) and the initial cloning time? (2) We do not back up those DLLs; they are binary references to a 3rd-party library which is not available via NuGet or something similar. There are no sources available to us either. Is it a practical problem if you add 25 2-5MB DLL/PDB files to a Git repository? – D.R. Jun 24 '14 at 09:31
  • Your question says that you add 25 MB every month to this repo. So after a year, it will be 300 MB. The problem here is that you can't get rid of this bloat. The specialized tools allow you to delete binary data from a Git repo when you no longer need it (see the sketch after these comments). – Aaron Digulla Jun 24 '14 at 10:19
  • Maybe the answer here is: You can use Git but it's not well suited. You probably won't have big problems but some operations might be slow. I don't know enough about the internals of Git to tell you which ones specifically. So do you really need to be able to recover every single version of each DLL, even when it's several years old? – Aaron Digulla Jun 24 '14 at 10:21
  • As a side note: I have several multi-GB Git repos here. One contains huge installation files (10-500MB each). Those change very rarely (once a year). I'm aware that there might be problems with my approach, but at this time I'm the only one who has to work with the repo, and since I rarely change it, it doesn't matter if Git is slow (and so far, the performance has been okay). But then, I have 1TB of free disk space and 16GB of RAM. – Aaron Digulla Jun 24 '14 at 10:23
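To make the contrast from these comments concrete, here is a hedged sketch (assuming the git-annex setup from the answer, with illustrative paths): a specialized tool can drop binary content that no branch references anymore, while plain Git forces you into a history rewrite that invalidates every existing clone.

    # git-annex: drop content no longer referenced by any branch or tag
    git annex unused              # prints a numbered list of unreferenced content
    git annex dropunused 1-25     # drops items 1 to 25 from that list

    # plain Git: the classic, disruptive way to purge old binaries
    git filter-branch --index-filter \
        'git rm --cached --ignore-unmatch lib/*.dll' \
        --prune-empty -- --all
    git reflog expire --expire=now --all
    git gc --prune=now --aggressive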