
Sorry to bring up this topic again, as there are so many other questions already related to it - but none that covers my problem directly.

What I'm searching for is a good version control system that can handle just two simple requirements:

  1. store large binary files (>1GB)
  2. support a repository that's >1TB (yes, that's TB)

Why? We're in the process of repackaging a few thousand software applications for our next big OS deployment and we want those packages to follow version control.

So far I've got some experience with SVN and CVS, but I'm not quite satisfied with the performance of either with large binary files (a few MSI or CAB files will be >1GB). Also, I'm not sure they scale well with the amount of data we're expecting over the next 2-5 years (like I said, an estimated >1TB).

So, do you have any recommendations? I'm currently also looking into SVN externals as well as Git submodules, though that would mean several individual repositories for each software package, and I'm not sure that's what we want.

Christoph Voigt
  • You sure you want a version control system? That means every minor change to a >1GB binary file leaves a >1GB copy of the old version of that file somewhere on disk. You might consider using a database instead, since many databases support blob formats which allow you to save the data on disk rather than internal to the database (much faster that way). – Neil Mar 08 '11 at 15:19
  • You can also consider Git with git-lfs: see [my answer below](http://stackoverflow.com/a/29530927/6309) – VonC Apr 09 '15 at 06:04
  • @Neil Wrong. For example, Subversion supports binary diffs by design and won't create a 1GB copy of a 1GB file for every minor change. – bahrep Oct 06 '16 at 21:12

10 Answers

12

Take a look at Boar, "Simple version control and backup for photos, videos and other binary files". It can easily handle huge files and huge repositories.

Mats Ekberg
  • Does Boar support setting up a remote repo for sharing files among peers? – Harvey Lin Sep 28 '16 at 21:05
  • Looks great! Since you're obviously the guy behind Boar, could you tell me what I should expect as likely differences between Boar and e.g. Plastic SCM from another answer (https://stackoverflow.com/a/29221311/1599699)? I would like to have a Windows + Linux supported, version-controlled, diff-able, incremental, large + binary file supporting backup solution, and so far these look like the best two options. – Andrew Oct 24 '17 at 05:35
  • Boar doesn't seem to have been updated since 2017. Is its name a play on the similar Borg project at https://www.borgbackup.org? – user2023370 Apr 01 '19 at 14:43
5

Old question, but perhaps worth pointing out that Perforce is in use at lots of large companies, and in particular at games development companies, where multi-terabyte repositories with many large binary files are common.

(Disclaimer: I work at Perforce)

Robert Cowham
  • Perforce stores binary files without delta encoding, which makes it totally useless for big repositories. If I have, for example, a 5GB file and change it only slightly (for example, append to it), Perforce adds each change as a whole file: 2 changes = 10GB of space. For data versioning this quickly ends up in a scenario where it's basically impossible to add further versions. Storing deltas is essential for big data repositories. – Marc J. Schmidt Apr 07 '20 at 20:20
3
  • store large binary files (>1GB)
  • support a repository that's >1TB (yes, that's TB)

Yep, that is one of the cases Apache Subversion should fully support.

So far I've got some experience with SVN and CVS, however I'm not quite satisfied with the performance of both with large binary files (a few MSI or CAB files will be >1GB). Also, I'm not sure if they scale well with the amount of data we're expecting in the next 2-5 years (like I said, estimated >1TB)

Up-to-date Apache Subversion servers and clients should have no problem handling that amount of data, and they scale well. Moreover, there are various repository replication approaches that should improve performance in case you have multiple sites with developers working on the same projects.

I'm currently also looking into SVN Externals as well as Git Submodules, though that would mean several individual repositories for each software package and I'm not sure that's what we want..

svn:externals has nothing to do with support for large binaries or multi-terabyte projects. Subversion scales well and supports a very large data and code base in a single repository. Git does not: with Git, you'd have to divide and split the projects into multiple small repositories. This is going to lead to a lot of drawbacks and a constant PITA. That's why Git has a lot of add-ons such as git-lfs that try to make the problem less painful.

bahrep
2

Update May 2017:

Git, with the addition of GVFS (Git Virtual File System), can support virtually any number of files of any size, starting with the Windows repository itself: "The largest Git repo on the planet" (3.5M files, 320GB).
This is not yet >1TB, but it can scale there.

The work done on GVFS is slowly being proposed upstream (that is, to Git itself), but that is still a work in progress.
GVFS is implemented on Windows, but will soon be available for Mac (because the team developing Office for Mac demands it) and Linux.


April 2015

Git can actually be considered a viable VCS for large data with Git Large File Storage (LFS), released by GitHub in April 2015.

git-lfs (see git-lfs.github.com) can be tested with a server supporting it, such as lfs-test-server (or directly with github.com itself): you store only metadata in the git repo, and the large files elsewhere.

https://cloud.githubusercontent.com/assets/1319791/7051226/c4570828-ddf4-11e4-87eb-8fc165e5ece4.gif
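To make the pointer idea concrete, here is a minimal sketch (the `*.msi`/`*.cab` patterns are just examples matching the question; the `.gitattributes` lines are written out literally - which is what `git lfs track "*.msi"` would generate - so the sketch runs with plain git, no LFS install needed):

```shell
# Sketch: route large installer formats through the LFS clean/smudge
# filter. "git lfs track '*.msi'" would append these lines itself;
# they are written literally here so the example runs with plain git.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
echo '*.msi filter=lfs diff=lfs merge=lfs -text' >> .gitattributes
echo '*.cab filter=lfs diff=lfs merge=lfs -text' >> .gitattributes
git add .gitattributes
# Both patterns are now marked for pointer storage:
grep -c 'filter=lfs' .gitattributes    # prints 2
```

From then on, commits of matching files store only small pointer files in the repo itself; the actual gigabyte-sized content goes to the configured LFS server.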

VonC
  • How is this different than git-annex? It says that "The contents of 'annexed' files are not stored in git, only the names of the files and some other metadata remain there." Sounds about the same as git for large files? – Harvey Lin Sep 28 '16 at 18:20
  • The principle is the same, the implementation, based on content filtering, is quite different. – VonC Sep 28 '16 at 18:56
  • Can you elaborate on what you meant by "content filtering" please? – Harvey Lin Sep 28 '16 at 21:04
  • @HarveyLin I sure can. I sure have more than a year ago: http://stackoverflow.com/a/29531702/6309. I actually don't like this mechanism. The one I would love was presented in 2013: http://stackoverflow.com/a/17897705/6309. Instead, lfs and bup are the current implementations: http://stackoverflow.com/a/19494211/6309. But the idea of a custom backend is not dead and is currently being implemented within git itself! libgit2 is a first example: http://stackoverflow.com/a/36125713/6309. https://github.com/git/git/blob/master/Documentation/RelNotes/2.6.0.txt#L96-L100 prepares the way! – VonC Sep 28 '16 at 21:50
  • Does git-lfs load file deltas into memory like git does? If it does then it wouldn't work for me, since I am using it for disk image files that run 6GB a piece. – Harvey Lin Sep 28 '16 at 23:36
  • @HarveyLin no delta that I know of: LFS is a quick hack based on existing git features for filtering data in and out of a git repo. – VonC Sep 29 '16 at 06:12
  • Thanks VonC, I am working on my IT people to set this up so I can track the large files that are going to be in my git repo. Since git-lfs only tracks pointer files, where do I set up the storage for the actual files? – Harvey Lin Oct 03 '16 at 18:28
  • @HarveyLin the idea is to store them in large-capacity storage (we are using a NAS, http://searchstorage.techtarget.com/definition/network-attached-storage, but plenty of other solutions exist). – VonC Oct 03 '16 at 20:43
  • Thanks VonC, can you show me a couple of examples of those solutions if you don't mind, thanks! – Harvey Lin Oct 04 '16 at 22:48
2

Version control systems are for source code, not binary builds. You are better off just using standard network file server backup tapes for binary file backup - even though it's largely unnecessary when you have source code control since you can just rebuild any version of any binary at any time. Trying to put binaries in source code control is a mistake.

What you are really talking about is a process known as configuration management. If you have thousands of unique software packages, your business should have a configuration manager (a person, not software ;-) ) who manages all of the configurations (a.k.a. builds) for development, testing, release, release-per-customer, etc.

HardCode
  • I agree on the principles. I'm not a big fan of using versioning systems that are built for code as pimped file shares for binaries either. The problem with configuration management is that it relies on human review, or at least human-driven process management - which is not so easy to raise awareness for. The perks that come with a versioning system (changelog, easy RSS access etc.) are nonexistent on a simple file share. Actually, I was hoping to introduce structure through technology rather than organisation, because the organisational approach has failed countless times in my company. – Christoph Voigt Mar 08 '11 at 17:42
  • Yes, configuration management does rely on human review. But that doesn't mean it can't be supported by software, however, such as a configuration database that tracks, for example, which build a customer currently uses. As for a technology for storing builds, using a file server is exactly what a file server is for. – HardCode Mar 09 '11 at 16:35
  • It is common, yet not correct, to limit the definition of version control to source code. There is a realistic need for versioning and branching with practically the same use cases that source code brings up; minus all use cases that use intra-file analysis. The fact that SVN, Git, Hg and the gang do not handle this use case in a usable manner is not a reason to reject the need. – Paul Jun 28 '13 at 10:05
  • Saying "you can just rebuild any version of any binary at any time" overlooks guaranteeing that you have *exactly* the same tools (compiler, linker, etc.) and configuration so that the source code will be transformed to binaries in *exactly* the same way. Which brings us back to the OP's original question of how do you version control large binary files (such as g++.exe)? – Technophile Jul 20 '16 at 18:35
  • Binaries in a version control system are not necessarily a bad idea. If you need a binary and cannot create it from source, you may have to. Reproducible binaries shouldn't go in. – Thorbjørn Ravn Andersen Sep 10 '16 at 08:57
  • Downvoted because your answer is just wrong. Version control is a generic concept. It could very well be for binary files too. – Andrew Oct 24 '17 at 05:16
  • For instance, see Mats's answer below: https://stackoverflow.com/a/5328485/1599699 So there's Boar, and I've heard of bup, rsync, ... (depending on how we define VCS's). – Andrew Oct 24 '17 at 05:28
  • The OP was clearly talking about source code compiled into applications, not things such as photos and videos, etc, which would indeed have a place with version control in something like Boar. Keep in mind this question is 6 years old, and things have changed over time. – HardCode Oct 24 '17 at 15:57
2

When you really have to use a VCS, I would use SVN, since SVN does not require copying the entire repository to the working copy. But it still needs about double the disk space, since it keeps a pristine copy of each file in the working copy.

With this amount of data I would look for a document management system, or (lower-tech) use a read-only network share with a defined input process.

Rudi
1

This is an old question, but one possible answer is https://www.plasticscm.com/. Their VCS can handle very large files and very large repositories. They were my choice when we were choosing a couple years ago, but management pushed us elsewhere.

gregsohl
1

You might be much better off simply relying on a NAS device that provides a combination of filesystem-accessible snapshots together with single-instance store / block-level deduplication, given the scale of data you are describing...

(The question also mentions .cab & .msi files: usually the CI software of your choice has some method of archiving builds. Is that what you are ultimately after?)

conny
  • If there were CI software aimed not at software development but at packaging, it would make my life easier :) There are a few CI-ish influences in applications for software packaging (like AdminStudio); the problem is that they expect you to be the actual developer of the software and have everything related to your package in one place and, more importantly, one technology. They're very, **very** limited when you look at them closely... – Christoph Voigt Mar 08 '11 at 18:35
  • So you are specifically *not* looking for a build server; then what you are looking for is probably some sort of more general system supporting **digital asset management**. – conny Mar 09 '11 at 00:02
0

There are a couple of companies with products for "Wide Area File Sharing." They can replicate large files to different locations, but have distributed locking mechanisms so only one person can work on any of the copies. When a person checks in an updated copy, that is replicated to the other sites. The major use is CAD/CAM files and other large files. See Peer Software (http://www.peersoftware.com/index.aspx) and GlobalSCAPE (http://www.globalscape.com/).

jfriedmanlex
0

The perks that come with a versioning system (changelog, easy RSS access etc.) are nonexistent on a simple file share.

If you only care about the versioning metadata features and don't actually care about the old data then a solution that uses a VCS without storing the data in the VCS may be an acceptable option.

git-annex is the first one that came to my mind, but from the "what git-annex is not" page it seems there are other similar, but not exactly the same, alternatives.

I have not used git-annex, but from the description and walkthrough it sounds like it could work for your situation.
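For a feel of the principle, here is a toy sketch (not git-annex's real on-disk layout - the actual store lives under `.git/annex/objects/` with different key names; the `.store` directory and file names below are invented for illustration): the working tree keeps only a checksum-named link, and the large content sits outside what the VCS would version.

```shell
# Toy sketch of the annex idea: replace a large file with a symlink
# into a checksum-keyed store; the VCS would then version only the
# tiny link. Paths and layout here are invented for illustration.
set -e
work=$(mktemp -d)
cd "$work"
mkdir -p .store
printf 'pretend-this-is-1GB' > big.iso
key=$(sha256sum big.iso | cut -d' ' -f1)
mv big.iso ".store/$key"         # content stored once, keyed by hash
ln -s ".store/$key" big.iso      # only this link would be committed
cat big.iso                      # prints pretend-this-is-1GB
```

Because the store is keyed by checksum, two identical 1GB files cost 1GB, and the repository history only ever grows by the size of the links.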

Arrowmaster