20

I have a project for analyzing images. The test data for this project is about 15 GB of images. Question: where should such test data be stored, given that multiple versions must be kept and most of the developers need this data?

In the same repository as the code? In a separate repository with an external reference?

Brian Tompsett - 汤莱恩
  • Are the files large, or is it just many "normal" files? In the first case, consider using [git-annex](http://git-annex.branchable.com/) – CharlesB Apr 23 '12 at 08:36
  • In this project, the test data contains only images of about 3 MB per file. But in another project we work with video, and the test data contains video files of about 400-600 MB each. – Alexander Kholodovitch Apr 23 '12 at 08:47

4 Answers

13

I would agree with the other answers that it's a bad idea to keep this much test data in your repository. There are a couple of systems, however, that let you conveniently refer to (and download) large data from outside your git repository:

I'm afraid that I haven't used either for any serious purpose myself, but they sound like plausible solutions to what you want.

Niko Föhr
Mark Longair
3

If these images are only needed by developers or people wanting to run the tests, I would possibly put them in a submodule since they seem to be quite sizeable...

Michael Wild
3

You need to store them in a separate repository that is better suited to this kind of file.

Use an artifact repository like Nexus as proposed here.
Add to your DVCS repo the scripts needed to fetch the right versions from Nexus.

That way, you can clone your sources quickly and easily, and download the binaries from the second repository only when needed.

Community
VonC
2

There are many options; however, you should take care to properly integrate any solution into your git repository.

Git is a revision control system; more precisely, it stores a list of snapshots of your project. Each snapshot represents your project at a certain point in time.

Whatever solution you choose, it should be possible for your project to extract test data as it existed at any snapshot in the project history.

For example, if each image exists permanently at a fixed URL, your git project can simply store a text file with all the URLs. At runtime, have a script fetch each image. As your project evolves and images are added or removed from the test set, do not alter the existing URL scheme. Update the pointer file and commit that.
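As a rough illustration of the pointer-file idea, here is a minimal Python sketch. It assumes a plain text file with one URL per line (the file layout and the names `read_url_list`, `local_path_for` and `fetch_all` are made up for this example, not part of any existing tool). Files already present in the local cache are skipped, which is only safe because the URLs are assumed never to change content.

```python
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def read_url_list(manifest_text):
    """Return the URLs listed in the pointer file, one per line.

    Blank lines and lines starting with '#' are skipped.
    """
    urls = []
    for line in manifest_text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            urls.append(line)
    return urls


def local_path_for(url, cache_dir):
    """Map a URL to a file inside cache_dir, mirroring the URL's host and path."""
    parsed = urlparse(url)
    return Path(cache_dir) / parsed.netloc / parsed.path.lstrip("/")


def fetch_all(manifest_path, cache_dir):
    """Download every listed file that is not already in the cache."""
    urls = read_url_list(Path(manifest_path).read_text(encoding="utf-8"))
    for url in urls:
        target = local_path_for(url, cache_dir)
        if target.exists():
            continue  # assumption: a fixed URL always serves the same bytes
        target.parent.mkdir(parents=True, exist_ok=True)
        with urllib.request.urlopen(url) as response:
            target.write_bytes(response.read())
```

Checking out an older commit and running `fetch_all("test-images.txt", ".test-data")` (both names hypothetical) would then fetch the test set as it was listed at that commit.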

Another good idea might be to record the md5 or sha1 hash of the image at each URL. Your download script can then compare the recorded hash with the hash of each downloaded file at runtime, so that you are alerted to any inconsistencies.
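The hash check could look something like the following Python sketch. It assumes the committed file holds lines of the form `<sha1-hex> <url>`; that format and the function names are assumptions made for this example. Only the standard `hashlib` module is used.

```python
import hashlib


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_hash_manifest(text):
    """Parse lines of the (assumed) form '<sha1-hex> <url>' into {url: sha1}."""
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sha1_hex, url = line.split(None, 1)
        expected[url] = sha1_hex.lower()
    return expected


def find_mismatches(expected, downloaded):
    """Return the URLs whose downloaded file is missing or has the wrong hash.

    expected:   {url: sha1 hex digest} taken from the committed manifest
    downloaded: {url: local file path} produced by the download step
    """
    bad = []
    for url, sha1_hex in expected.items():
        path = downloaded.get(url)
        if path is None or sha1_of_file(path) != sha1_hex:
            bad.append(url)
    return bad
```

A test runner could call `find_mismatches` after downloading and refuse to start if the returned list is not empty.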

Jacob Groundwater