
Let's say one wants to create a 1 TB git repository in which data is updated here and there regularly. Whether that's a good idea is not part of the question (it probably isn't), so let's take it as a prerequisite. It's 1 TB of ASCII data in the working directory, but how it's chunked into files (lots of tiny ones, fewer larger ones, etc.) can be chosen arbitrarily.

Git with large files gives some good info on that, in particular that xdelta seems to become the bottleneck for large files, while git gc seems to become the problem for a huge number of files (though that answer is from 2013, so it may well be outdated).

Git LFS or VFS for Git is not to be employed; the data should be versioned and contained in the repository itself.

I'm definitely going to run some tests, but my question is whether anyone has practical experience with this and could recommend (or make an educated guess about) the range in which the optimal file size per chunk might be found.

matthias_buehlmann

2 Answers


Your question is operating-system specific and file-system specific. BTW, both Bismon and RefPerSys manage their data in textual, git-versioned files, much as you intend to.

Your question is related to this one.

On a Linux system, I would recommend that you:

  • have files of more than 64 kilobytes and less than a dozen megabytes each (very small files, and very big ones, tend to be handled less efficiently).
  • have directories of fewer than a thousand inode(7)s each, but with at least a dozen inodes on average (including subdirectories). Otherwise path_resolution(7) could be slowed down, and glob(7)-ing in your Unix shell becomes unwieldy (for humans). So prefer a file tree like data/dir01/file0102 .... data/dir99/file34356 to a single directory containing a million inodes (see the sketch after this list).
  • choose your file system carefully (Ext4, XFS, etc.).
  • back up your data: both GNU tar and afio (which is capable of compressing individual files) have limitations.
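
A minimal sketch of such a layout, in Python; the data/ root, the dNNN / fNNNNNNNNN names, the 500-way fan-out and the ~1 MB chunk size are illustrative assumptions, not part of the recommendation itself:

```
import os

FILES_PER_DIR = 500   # keep well under ~1000 inodes per directory

def chunk_path(index: int, root: str = "data") -> str:
    """Map a running chunk index to a sharded path such as data/d003/d324/f000912345."""
    group, rest = divmod(index, FILES_PER_DIR * FILES_PER_DIR)
    subdir, _ = divmod(rest, FILES_PER_DIR)
    return os.path.join(root, f"d{group:03d}", f"d{subdir:03d}", f"f{index:09d}")

# 1 TB at ~1 MB per file is about a million files; with a 500-way fan-out at
# two levels, every directory (including the top-level one) stays far below
# a thousand entries.
```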

You could also contact researchers in bioinformatics. They probably have the same issue as you do.

You could also benchmark this yourself by generating arbitrary data. My manydl.c program can generate an arbitrary number of C files (compiled into dlopen(3)-ed plugins); you could adapt it for your benchmarking purposes.
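
If C plugins are not what you need, a rough stand-in (an assumption-laden Python sketch, not manydl.c) could generate pseudo-random ASCII files in a sharded tree and then time the git operations you actually care about:

```
#!/usr/bin/env python3
# Benchmarking sketch: generate pseudo-random ASCII files, then time git.
# File count, file size and fan-out are assumptions to adjust for your tests.
import os, random, string, subprocess, time

def generate(root="data", n_files=1000, size=1 * 1024 * 1024, per_dir=500):
    rng = random.Random(42)
    alphabet = string.ascii_letters + string.digits + " "
    for i in range(n_files):
        subdir = os.path.join(root, f"dir{i // per_dir:04d}")
        os.makedirs(subdir, exist_ok=True)
        with open(os.path.join(subdir, f"file{i:06d}.txt"), "w") as f:
            written = 0
            while written < size:
                line = "".join(rng.choices(alphabet, k=79)) + "\n"
                f.write(line)
                written += len(line)

def timed(*cmd):
    start = time.time()
    subprocess.run(cmd, check=True)
    print(f"{' '.join(cmd)}: {time.time() - start:.1f}s")

if __name__ == "__main__":
    generate()                       # ~1 GB of ASCII text; scale up as needed
    timed("git", "init", "-q")
    timed("git", "add", "data")
    timed("git", "commit", "-q", "-m", "benchmark commit")
```

Varying n_files and size across runs should show where git add, git commit and git gc start to degrade for your workload.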

The choice of hardware (SSD or rotating disk) is also relevant.

For critical data, I would also suggest keeping (at backup time) some hash (e.g. MD5, or better) of every textual file.
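
A minimal sketch of that, assuming the data lives under a data/ directory and using SHA-256 rather than MD5 (the paths and file names are just one possible choice):

```
#!/usr/bin/env python3
# Write a SHA-256 manifest of every file under data/, in the same two-space
# format sha256sum uses, so a backup can later be re-verified with
# `sha256sum -c MANIFEST.sha256`.
import hashlib, os

def write_manifest(root="data", manifest="MANIFEST.sha256"):
    with open(manifest, "w") as out:
        for dirpath, _, filenames in os.walk(root):
            for name in sorted(filenames):
                path = os.path.join(dirpath, name)
                digest = hashlib.sha256()
                with open(path, "rb") as f:
                    for block in iter(lambda: f.read(1 << 20), b""):
                        digest.update(block)
                out.write(f"{digest.hexdigest()}  {path}\n")
```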

Basile Starynkevitch

This sounds very much like bup's target use case. It splits data at content-defined block boundaries: a block ends wherever the low 13 bits of a rolling checksum over a 64-byte window are all ones (0x1fff), which yields blocks of about 8 kB on average. So it looks possible, maliciously or through spectacular bad luck, to produce very small or very large blocks, but those won't actually break it.
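
To make the mechanism concrete, here is a simplified content-defined chunker in the same spirit (a generic Buzhash-style rolling hash, not bup's actual rollsum; the window size, mask and random table are the only parameters):

```
# Sketch only: a block ends whenever the low 13 bits of the rolling hash
# over the last 64 bytes are all ones, i.e. roughly once every 8192 bytes
# for well-mixed input.
import random
from collections import deque

WINDOW = 64
MASK = 0x1FFF                                       # 13 one-bits -> ~8 kB average blocks
_rng = random.Random(0)
TABLE = [_rng.getrandbits(32) for _ in range(256)]  # fixed random byte table

def _rotl1(x):                                      # rotate a 32-bit value left by one bit
    return ((x << 1) & 0xFFFFFFFF) | (x >> 31)

def split_blocks(data: bytes):
    """Split data into content-defined blocks."""
    blocks, start, h = [], 0, 0
    window = deque([0] * WINDOW)                    # zero-padded window hashes to 0
    for i, b in enumerate(data):
        out = window.popleft()
        window.append(b)
        # WINDOW is a multiple of 32, so the outgoing byte's table entry needs
        # no extra rotation before it is removed from the 32-bit hash.
        h = _rotl1(h) ^ TABLE[out] ^ TABLE[b]
        if (h & MASK) == MASK:                      # boundary pattern found
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])                 # trailing partial block
    return blocks
```

Because each boundary decision depends only on the bytes inside the 64-byte window, an edit near the start of a file only shifts the boundaries around the edit; blocks further away are reproduced byte-for-byte, which is what lets this kind of chunking deduplicate large, slightly-changed data.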

jthill