
Here is the situation: an ad-hoc analytics repository with one directory per analysis. Each directory contains one or more scripts together with one or more data files that come in different formats and are of different (sometimes considerable) sizes. Scripts without data are generally useless, so we would like to store the data files. On the other hand, it is sometimes useful to look at a script without being forced to download the associated data file(s) (to determine how some analysis was conducted).

We definitely don't want to store the data in a separate repository (runtime issues, associating scripts with data files, etc.).

What was analyzed:

  • git submodules - a separate repo; everything would be kept away from the scripts (not in the same directories), so it would get messy over time
  • git hooks - rather intended for applying constraints or extra actions to pushes, and, as stated above, everyone should be able to upload any file (besides, we don't have access to install server-side hooks)

The idea that comes to me is that it would be convenient to exclude some locations or certain files (e.g. much larger than 50 MB) from being pulled or cloned from the repository, simply so as not to transfer unwanted data. Is that possible?

If some files are not touched in subsequent commits, they are not needed from the perspective of future pushes. Probably (or even certainly) I'm lacking some knowledge of Git's underlying mechanisms. I would be grateful for clarification.

iku

2 Answers


git clone --no-checkout --filter=blob:limit=100m

This should allow fetching only files smaller than a given size when servers finally implement it.

Then you have to check out all files except the big ones. A simple strategy might be to do something along the lines of git rev-list --filter=blob:limit=100m --objects HEAD | xargs ...

TODO: I haven't managed to make this work yet. Here's a good test repository with some very large and some very small files: https://github.com/cirosantilli/test-git-partial-clone-big-small-no-bigtree

If I run:

git clone --no-checkout --filter=blob:limit=10k https://github.com/cirosantilli/test-git-partial-clone-big-small-no-bigtree
git rev-list --filter=blob:limit=100m --objects HEAD

then the rev-list itself starts downloading all files, including the big ones, so it doesn't work as desired.
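One thing that might avoid this is a sketch I haven't verified: the assumption is that passing --missing=print makes rev-list just report the missing (filtered-out) big blobs instead of lazily fetching them. Then keep only the blob paths that are present locally and check out just those:

git clone --no-checkout --filter=blob:limit=10k https://github.com/cirosantilli/test-git-partial-clone-big-small-no-bigtree
cd test-git-partial-clone-big-small-no-bigtree

# list reachable objects; missing (filtered-out) blobs are printed with a leading '?'
# drop those '?' lines, keep only blobs (not commits/trees) and print their paths,
# then check out just those paths (-d '\n' is GNU xargs)
git rev-list --objects --missing=print HEAD \
  | grep -v '^?' \
  | git cat-file --batch-check='%(objecttype) %(rest)' \
  | awk '$1 == "blob" && NF > 1' \
  | cut -d' ' -f2- \
  | xargs -d '\n' git checkout HEAD --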

At How do I clone a subdirectory only of a Git repository? I describe how to download only a specific directory; that method does work fine, however.

Git LFS

This is a solution that can already be used on GitHub and GitLab.

You just track your large blobs in LFS, and then clone without downloading the LFS files (see: How to clone/pull a git repository, ignoring LFS?).
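For completeness, setting up the tracking on the committing side might look roughly like this (a sketch; the *.dat pattern and file name are just examples):

git lfs install
git lfs track "*.dat"            # files matching this pattern go to LFS from now on
git add .gitattributes
git add measurements.dat         # hypothetical large data file
git commit -m "Track large data files with LFS"

Then, to clone while skipping the LFS smudge (download) step: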

GIT_LFS_SKIP_SMUDGE=1 git clone SERVER-REPOSITORY

and finally manually pull any missing LFS files that you may want: https://github.com/git-lfs/git-lfs/issues/1351

git lfs pull --include "*.dat"
Ciro Santilli OurBigBook.com

Git sparse checkout lets you choose which subdirectories to check out, etc. I don't think it can filter on anything else (e.g. file size), though.
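For reference, the classic pattern-based setup looks roughly like this (a sketch; the paths are just examples, and, as the comment below notes, the clone itself still downloads all the object data):

git clone --no-checkout SERVER-REPOSITORY repo
cd repo
git config core.sparseCheckout true
# gitignore-style patterns for what should appear in the working tree
echo 'some-analysis-dir/' >> .git/info/sparse-checkout
echo '*.R' >> .git/info/sparse-checkout
git checkout BRANCH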

Michael
  • as I understand it, there is still the problem of git clone downloading all the data, am I right? – iku Oct 13 '15 at 16:10
  • Yes, it's not much use, I'm afraid, if you're cloning from scratch every time. I had a quick search around and, perhaps you already found it, but this page is actually quite comprehensive on the subject of oversized Git repositories being a pain, for various reasons, and with various suggestions to fix or help: http://blogs.atlassian.com/2014/05/handle-big-repositories-git/ It has some extra suggestions and info you might find helpful. – Michael Oct 13 '15 at 16:35