2

I'm a bioinformatician currently extracting normal-sized sequences from genomic files. Some genomic files are large enough that I don't want to put them into the main git repository, whereas I'm putting the extracted sequences into git.

Is it possible to tell git "Here's a large file - don't store the whole file, just take its checksum, and let me know if that file is missing or modified"?

If that's not possible, I guess I'll have to either git-ignore the large files, or, as suggested in this question, store them in a submodule.

Andrew Grimm

3 Answers

6

I wrote a script that does this sort of thing. You put file patterns in the .gitattributes file for large media that you don't want going in your git repo and it can store them on S3 instead. It's just a starting point, but I think it's usable if you're interested.

http://github.com/schacon/git-media

Maybe that will help you, or at least show you how it could be done and you can customize it for your specific needs.

Scott Chacon
  • Amazon S3 wouldn't be an option for me (we're a little nervous about giving data to third parties). Are you planning on options that don't use third parties at some stage? – Andrew Grimm Oct 01 '09 at 23:20
  • 1
    @Andrew: I modded the script to support storing files via SCP on your own private server, instead of on S3. Or you can store the files on a mapped network drive. Also I sped it up a bit. See here http://github.com/davr/git-media – davr Jul 19 '10 at 17:42
2

In the upcoming release of git there will be a 'refs/replace/' mechanism, which I think could be adapted for this purpose (assuming that the number of such large media files, and the number of versions of each, isn't very large).

In the slim fork of your project you would have (as Seth wrote) 'stub' files in place of your large media files; the contents of each stub would be the SHA-1 of the large file's blob (from "git hash-object -t blob <filename>").

Then in the full fork of your project you would use the "refs/replace/" mechanism to replace those 'stub' files with the true contents (using git replace). Some hooks would be required to keep the SHA-1 in the 'stub' files in sync with the actual large media files.

Then if you want a full clone, you also fetch from the "refs/replace/" namespace; if you want a slim clone, you don't fetch "refs/replace/".

Note: I haven't actually tested such a setup; also, this isn't yet available in git unless you run 'master'.
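To illustrate what a stub file would contain: git computes a blob's SHA-1 over the header "blob <size>" followed by a NUL byte and then the file contents, so the same value that "git hash-object -t blob <filename>" prints can be reproduced without git. The sketch below is only an illustration of that step; the helper names (`git_blob_sha1_of_bytes`, `git_blob_sha1_of_file`, `write_stub`) are hypothetical and not part of git or of the setup described above.

```python
import hashlib
import os


def git_blob_sha1_of_bytes(data):
    """Return the SHA-1 git assigns to a blob holding exactly these bytes."""
    header = b"blob %d\0" % len(data)
    return hashlib.sha1(header + data).hexdigest()


def git_blob_sha1_of_file(path, chunk_size=1 << 20):
    """Same hash as above, but streamed so a large file never sits in memory."""
    digest = hashlib.sha1()
    # The header needs the total size up front, so ask the filesystem for it.
    digest.update(b"blob %d\0" % os.path.getsize(path))
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_stub(large_path, stub_path):
    """Hypothetical helper: write the blob SHA-1 of large_path into a stub file."""
    sha1 = git_blob_sha1_of_file(large_path)
    with open(stub_path, "w", encoding="ascii") as handle:
        handle.write(sha1 + "\n")
    return sha1
```

A hook that keeps stubs in sync could call something like `write_stub` whenever a large file changes; how that hook is wired up is left open here, as in the answer itself.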

Jakub Narębski
  • Very cool! I didn't know about this. Where does one get such information? The git mailinglist, Junio's blog? Is there some kind of an announcement service, "this week in git.git" or something like Jon Masters' daily LKML summary podcast? I find that it is sometimes hard to follow new features in Git, e.g. what's up with git-notes? – Jörg W Mittag Oct 01 '09 at 17:12
  • I watch the git mailing list, so that is how I know. You can watch for RelNotes instead; the information about `refs/replaces/` is in http://git.kernel.org/?p=git/git.git;a=blob;f=Documentation/RelNotes-1.6.5.txt (so they are in git version 1.6.5; my mistake) – Jakub Narębski Oct 01 '09 at 18:51
  • Errr... git version 1.6.5 is the **next** version to be released (as of 01-10-2009) – Jakub Narębski Oct 01 '09 at 18:52
  • Also Junio C Hamano is submitting "What's in git.git ..." and "What's cooking in git.git ..." messages quite regularly; you can read them in RSS format thanks to http://gitrss.q42.co.uk (select "status" feed) – Jakub Narębski Oct 01 '09 at 18:56
  • Is it spelt 'refs/replace', rather than 'refs/replaces'? Also, is documentation for this command available yet? – Andrew Grimm Oct 01 '09 at 23:35
1

How about storing the hashes in a text file, then giving the text file to git? Then you could write a hook that compared hashes, so every time you checked in or checked out, you could be notified of what was missing / different.

Not exactly what you want, and you would still have to maintain the text file manually.
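A minimal sketch of that idea, under some assumptions of mine: the hashes are SHA-256, and the tracked text file holds one "<hex digest>  <relative path>" entry per line (the layout that `sha256sum` prints). The function names and the manifest layout are illustrative choices, not anything git prescribes.

```python
import hashlib
import os


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large genomic files are not loaded at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def read_manifest(manifest_path):
    """Parse '<hex digest>  <relative path>' lines into {path: digest}."""
    entries = {}
    with open(manifest_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            digest, rel_path = line.split(None, 1)
            entries[rel_path] = digest
    return entries


def check_manifest(manifest_path, root="."):
    """Return a list of (path, 'missing' | 'modified') for files that differ."""
    problems = []
    for rel_path, expected in read_manifest(manifest_path).items():
        full_path = os.path.join(root, rel_path)
        if not os.path.isfile(full_path):
            problems.append((rel_path, "missing"))
        elif sha256_of_file(full_path) != expected:
            problems.append((rel_path, "modified"))
    return problems
```

A hook script (for example pre-commit or post-checkout) could call `check_manifest` on the tracked manifest, print whatever it returns, and exit non-zero if the list is not empty. The manifest still has to be regenerated whenever a large file is changed on purpose.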

Seth