I have a 190 MB plain text file that I want to track on GitHub.

The text file is a pronunciation lexicon for our text-to-speech engine. We regularly add and modify lines in the file, and the diffs are fairly small, so it's perfect for git in that sense.

However, GitHub has a strict 100 MB file size limit in place. I have tried the GitHub Large File Storage service, but that uploads a new version of the entire 190 MB file every time it changes - so that would quickly grow to many gigabytes if I go down that path.

I would like to keep the file as a single file instead of splitting it, because that's how our workflow currently works, and it would require some coding to allow multiple text files as input/output in our tools (and we don't have many development resources to spare).

One idea I've had is to set up some pre- and post-commit hooks that split and concatenate the big file automatically. Would that be possible?
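
To make the idea a bit more concrete, a rough sketch of what I have in mind is below (untested; lexicon.txt is just a placeholder name for our file, and it assumes the Unix split and cat tools are available):

# .git/hooks/pre-commit (sketch): split the big file into pieces under 100 MB and stage them
split -b 90M lexicon.txt lexicon.part.
git add lexicon.part.*

# .git/hooks/post-checkout (sketch): rebuild the big file from the committed pieces
cat lexicon.part.* > lexicon.txt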

Other ideas?

Edit: I am aware of the 100 MB file size limitation described in the similar questions here on Stack Overflow, but I don't consider my question a duplicate because I'm asking about the specific case where the diffs are small and frequent (I'm not trying to upload a big ZIP file or anything). However, my understanding is that git-lfs is only appropriate for files that rarely change, and that normal git would be a perfect fit for the kind of file I'm describing, except that GitHub has a file size restriction.

Update: I spent yesterday experimenting with a small cross-platform program that splits and joins files using git hooks. It kind of works, but not really satisfactorily. The big text file has to be excluded by .gitignore, which makes git unaware of whether or not it has changed. The split files are not initially detected by git status or git commit, which leads to the same issue described in this SO question, which is quite annoying: Pre-commit script creates mysqldump file, but "nothing to commit (working directory clean)"? Setting up a cron job (Linux) and a scheduled task (Windows) to regenerate the split files regularly might fix that, but it's not easy to set up automatically, might cause performance issues on users' computers, and is just not a very elegant solution. Some hacky workarounds like dynamically modifying .gitignore might also be needed, and in no way would you get a diff of the actual text file, only of the split files (although that might be acceptable, as they would be very similar).

So, having slept on it, today I think the git hook approach is not a good option after all, as it has too many quirks. As suggested by @PyRulez, I think I'll have to look at services other than GitHub (unfortunately, since I love GitHub). A hosted solution would be preferable, to avoid having to manage our own server. I'd also like it to be publicly available...

Update 2: I've looked at some alternatives to GitHub and currently I'm leaning towards using GitLab. I've contacted GitHub support about the possibility of raising the 100 MB limit, but if they won't do that, I'll just switch to GitLab for this particular project.

josteinaj
  • Possible duplicate of [not able to push file more than 100mb to git hub](http://stackoverflow.com/questions/29586977/not-able-to-push-file-more-than-100mb-to-git-hub) – Mayuso Jan 11 '16 at 14:23
  • @Mayuso I know this sounds similar to other questions, but this question regards the specific case where I have a text file which has frequent but small diffs and if that makes it possible to work around the 100 MB limitation somehow. I understand binaries would not be possible. – josteinaj Jan 11 '16 at 14:49
  • I guess I did not understand the question well, already answered, sorry :) – Mayuso Jan 11 '16 at 14:57
  • No problem :), I should have been clearer. – josteinaj Jan 11 '16 at 15:00
  • Maybe use something besides gitHub? – PyRulez Jan 11 '16 at 21:58
  • @PyRulez I'm open for other suggestions if you know about other git services that allows me to track a 190 MB text file (although I kinda like having our Windows-users use GitHub Desktop). – josteinaj Jan 12 '16 at 12:25
  • @josteinaj Dropbox (look up Dropbox+git) – PyRulez Jan 12 '16 at 17:08
  • @josteinaj really, any file sharing thing would work (owncloud, bit torrent sync, etc...) – PyRulez Jan 12 '16 at 17:10
  • @josteinaj OW, and gitHub for windows and mac apparently works with any git repo, not just gitHub (according to [this](https://git-scm.com/book/en/v2/Git-in-Other-Environments-Graphical-Interfaces)), so you could have a git+dropbox+GitHub Desktop workflow!!! This [link](http://haacked.com/archive/2012/05/30/using-github-for-windows-with-non-github-repositories.aspx/) explains how. – PyRulez Jan 13 '16 at 02:08
  • @PyRulez yeah, that seems pretty cool. It won't work with pull requests etc. but I think we can do fine without that feature. Using Dropbox with git has some downsides. As pointed out [here](http://stackoverflow.com/questions/1960799/using-git-and-dropbox-together-effectively#comment2866900_1961515), it can cause synchronization errors if multiple users try to push at the same time (we are a small team, but I'd like to avoid it anyway). Also, it's not straight forward to make the repo public using Dropbox I think. The best would be to find a hosted git service which allows bigger files I think – josteinaj Jan 13 '16 at 10:29
  • @josteinaj I actually might have another solution (clean and smudge filters). If I have time, I'll write up an answer. – PyRulez Jan 13 '16 at 12:07
  • @josteinaj (As a side note, Dropbox makes it very easy to be a public repo. The only problem is the 2GB free limit (which will include your history.)) – PyRulez Jan 13 '16 at 12:20

3 Answers

Clean and Smudge

You can use clean and smudge filters to compress your file. Normally this isn't necessary, since git compresses it internally anyway, but since GitHub is acting weird, it may help. The main commands would look like this:

git config filter.compress.clean gzip
git config filter.compress.smudge "gzip -d"

GitHub will see this as a compressed file, but on each computer, it will appear to be a text file.
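
For the filter to actually apply, the file also has to be mapped to it in .gitattributes; a minimal sketch, where lexicon.txt is just a placeholder for your file name:

# .gitattributes
lexicon.txt filter=compress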

See https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes for more details.

Alternatively, you could have clean post the content to an online pastebin such as http://pastebin.com/, and have smudge fetch it back. Many other combinations are possible with clean and smudge.

PyRulez
  • Interesting solution, thanks! This might make the 190MB smaller than 100MB. I assume the gzipped files won't be diffable though so each time the file changes, a new file would be created. If gzip compresses from 190MB to maybe 50MB, that's still 50 new MB for every commit. – josteinaj Jan 14 '16 at 13:02
  • ...maybe if instead of gzipping, the files could be split as I attempted with git hooks earlier. I'm currently leaning towards switching to GitLab instead of GitHub though, so I'll let that be a future experiment. – josteinaj Jan 14 '16 at 13:05
  • @josteinaj see https://git-scm.com/book/en/v2/Customizing-Git-Git-Attributes#Binary-Files for how to properly diff them. – PyRulez Jan 14 '16 at 21:50
  • @josteinaj also, github for windows should work with gitlab, in case you were wondering. (Git is awesome.) – PyRulez Jan 14 '16 at 21:52
  • Interesting! Thanks :). Git is indeed awesome. – josteinaj Jan 15 '16 at 13:07
  • @josteinaj https://git-scm.com/docs/gitattributes has more in-depth materials for this answer. – PyRulez Jan 15 '16 at 15:00
  • +1 This is an absolutely brilliant answer! I had only one file clocking in at 116MB. I added the two filters and then named the single file I needed compressed in `.gitattributes`. Elegant! – aardvarkk Nov 03 '16 at 01:40
  • @pyrulez can you provide a little more info on what you add to the .gitattributes file? – Afflatus Feb 14 '17 at 19:07
  • You should use `gzip --rsyncable` so that the resulting binary files are more amenable to binary diffing to reduce the size of the repository. – Mathias Rav Aug 29 '18 at 10:19

A very good solution would be to use:

https://git-lfs.github.com/

It's an open-source extension designed to work with large files.
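
For reference, the basic setup looks roughly like this (the file name here is just an example):

git lfs install
git lfs track "lexicon.txt"
git add .gitattributes lexicon.txt
git commit -m "Track the lexicon with Git LFS"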

CodeWizard
  • Yes, I've tried it, but I make changes to the text file frequently so it would create a new 190MB file in LFS very often. As I understand LFS, it's best for files that rarely change. – josteinaj Jan 12 '16 at 12:23
  • I agree git-lfs in GitHub works well. The issue I ran into is that it has a bandwidth limit, which for an enterprise system will quickly be exceeded and/or become very expensive. Not only do they charge for the cost of storing the file, but in the context of bandwidth, you are paying every time you have developers pulling down your LFS repo or every pull. Even worse, if you have a CIS. Imagine a build system that has a binary that is 300MB in size and you have 1300 builds before you release. Every build pulls down that Git LFS repo. You end up with GitHub becoming a bit expensive. – ConfusedDeer May 22 '17 at 18:54
  • Nice, this was exactly what I was looking for! – Tiago Martins Peres May 11 '20 at 11:10

You can create a script/program in any language to split or join files.

Here is an example, written in Java, that splits a file (I used Java because I'm more comfortable with it than with any other language, but any other language would work, and some would be better suited than Java, too).

import java.io.BufferedOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;

public class FileSplitter {

    public static void main(String[] args) throws Exception {
        RandomAccessFile raf = new RandomAccessFile("test.csv", "r");
        long numSplits = 10; // from user input; extract it from args
        long sourceSize = raf.length();
        long bytesPerSplit = sourceSize / numSplits;
        long remainingBytes = sourceSize % numSplits;

        int maxReadBufferSize = 8 * 1024; // 8 KB read buffer
        for (int destIx = 1; destIx <= numSplits; destIx++) {
            // write the next chunk to split.<destIx>
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split." + destIx));
            if (bytesPerSplit > maxReadBufferSize) {
                long numReads = bytesPerSplit / maxReadBufferSize;
                long numRemainingRead = bytesPerSplit % maxReadBufferSize;
                for (int i = 0; i < numReads; i++) {
                    readWrite(raf, bw, maxReadBufferSize);
                }
                if (numRemainingRead > 0) {
                    readWrite(raf, bw, numRemainingRead);
                }
            } else {
                readWrite(raf, bw, bytesPerSplit);
            }
            bw.close();
        }
        // any leftover bytes go into one extra part
        if (remainingBytes > 0) {
            BufferedOutputStream bw = new BufferedOutputStream(new FileOutputStream("split." + (numSplits + 1)));
            readWrite(raf, bw, remainingBytes);
            bw.close();
        }
        raf.close();
    }

    // copy numBytes from the source file to the current split file
    static void readWrite(RandomAccessFile raf, BufferedOutputStream bw, long numBytes) throws IOException {
        byte[] buf = new byte[(int) numBytes];
        int val = raf.read(buf);
        if (val != -1) {
            bw.write(buf, 0, val); // write only the bytes actually read
        }
    }
}
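
For the other direction, a joining routine could look roughly like the sketch below; it assumes the pieces are named split.1, split.2, ... as produced above, and can be dropped into the same class (it additionally needs java.io.FileInputStream):

static void joinFiles(String outputName, int numParts) throws IOException {
    // Concatenate split.1 .. split.<numParts> back into a single output file.
    BufferedOutputStream out = new BufferedOutputStream(new FileOutputStream(outputName));
    byte[] buf = new byte[8 * 1024];
    for (int partIx = 1; partIx <= numParts; partIx++) {
        FileInputStream in = new FileInputStream("split." + partIx);
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        in.close();
    }
    out.close();
}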

This will cost almost nothing (time/money).

Edit: You can create a Java executable and add it to your repository, or, even easier, create a Python (or any other language) script to do this and keep it as plain text in your repository.

Mayuso
  • Thanks! Do you know if it would be possible to automatically run this before committing and automatically merge after checking out? – josteinaj Jan 11 '16 at 15:12
  • Check out the Unix/Linux `split` and `cat` commands. `split -b 100M big-file big-file-` ... `cat big-file-* > big-file` – Keith Thompson Jan 14 '16 at 02:36
  • @KeithThompson thanks. I knew about those but discarded the idea since I wanted it to work in Windows as well. However, it seems that git runs its git hooks in a bash environment even in Windows, so those commands might work there as well, I'm not sure. They would definitely be much simpler than implementing something myself (I created a small program in golang for testing). – josteinaj Jan 15 '16 at 13:10