
I'm reading about Git LFS and see time and again that it works great for "large files":

> Git Large File Storage (LFS) replaces large files such as audio samples, videos[...]

> Version large files—even those as large as a couple GB in size—with Git.

> Git Large File Storage (LFS) is a free, open-source extension that replaces large files with text pointers inside Git and stores the contents of those files on a remote server.

Unfortunately, I don't see anywhere what a "large file" actually is. It's clear that something that takes up several gigabytes is a large file, but what about something smaller?

Will I benefit from Git LFS with "large files" as small as 50 MB? 20 MB? 5 MB? 1 MB? Less than 1 MB?

How large does a "large file" have to be to benefit from Git LFS compared to regular Git?

Thunderforge
  • GitHub rejects commits with files > 100M. GitHub was there first, and after it all Git hosting services started to do the same with similar limitations. Bitbucket, AFAIR, limits files > 50M. – phd Feb 27 '18 at 22:56
  • I just saw this question, perhaps [this one](https://stackoverflow.com/q/57922231/5784831) should be linked? – Christoph Sep 13 '22 at 18:41

3 Answers


There is no exact threshold defining a "large file"; it is up to the user. To decide whether you need to store some files using Git LFS, you need to understand how Git works.

The most fundamental difference between Git and other source control tools (Perforce, SVN) is that Git stores a full snapshot of the repository on every commit. So when you have a large file, each snapshot contains a compressed version of that file (or a pointer to the existing file blob if the file wasn't changed). The snapshots are stored as a graph under the .git folder. Thus, if the file is "large" and changes often, the repository size will grow rapidly.
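
If you want to see this growth for yourself, here is a minimal sketch (the repo and file names are made up, and GNU dd syntax is assumed):

```bash
# Sketch: commit three versions of a ~20 MB binary and watch .git grow.
git init lfs-demo && cd lfs-demo
for i in 1 2 3; do
  # /dev/urandom gives incompressible content, so each version becomes a new full blob
  dd if=/dev/urandom of=big.bin bs=1M count=20 2>/dev/null   # use bs=1m on macOS
  git add big.bin
  git commit -q -m "version $i of big.bin"
done
git count-objects -vH   # total object size grows by roughly the full file per commit
```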

There are multiple criteria for deciding whether to store a file using Git LFS (a minimal tracking example follows the list).

  • The size of the file. IMO if a file is more than 10 MB, you should consider storing it in Git LFS

  • How often the file is modified. A large file (based on the user's intuition of a large file) that changes very often should be stored using Git LFS

  • The type of the file. A non-text file that cannot be merged is eligible for Git LFS storage
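
Acting on these criteria means telling Git LFS which patterns to manage. A minimal sketch (the `*.psd` / `*.mp4` patterns are just examples):

```bash
# Sketch: route matching files through Git LFS from now on.
git lfs install                  # one-time setup of the LFS hooks/filters
git lfs track "*.psd" "*.mp4"    # records the patterns in .gitattributes
git add .gitattributes
git commit -m "Track PSD and MP4 files with Git LFS"
```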

> Will I benefit from Git LFS with "large files" as small as 50 MB? 20 MB? 5 MB? 1 MB? Less than 1 MB?

Depending on how often the file changes, you can benefit at any of the sizes mentioned. Consider the case where you make 100 commits, editing the file each time. For a 20 MB file that compresses to, say, 15 MB, the repository size would increase by approximately 1.5 GB (100 × 15 MB) if the file is not stored using Git LFS.
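
If such a file is already committed, `git lfs migrate` can rewrite the existing history to use pointers instead. A sketch, assuming a remote branch named main; note that migrate rewrites commits, so collaborators must re-clone:

```bash
# Sketch: retroactively move all committed *.psd files to LFS.
git lfs migrate import --include="*.psd" --everything
git push --force-with-lease origin main   # history changed; force-push required
```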

yamenk
  • Considering the complete opposite of not using LFS: _Why not store all files in LFS?_ Because files in LFS [cannot easily be diffed](https://github.com/git-lfs/git-lfs/issues/440), basically breaking an important part of the version control system. – Brecht Machiels Jan 19 '21 at 16:02
  • I initially added some comments suggesting improvements to this answer, but realised they basically amounted to a complete rewrite, so have posted as my own answer below. – IMSoP Sep 13 '22 at 17:36

Most version control systems are optimised for "small-ish text files". Storing a 100 MB file in any VCS will take up at least 100 MB of file system space somewhere (assuming it can't easily be compressed). If you store 3 completely different versions, that's 300 MB of storage somewhere.

The difference with distributed version control systems, such as git, is that they include the full history in every working copy. That means every version of every file takes up space in every working copy, forever, even if the file is deleted in a later revision. (On a centralised VCS, this space would only be spent on the central server.)

There is however a bright side: git is quite smart about how it stores things, at two levels of abstraction:

  • At one level, git is a "content-addressed database": it stores "blobs" based on a hash of their content. That means that a file needs a new blob only when its content changes; and in fact, only when that content has never occurred before in the entire history of the repository (see the sketch after this list).
  • At the next level down, even that "blob" may not be stored in full on the file system, because a packfile may include it as a delta (a set of changes) from a similar blob.
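
The first level is easy to check for yourself; a sketch (file names and contents are made up, run inside any Git repository):

```bash
# Sketch: identical content always maps to the same blob ID.
echo 'placeholder logo bytes' > logo-v1.png
cp logo-v1.png logo-v2.png
git hash-object logo-v1.png logo-v2.png   # prints the same hash twice
```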

That leads to a few considerations of when LFS, or some other out-of-repo solution, might be useful:

  • How big is the file? If it's a few megabytes, that may be enough on its own to not include it in history.
  • How frequently does it change? A 100KB file re-generated with random content on every commit would add a megabyte to every working copy for every 10 commits.
  • Is every version actually different? If you have two different versions of a logo, and you keep changing your mind which to use, it won't take up any extra space as long as you use exactly the same two files. The same applies to renaming a file: if the content is unchanged, no extra "blob" is needed.
  • How different are the versions, from a binary point of view? If you keep appending to a very long log file, git will probably notice and store it in a relatively efficient packfile (see the sketch after this list).
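
Here is a sketch of that packfile behaviour (paths and sizes are made up): commit two versions of a file that only grows, repack, and look for deltas:

```bash
# Sketch: append-only history packs down to one full blob plus small deltas.
git init pack-demo && cd pack-demo
seq 1 100000 > app.log && git add app.log && git commit -q -m "log v1"
seq 1 110000 > app.log && git add app.log && git commit -q -m "log v2"
git gc --quiet                              # repack loose objects into a packfile
git verify-pack -v .git/objects/pack/*.idx | grep blob
# One blob is stored in full; the other appears as a small delta against it.
```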
IMSoP

LFS is a tool for maintaining a project's resources. Suppose you have a project with *.psd files used in the front end. These files are usually large, and each version bears little relation to the previous ones (Git stores history efficiently for text files, but that approach cannot be used for binary files: a diff of two .cpp files is meaningful, but a diff of two raw photos is not). So if you put such resources in the repository, its size and cloning time will grow unpleasantly, and maintenance will become hard.

How can you overcome this issue? One good idea is to split the database of large files from the code on the server side. Another is to let clients pull only the files they currently want to use on their local machine (i.e. not all previous versions).

What does LFS do? It hashes its tracked files and stores them as pointers to the original files, while the original files go into a separate database on the server side. Local repositories keep all of the pointers in their history, but when you check out a specific commit, only that commit's contents are pulled. The size of the local repository and the time to clone decrease impressively in this manner.
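
You can see such a pointer by asking Git for the committed blob; a sketch (the file name, hash, and size are made up):

```bash
# Sketch: the blob Git actually stores for an LFS-tracked file is tiny.
git show HEAD:design.psd
# version https://git-lfs.github.com/spec/v1
# oid sha256:4d7a214614ab2935c943f9e0ff69d22eadbb8f32b1258daaa5e2ca24d17e2393
# size 52428800
```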

PS: The way LFS receives files differs from Git's. So I think it uses some techniques to split large files, send them over parallel connections, and merge them back together... and such things that can improve its functionality. But what is important is that it can increase the time of clone/pull for hundreds/thousands of small files.

Also note that Git has problems with files larger than 4 GB on Windows.

Bonje Fir
  • This doesn’t answer my question of what size a file has to be in order to benefit from Git LFS. It talks about large and small files, but doesn’t define what they are. – Thunderforge Feb 28 '18 at 14:24
  • @Thunderforge Yes. I hoped to explain clearly that there is no *size* criterion for using LFS. It is more about the file's type (i.e. `.bin`, `.psd`, `.tif` rather than plain text) and the update frequency of a large file... – Bonje Fir Feb 28 '18 at 14:44
  • @Thunderforge There is no strict definition of a *large file* because there is no strict value! – Bonje Fir Feb 28 '18 at 14:52
  • *"But what is important is that it can increase time of clone/pull for hundred/thowsands of **small** files."* So what is a "small file" then? Is it going to be faster to use LFS with a 1 MB file or regular Git? – Thunderforge Feb 28 '18 at 15:35
  • Since files tracked with LFS are stored outside the actual Git repository, they have to be fetched separately, which takes some time. You can (or have to) decide on your own which files you add to the repo directly and which you track with LFS. Files up to a couple MB are usually fine. If you exceed 100 MB (which, btw., is the hard limit for files stored in Git on GitHub), you should definitely use LFS. In between, it's up to you to come up with a suitable decision based on your data. – Holger Just Feb 28 '18 at 15:56
  • @Thunderforge As I said above, it's a trade-off. Hundreds/thousands of files of 1 MB each? Generally, put your files in Git until the repository size becomes a problem (actually, you can get an insight into the update frequency before doing this)! For this scenario it is actually better to put them in Git. – Bonje Fir Feb 28 '18 at 15:57
  • That's the sort of information I'm looking for. You're saying that 1 MB is better with regular Git. My question ends with "Will I benefit from Git LFS with "large files" as small as 50 MB? 20 MB? 5 MB? 1 MB? Less than 1 MB?" That's what I want to know, and that's what your answer doesn't address. – Thunderforge Feb 28 '18 at 16:18