How do git's built-in large file handling features deal with checksumming files?

Question

It seems that the git team has been working on large binary file handling features that don't require git LFS - features like partial clone, and sparse checkout. That's great.

The one thing I'm not totally clear about is how these features are supposed to improve this issue:

Correct me if I'm wrong, but every time you run git status, git quickly does a checksum of all the files in your working directory and compares that to the stored checksums in HEAD to see which files changed. This works great for text files, and it is so common and so fast an operation that many shells build the current branch, and whether or not your working directory is clean, into the shell prompt.

With large files however, doing a checksum can take multiple seconds, or even minutes. That means every time you type git status, or in a fancy shell with a custom, git-enabled prompt hit "enter", it can take several seconds to checksum the large files in your working directory to figure out if they've changed. That means that either your git status command will take several seconds/minutes to return, or worse, EVERY command will take several seconds/minutes to return while your current working directory is in the git repo, as the shell itself will try to figure out the repo's current status to show you the proper prompt.

This isn't theoretical - I've seen this happen with git LFS. If I have a large, modified file in my working directory, working in that git repo becomes a colossal pain. git status takes forever to return, and with a custom shell, every single command you type takes forever to return as the shell tries to generate your prompt text.

Is this meant to be addressed by sparse checkout, where you just don't check out the large files? Or is there something else meant to address this?

You can enable caching so that git assumes a file hasn't changed if its modification time is the same as the last time you ran `git status`. — Raymond Chen, Feb 04 '22 at 04:18
@RaymondChen That would be awesome. What option is that? I wonder if it's possible to only enable caching on large files — John, Feb 04 '22 at 15:32
[Ways to improve git status performance](https://stackoverflow.com/questions/4994772/ways-to-improve-git-status-performance) — Raymond Chen, Feb 04 '22 at 18:35

score 2 · Accepted Answer · answered Feb 10 '22 at 01:23

Git stores certain information in the index, which reflects things like the file size, device and inode numbers, modification and inode change times, and various other attributes. If this information is changed, or is potentially stale, then Git will re-read the file to see if it's modified. This is potentially expensive, as you've noticed, and Git's detection here is the reason that Git LFS has this same performance problem: Git is telling Git LFS to reprocess the file.
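To see the cached information described above, `git ls-files --debug` prints the stat data recorded for each index entry. The following is a minimal sketch; the scratch repository and the file name `example.txt` are hypothetical and exist only so the commands can run anywhere.

```shell
# Hypothetical scratch repository, used only for illustration.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Stage one small file so the index has an entry to inspect.
echo "hello" > example.txt
git add example.txt

# --debug prints the stat data cached in the index for the entry:
# ctime, mtime, dev, ino, uid, gid, size, and flags.
git ls-files --debug example.txt
```

As long as these cached values still match what the filesystem reports for the file, Git can treat it as unchanged without reading its contents.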

What you want to do is find out what's modifying the attributes of your files. For example, file monitoring or backup software can cause this problem, as can a cloud syncing service (which you should avoid anyway because it will probably corrupt your repository). This can also happen if you mount the repository into a container, since the container will see different device and inode numbers, and then each time you switch the environment in which you run git status, the entire repository must be re-read.

On my systems, I don't have this problem and Git performs just fine. However, if you really can't figure it out, you can try setting `core.trustctime` to false and/or `core.checkstat` to minimal (which you should try in that order). That makes Git compare fewer of the stat fields stored in the index, so the cached data is less likely to look stale when nothing has changed. However, it also means that it's more likely that Git will fail to detect a legitimate change, so if you can avoid needing to do this, you should.
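As a rough sketch, the two settings can be applied per repository with `git config`. The scratch repository created here is hypothetical and only makes the snippet runnable on its own; in practice you would `cd` into the affected repository instead and skip the first three commands.

```shell
# Hypothetical scratch repository; replace with your real repository.
repo=$(mktemp -d)
git init -q "$repo"
cd "$repo"

# Step 1: stop treating a changed inode change time (ctime) as a sign
# that the file may have been modified.
git config core.trustctime false

# Step 2, only if step 1 alone does not help: compare a reduced set of
# stat fields instead of the full default set.
git config core.checkstat minimal

# Print the values now in effect for this repository.
git config --get core.trustctime
git config --get core.checkstat
```

Both settings are written to the repository's own configuration, so other repositories keep the stricter default checks.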

`various other attributes` - is there a full list of what git uses to decide if a file in the working directory has changed? I was under the assumption that it always ran a checksum on all files in the working directory on every invocation of `git status`, and running checksums on text files was just that fast that `git status` always returned almost immediately. Apparently that's not the case. — John, Feb 10 '22 at 17:04
Yes, the documentation of `core.checkstat` describes the items stored in the index. Run `git help git-config`. — bk2204, Feb 10 '22 at 23:28
