
This question was asked in various forms on SO and elsewhere, but no answer I was able to find has satisfied me, because none lists the problematic and non-problematic actions/commands, and none gives a thorough explanation of the technical reason for the speed hit.

So, I am forced to ask again:

  1. Of the basic Git actions (commit, push, pull, add, fetch, branch, merge, checkout), which become slower as the repo grows larger? (NOTICE: repos, not files, for this question.)

And,

  2. Why does each action depend (or not depend) on repo size?

I don't care right now about how to fix that. I only care about which actions' performance gets hit, and the reasoning, in terms of the current Git architecture.


Edit for clarification:

It is obvious that git clone, for instance, would be O(n) in the size of the repo.

However, it is not clear to me that git pull would be the same, because it is theoretically possible to look only at the differences.

Git does some non-trivial stuff behind the scenes, and I am not sure when, and which.
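(For reference, Git can show some of that behind-the-scenes work itself. A minimal sketch using Git's built-in trace environment variables, run against any repo:)

```sh
# Print every sub-command and the packfile negotiation a pull performs
GIT_TRACE=1 GIT_TRACE_PACKET=1 git pull

# Print per-phase timings, useful for spotting which steps scale with repo size
GIT_TRACE_PERFORMANCE=1 git pull
```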


Edit2:

I found this article, stating:

> If you have large, undiffable files in your repo such as binaries, you will keep a full copy of that file in your repo every time you commit a change to the file. If many versions of these files exist in your repo, they will dramatically increase the time to checkout, branch, fetch, and clone your code.

I don't see why branching should take more than O(1) time, and I am also not sure the list is complete. (For example, what about pulling?)
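(My O(1) intuition for branching comes from a branch being nothing but a ref. A quick check, assuming the default loose-ref storage; refs can also live packed in .git/packed-refs:)

```sh
# A new branch is a single small file holding a commit hash,
# so creating it costs the same no matter how big the repo is
git branch scratch
cat .git/refs/heads/scratch   # prints one 40-hex-character commit id
```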

Gulzar
    Just as anecdotal evidence to obtain a datapoint: I work every day in a large monorepo that has 87000 files and is 8 GB in size. I'm using a high-end laptop, and none of the git commands appear to be slow or have a noticeable delay. Let me repeat: none of them that I can recall (except for `git clone` of course, but that's a given). Even `git pull` is pretty fast (takes ~20 sec to pull 20,000 files) on a network connection of 40 Mbps when working remotely through a VPN server 2500 miles away. That being said, care is taken to ensure we do not commit large binaries. – Gabriel Staples Jul 21 '19 at 16:45

2 Answers


> However it is not clear to me that git pull would be the same, because it is theoretically possible to only look at differences.

Since Git 2.23 (Q3 2019), it is not O(N) but O(n log N): see "Git fetch a branch once with a normal name, and once with capital letter".

The main issue is the log graph traversal: checking what we have and have not, or computing the forced-update status.
That is why, for large repositories, recent Git editions have introduced the commit-graph file, which caches commit ancestry so those traversals no longer have to parse every commit object.
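A minimal sketch of turning that on (commands as found in recent Git versions; fetch.writeCommitGraph needs Git 2.24+):

```sh
# Pre-compute the commit graph for all reachable commits
git commit-graph write --reachable

# Let traversals (log, merge-base, fetch negotiation) use it
git config core.commitGraph true

# Keep the commit-graph updated automatically on every fetch (Git 2.24+)
git config fetch.writeCommitGraph true
```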

> they will dramatically increase the time to checkout, branch, fetch, and clone

That is not because those operations stop being O(1).
It is because of the sheer volume of binary data to transfer or copy around when performing them.
Creating a new branch remains very fast, but switching to it, when you have to update all those binary files, can be slow simply from an I/O perspective (copying, updating, and deleting large files).
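A rough way to observe that difference (new-branch and some-older-branch are placeholder names for your own refs):

```sh
# Creating a branch only writes a ref: no working-tree I/O, effectively O(1)
time git branch new-branch

# Switching is where the cost lives: every file that differs between the
# current commit and the target commit gets rewritten in the working tree
time git checkout some-older-branch
```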

VonC

I see two major issues which you have opened for discussion. First, you are asking which Git operations get slower as the repo grows larger. The answer is that most Git operations will, but the ones which would make Git seem noticeably slower are those which involve interacting with the remote repository. It should be intuitive that if the repo bloats, then things like cloning, pulling, and pushing will take longer.

The other issue you have touched on concerns whether or not large binary files should even be committed in the first place. When you make a commit, a compressed copy of each new or changed file is stored in the repository as a blob object (unchanged files simply reuse existing blobs). Binary files have a tendency to not compress or delta well. As a result, adding large binary files can over time cause your repo to bloat. In fact, many teams will configure their remote (e.g. GitHub) to reject pushes containing large binaries.
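An illustration of that bloat (asset.bin is a made-up file name, and random bytes are the worst case, since they neither compress nor delta against the previous version):

```sh
git count-objects -vH     # note size-pack before

dd if=/dev/urandom of=asset.bin bs=1M count=50
git add asset.bin && git commit -m "add binary asset"

dd if=/dev/urandom of=asset.bin bs=1M count=50    # "change" the file
git add asset.bin && git commit -m "update binary asset"

git gc --quiet
git count-objects -vH     # size-pack grew by roughly 100 MB, not 50 MB
```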

Tim Biegeleisen
    Thanks for the answer. Please see my clarification edit. Also, notice I care more about the repo as a whole than about large binary files. For instance, why would a git pull take O(repo_size) rather than O(diff_size)? – Gulzar Jul 21 '19 at 16:08