How to determine how much data git pull will download?

Question

Suppose I'm using my cell phone's expensive data plan via hotspot, and I'm paying by the MB.

Is there a way to check how much data git will need to download if I issue git pull? (Without downloading it obviously)

Solutions specific to GitHub are also helpful. – Alec Jacobson, Jun 03 '18 at 17:06

score 3 · Answer 1 · answered Jun 03 '18 at 17:34

The short answer is "no". The long answer is "maybe", but you'll need some sort of assistant system.

Note that git pull is just git fetch followed by a second Git command. The second command runs locally and causes no data transfer—all the transfers occur during git fetch. So the question is really: how much data will git fetch transfer? The answer depends on what's in your Git repository, what's in their Git repository, what you tell your Git to ask for, and how good their compression is.

Fetching can use either a dumb protocol or a smart protocol. The primary difference is that with a dumb protocol you often get no benefit from what you have already downloaded: for instance, if the other Git has everything in a giant pack file, you must re-download the entire giant pack file. This would generally be seriously awful for your use case. Fortunately, virtually everyone uses the smart protocol instead.

The Pro Git book page I linked to above gives an outline of the smart protocol, but does not go into detail (for good reasons; I'll omit a lot of the gory detail as well). The short version, though, is that your Git and their Git will converse about commit hashes, but won't go deeper into tree and blob hashes: your Git will tell them which commit hashes your Git wants and which ones your Git has. Telling their Git that your Git has the commit with hash ID X tells their Git that your Git not only has X itself but also all of its ancestors.¹ Their Git can then deduce that you need commits that are descendants of X, but neither X nor any of its ancestors.

From this, their Git can deduce that you not only have commit X and its ancestors, but also any trees and blobs that appear in X and its ancestors. (See the --objects-edge argument to git rev-list.) This allows their Git to build a thin pack that delta-compresses each object against not only other objects within the pack, but also any objects that you already have as implied by your having commit X and its ancestors.
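To make the have/want reasoning concrete, here is a minimal Python sketch of the commit-level deduction only. It is a toy model, not Git's actual negotiation code: the graph, the single-letter commit IDs, and the function names are all hypothetical, and it ignores trees, blobs, delta compression, and shallow repositories.

```python
from typing import Dict, List, Set

# Toy commit graph: each commit maps to the list of its parents.
# The single-letter IDs are hypothetical stand-ins for real hash IDs.
Graph = Dict[str, List[str]]


def ancestors_inclusive(graph: Graph, tips: Set[str]) -> Set[str]:
    """Return the given commits plus everything reachable through parents."""
    seen: Set[str] = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(graph.get(commit, []))
    return seen


def commits_to_send(graph: Graph, wants: Set[str], haves: Set[str]) -> Set[str]:
    """Commits the sender must pack: reachable from wants, not from haves."""
    return ancestors_inclusive(graph, wants) - ancestors_inclusive(graph, haves)


# History: A <- B <- X <- C <- D, with a side branch X <- E merged into D.
server_graph: Graph = {
    "A": [], "B": ["A"], "X": ["B"],
    "C": ["X"], "E": ["X"], "D": ["C", "E"],
}

if __name__ == "__main__":
    # The receiver says "I want D, I have X".
    print(sorted(commits_to_send(server_graph, wants={"D"}, haves={"X"})))
    # prints ['C', 'D', 'E']
```

The sketch only tells you which commits would travel. It says nothing about how many bytes they occupy once the sender has delta-compressed them into a thin pack.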

Ignoring all the data transferred by the smart protocol's communication, then, the bulk of your received data will be the resulting thin pack. But the only way to find the size of the thin pack is to have the other Git build the thin pack.

It's easy enough to imagine a third-party bit of shim software that you can insert into this process: you would run git fetch, directing your Git to contact your third-party software, which you run on a middle machine, located somewhere you don't pay by the byte. This third-party software would relay the have/want conversation between the two Gits, so that their Git builds a thin pack. Then, however, instead of sending you the thin pack, your third-party shim would keep the thin pack on the middle machine. Then, over some side channel, you have the middle machine report the size to you. You decide whether to accept the cost of transferring the thin pack, or not; this determines whether the fetch finishes successfully, or fails with a complaint about the remote (really your shim software) closing the connection unexpectedly.
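Once such a shim has reported the thin pack's size, the accept-or-reject decision is simple arithmetic. The following Python sketch assumes the size arrives as a byte count, that the carrier bills in decimal megabytes, and that protocol overhead is negligible. The function names, the price, and the budget are hypothetical.

```python
BYTES_PER_MB = 1_000_000  # assumption: billing uses decimal megabytes


def transfer_cost(pack_bytes: int, price_per_mb: float) -> float:
    """Estimated cost of downloading a pack of the given size."""
    return pack_bytes / BYTES_PER_MB * price_per_mb


def should_fetch(pack_bytes: int, price_per_mb: float, budget: float) -> bool:
    """True if the estimated cost fits within the budget."""
    return transfer_cost(pack_bytes, price_per_mb) <= budget


if __name__ == "__main__":
    reported_size = 48_000_000  # hypothetical size reported by the shim
    if should_fetch(reported_size, price_per_mb=0.05, budget=1.00):
        print("accept the transfer")
    else:
        print("abort the fetch")
```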

¹This ignores shallow repositories, which complicate everything.
