
Suppose I have a big text file that changes in some parts periodically. I want to keep it synchronized with its remote version on a git server, preferably by uploading just its changed portions.

What's the default behavior of git? Does git upload the entire file each time it has changed, or does it have an option to upload just the differences?

What about non-text (binary) files?

Thanks

DummyBeginner
  • I have the same question but with Bitbucket and git. Sometimes, I need to push a 100MB ZIP file to Bitbucket, and I activated the LFS feature of Bitbucket/git. So the question here is that with every push, will it push a duplicate or the difference? – tarekahf Jun 17 '22 at 16:43

2 Answers


Does git upload the entire file each time it has changed, or does it have an option to upload just the differences?

The answer to this is actually "it depends".

The system you're describing—where we say "given existing file F, use the first part of F, then insert or delete this bit, then use another part of F" and so on—is called delta compression or delta encoding.

As Tim Biegeleisen answered, Git stores—logically, at least—a complete copy of each file with each commit (but with de-duplication, so if commits A and B both store the same copy of some file, they share a single stored copy). Git calls these stored copies objects. However, Git can do delta-compression of these objects within what Git calls pack files.
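
You can see this de-duplication from the command line. A minimal sketch (the file names are made up); both commands print the same blob ID, because identical content is stored exactly once:

echo 'same content' > a.txt
echo 'same content' > b.txt
git hash-object -w a.txt    # prints a blob ID and stores the blob
git hash-object -w b.txt    # prints the exact same ID; no second copy is stored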

When one Git needs to send internal objects to another Git, to supply commits and their files, it can either:

  • send the individual objects one by one, or
  • send a pack file containing packed versions of the objects.

Git can only use delta-compression here if you use a Git protocol that sends a pack file. You can easily tell if you're using pack files because after git push you will see:

Counting objects: ... done
Compressing objects: ... done

This compressing phase occurs while building the pack file. There's no guarantee that, when Git compressed a given object, it delta-compressed it against some version of that object the other Git already has. But that's the goal, and it usually will be the case (except for a bug introduced in Git 2.26 and fixed in Git 2.27).

Technical details, for the curious

There is a general rule about pack files that git fetch and git push explicitly violate. To really understand how this all works, though, we should first describe this general rule.

Pack files

Git has a program (and various internal functions that can be used more directly if/as needed) that builds a new pack file using just a set of raw objects, or some existing pack file(s), or both. In any case, the rule to be used here is that the new pack file should be completely self-contained. That is, any object inside pack file PF can only be delta-compressed against other objects that are also inside PF. So given a set of objects O1, O2, ..., On, the only delta-compression allowed is to compress some Oi against some Oj that appears in this same pack file.
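
The program in question is git pack-objects. As a rough sketch (the my-pack prefix is just an example name), you can feed it a list of objects on standard input and it will build a self-contained pack from them:

# List every object reachable from every ref and pack them all; this
# writes my-pack-<hash>.pack and my-pack-<hash>.idx in the current directory.
git rev-list --objects --all | git pack-objects my-pack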

At least one object is always a base object, i.e., is not compressed at all. Let's call this object Ob1. Another object can be compressed against Ob1, producing a new compressed object Oc1. Then another object can be compressed against either Ob1 directly, or against Oc1. Or, if the next object doesn't seem to compress well against Ob1 after all, it can be another base object, Ob2. Assuming the next object is compressed, let's call it Oc2. If it's compressed against Oc1, this is a delta chain: to decompress Oc2, Git will have to read Oc2, see that it links to Oc1, read Oc1, see that it links to Ob1, and retrieve Ob1. Then it can apply the Oc1 decompression rules to get the decompressed Oc1, and then the decompression rules for Oc2.
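
You can look at these chains in any existing pack with git verify-pack -v (the pack name below is a placeholder for a real one under .git/objects/pack):

# Per-object listing: object ID, type, size, size-in-pack, offset, and,
# for deltified objects, the chain depth and the base object's ID.
# A chain-length histogram is printed at the end.
git verify-pack -v .git/objects/pack/pack-<hash>.idx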

Since all these objects are in a single pack file, Git only needs to hold one file open. However, decompressing a very long chain can require a lot of jumping around in the file, to find the various objects and apply their deltas. The delta chain length is therefore limited. Git also tries to place the objects, physically within the pack file, in a way that makes reading the (single) pack file efficient, even with the implied jumping-around.

To obey all these rules, Git sometimes builds an entirely new pack file of every object in your repository, but only now and then. When building this new pack file, Git uses the previous pack file(s) as a guide that indicates which previously-packed objects compress well against which other previously-packed objects. It then only has to spend a lot of CPU time looking at new (since previous-pack-file) objects, to see which ones compress well and therefore which order it should use when building chains and so on. You can turn this off and build a pack file entirely from scratch, if some previous pack file was (by whatever chance) poorly constructed, and git gc --aggressive does this. You can also tune various sizes: see the options for git repack.
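
For example (the numbers are only illustrative; see the git repack documentation for the actual defaults), a full from-scratch repack with explicit tuning looks roughly like this:

# -a: pack everything into a single pack; -d: delete the now-redundant
# old packs; -f: recompute deltas from scratch instead of reusing the
# old ones. --window and --depth bound how hard the delta search works
# and how long delta chains may get.
git repack -a -d -f --window=250 --depth=50

# Or let git gc drive roughly the same machinery:
git gc --aggressive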

Thin packs

For git fetch and git push, the pack building code turns off the "all objects must appear in the pack" option. Instead, the delta compressor is informed that it should assume that some set of objects exist. It can therefore use any of these objects as a base-or-chain object. The assumed-to-exist objects must be findable somewhere, somehow, of course. So when your Git talks to the other Git, they talk about commits, by their hash IDs.

If you are pushing, your Git is the one that has to build a pack file; if you're fetching, this works the same with the sides swapped. Let's assume you are pushing here.

Your Git tells theirs: I have commit X. Their Git tells yours: I too have X or I don't have X. If they do have X, your Git immediately knows two things:

  1. They also have all of X's ancestors.1
  2. Therefore they have all of X's tree and blob objects, plus all of its ancestors' tree and blob objects.

Obviously, if they do have commit X, your Git need not send it. Your Git will only send descendants of X (commits Y and Z, perhaps). But by item 2 above, your Git can now build a pack file where your Git just assumes that their Git has every file that is in all the history leading up to, and including, commit X.
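
If you're curious, you can watch this conversation between the two Gits during a real push by turning on Git's packet tracing (the remote and branch names below are only examples, and the output is verbose):

# Dump the raw protocol exchange (ref advertisement, negotiation,
# pack transfer status) to standard error while pushing.
GIT_TRACE_PACKET=1 git push origin main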

So this is where the "assume objects exist" code really kicks in: if you modified files F1 and F2 in commits Y and Z, but didn't touch anything else, they don't need any of the other files—and your new F1 and F2 files can be delta-compressed against any object in commit X or any of its ancestors.
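
This selection is essentially what git rev-list --objects computes when given the boundary. A hedged sketch, where X and Z stand in for real commit hashes or branch names:

# Everything reachable from Z but not from X: the new commits, plus only
# the tree and blob objects they introduce.
git rev-list --objects Z --not X

The real push goes one step further and uses the edge information (git rev-list --objects-edge, mentioned in the comments below) so that objects sitting just outside the boundary can serve as delta bases without being included in the pack.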

The resulting pack file is called a thin pack. Having built the thin pack, your push (or their responder to your fetch) sends the thin pack across the network. They (for your push, or you for your fetch) must now "fix" this thin pack, using git index-pack --fix-thin. Fixing the thin pack is simply a matter of opening it up, finding all the delta chains and their object IDs, and finding those objects in the repository—remember, we've guaranteed that they are findable somewhere—and putting those objects into the pack, so that it's no longer thin.
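
The receiving Git runs this fix-up automatically; you would only invoke it by hand when experimenting. A minimal sketch, assuming the thin pack has landed in a file named thin.pack (the name is made up):

# Read the (possibly thin) pack from stdin, append the missing base
# objects from the local repository, and store the completed pack and
# its .idx in the object database.
git index-pack --stdin --fix-thin < thin.pack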

Multiple pack files

The fattened packs are as big as they have to be to hold all the objects they need, but no bigger than that: they don't hold every object in the repository, only the ones they need. So the old pack files remain.

After a while, a repository builds up a large number of pack files. At this point, Git decides that it's time to slim things down, re-packing multiple pack files into one single pack file that will hold everything. This allows it to delete redundant pack files entirely.2 The default for this is 50 pack files, so once you've accumulated 50 individual packs—typically via 50 fetch or push operations—git gc --auto will invoke the repack step and you'll drop back to one pack file.
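
Two quick ways to see where a repository stands (both are standard Git commands; git config prints nothing if the setting is still at its built-in default of 50):

# Shows, among other things, a "packs:" line with the current pack count.
git count-objects -v

# The threshold at which "git gc --auto" consolidates packs.
git config gc.autoPackLimit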

Note that this repacking has no effect on the thin packs: those depend only on the existence of the objects of interest, and this existence is implicit in the fact that a Git has a commit. Having a commit implies having all of its ancestors (though see footnote 1 again), so once we see that the other Git has commit X we're done with this part of the computation, and can build our thin pack accordingly.


1Shallow clones violate this "all ancestors" rule and complicate things, but we don't really need to go into the details here.

2In some situations it's desirable to keep an old pack; to do so, you just create a file with the pack's name ending in .keep. This is mostly for those setups where you're sharing a --reference repository.

torek
  • According to your answer, would you please elaborate on Git's behavior in these 2 scenarios? **1.** Does Git pack files only before pushing? **2.** Assuming we have a single big text file and want to push its commit to a remote Git, does Git pack a single file too, or must there be several objects for the packing process to occur? – DummyBeginner Aug 17 '20 at 10:55
  • As regards delta-compression in pack files, how does this affect the size of the files transferred to the remote server? My impression is that Git uses delta-compression whenever it updates the local pack file, but at push time it sends the whole new pack file; then, on the remote server, Git uncompresses the received pack file and updates the old objects there (because there are no pack files on the server: pack files are created at the `pushing` stage, and on the server we neither pull nor push). (continues...) – DummyBeginner Aug 17 '20 at 10:57
  • .... In conclusion, each time the client Git pushes a new commit to the remote server, it sends a pack file of roughly the same size (because it contains the whole file, not just the differences, and delta-compression only took place when the pack file was updated locally). – DummyBeginner Aug 17 '20 at 10:58
  • @DummyBeginner: this ... is a little complicated. I'll update the answer with a technical-details section. :-) – torek Aug 17 '20 at 16:36
  • Thanks a lot, I love the way you scrutinize the process. How could I give a bonus to this answer? Getting back to your updated answer, is this scenario correct? In a single-file repository, when I push to a remote Git I'm sending a thin pack (except for the first push) that contains only small delta-compressed objects, and the receiving Git **automatically** applies `git index-pack --fix-thin` to the received pack, then merges its differences into the base object on the remote Git. If that's the case, then we could say that except for the first push, **Git doesn't upload an entire file to the other Git.** – DummyBeginner Aug 18 '20 at 12:48
  • What about Tim's answer? Isn't it the case that: "`If Git commits a file, it will generally commit the entire file, and it will compress the file first. Git does not work by committing diffs made to a file.`"? Since, as per what was discussed, Git actually sends just the **diffs** to the remote Git via thin packs. – DummyBeginner Aug 18 '20 at 12:50
  • @DummyBeginner: There's a minor technical difference between a "diff" (as in `git diff`) and a "delta" (as in a Git pack-file): the deltas are a binary encoding, and work with any object, while diffs depend on text—the diff engine breaks a file into individual lines and matches up the lines. So by sending a thin pack, Git can send a delta, but it's not a diff (in `git diff` terms). Functionally, though, there's no real difference between a user-readable diff and a binary delta: both consist of instructions to keep some old bytes, or delete some old bytes, or insert some new ones. – torek Aug 18 '20 at 17:13
  • Meanwhile, Git has two storage formats for what it calls *objects* ("blob" objects hold file *contents*, "tree" objects hold file *names*, and commit objects hold commit data). One format is the *loose object*, which is just zlib-compressed with a header. The other format is the *packed object*, which may be delta-compressed as well. Some push/fetch protocols work with loose objects, in which case, you always just get the zlib-compressed data for each object. It's when you have pack files that things get particularly complicated. – torek Aug 18 '20 at 17:22
  • " it specifically did use delta-compression against some version of the object that the other Git already has. But that's the goal and usually will be the case (except for a bug introduced in Git 2.26 and fixed in Git 2.27": wait... which one? Which bug? The protocol v2 fetch reverted (https://stackoverflow.com/a/60253725/6309, https://stackoverflow.com/a/61565821/6309)? The partial fetch issue (https://stackoverflow.com/a/52526704/6309)? The object walk with object filter (https://stackoverflow.com/a/60512453/6309)? – VonC Aug 19 '20 at 20:20
  • @VonC: I'm not 100% sure: it was a v2-only bug, so reverting from v2 to v0 worked around it, and then one or more of the bugs you noted and maybe others. I'm in the middle of moving and am having trouble keeping up with the mailing list... – torek Aug 19 '20 at 20:24
  • I have the same question but with Bitbucket and git. Sometimes, I need to push a 100MB ZIP file to Bitbucket, and I activated the LFS feature of Bitbucket/git. So the question here is that with every push, will it push a duplicate or the difference? – tarekahf Jun 17 '22 at 16:44
  • @tarekahf: I'm not sure what the large-file-object code uses. Git would push a deltified object, provided it can find something to delta against, as part of the pack file, assuming smart push protocol. When and where the "provided" is met is much harder to quantify, but if you want to experiment with the source, the key is to build a "thin pack" the way `git push` does. This involves using `git rev-list` carefully with `--objects-edge` or `--objects-edge-aggressive` (q.v.). – torek Jun 19 '22 at 07:00
  • @torek thanks a lot for the reply. I don't follow what you mentioned 100%. I need to do more research to understand your reply. I think I will submit the question to Bitbucket support. – tarekahf Jun 19 '22 at 16:02

If Git commits a file, it will generally commit the entire file, and it will compress the file first. Git does not work by committing diffs made to a file. Git is actually quite suitable for versioning large text files, as these files compress very well, and therefore will leave a commit trail which takes up minimal space.

On the other hand, binary files do not work very well with Git. The reason is that they behave the opposite of text files with regard to compression: binary files do not compress well, so versioning large binary files in your Git repository can quickly cause the repo to bloat.
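
A quick way to convince yourself of the difference, using gzip as a rough stand-in for the zlib compression Git applies internally (the file names are only examples):

# Repetitive text shrinks dramatically:
gzip -c big-log.txt | wc -c

# An already-compressed binary such as a JPEG or ZIP barely shrinks
# at all, and can even grow slightly:
gzip -c photo.jpg | wc -c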

Per the comment/question by @eftshift0 below, I would also like to add a clarification regarding how Git differs from other version control systems. With more classic version control systems such as Perforce, all versions of all files, including binary files, live on some remote storage system (what Perforce calls the "depot"). When you work in a local branch in Perforce, you really just have a shallow copy of one version of each file on your local system. It doesn't really matter from a storage-space point of view whether the binary file is compressed, since there is only one snapshot, and besides, you probably want to work with the uncompressed version anyway.

In contrast, in Git's model, when you clone a repository you bring every version of every file into your local system. In the case of text source files (Java, C, etc.), Git partially gets around this problem by compressing them (with zlib). Most source code is very repetitive text, so this compression works well and decreases the size by a substantial amount. In the case of binary files, however, the compression does not work very well. As a result, if you maintain many versions of binary files in your Git history, your repository can easily bloat, and when you clone such a repository you end up pulling in essentially the full-size version of every such binary file. For obvious reasons, this does not scale well, and as a result it is generally not recommended to version large binary files in Git.

Tim Biegeleisen
  • In addition, when you set compression in Git, the system packs files with the same path against each other, so even if you have small differences in a binary file, the used space will not grow too quickly. – Leszek Mazur Jun 30 '20 at 06:00
  • Does this business of committing the entire file have anything to do with the checksum concept in Git? I mean, Git detects changed files by checking their hash, so to have a correct hash it must upload the entire file to the remote server, and using just the diffs would ruin the hashes? – DummyBeginner Aug 17 '20 at 11:05
  • @Tim `Binary files do not compress well, and therefore versioning large binary files in your Git repository can quickly cause that repo to bloat.`... but do other VCSs do a better job at this? – eftshift0 Aug 19 '20 at 14:27
  • @eftshift0 You might be missing an important distinction between Git and most other VCS tools (notably excluding Mercurial, which in many ways is similar to Git). Please read the added last paragraph of my answer for more information. – Tim Biegeleisen Aug 19 '20 at 15:38
  • Oh, I get the point... so it's about the size of the _local_ repo / working copy... but if you go onto the server, other VCSs will have just as much bloat in their repo as a local Git repo would if you had that many binary files. Is that right? – eftshift0 Aug 19 '20 at 15:42
  • Yes, that's right, but it's much easier to manage a large storage area somewhere like the cloud than it is on your laptop. And this was certainly true 10 years ago when Git was created. – Tim Biegeleisen Aug 19 '20 at 15:44