
We know that Git doesn't store diff information in commit objects: one commit is sufficient to recreate the codebase at that point in time.

Below is my understanding; please correct me if I'm wrong:

Let's say we have a large text file words.txt that contains thousands of lines of words. I append only one single word to this file. So Git has to store two files internally: the original words.txt for one commit, and the appended words.txt for the next commit. Considering the latter only has one line of difference compared to the former, isn't this obviously inefficient, and doesn't it cost more disk space?
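
Here is a small sketch of that setup, assuming git is on PATH and that it runs in an empty throwaway directory (the file contents and commit messages are made up just for illustration). After the second commit there are two distinct blob IDs, and git cat-file -s reports nearly the same size for each, i.e. two full copies:

    # Sketch: commit a large words.txt, append one word, commit again,
    # then look at what Git stored. Run in an empty throwaway directory.
    import subprocess

    def git(*args):
        return subprocess.run(["git", *args], check=True,
                              capture_output=True, text=True).stdout.strip()

    git("init", "-q")
    git("config", "user.email", "you@example.com")   # in case no global identity
    git("config", "user.name", "you")                # is configured

    with open("words.txt", "w") as f:
        f.write("word\n" * 10000)                    # stand-in for the large file
    git("add", "words.txt")
    git("commit", "-q", "-m", "first")
    blob1 = git("rev-parse", "HEAD:words.txt")

    with open("words.txt", "a") as f:
        f.write("extra\n")                           # append one single word
    git("add", "words.txt")
    git("commit", "-q", "-m", "second")
    blob2 = git("rev-parse", "HEAD:words.txt")

    # Two different blob IDs, each holding the full file contents
    # (zlib-compressed while loose), not a diff:
    print(blob1, git("cat-file", "-s", blob1))
    print(blob2, git("cat-file", "-s", blob2))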

Maik Lowrey
  • When git creates packfiles, it does delta-compression, but for loose objects on disk, it doesn't. – Lasse V. Karlsen Dec 01 '21 at 06:25
  • The very narrow example you're describing is indeed a space cost. But overall, the storage system is excellent, considering that a huge majority of blobs stay unchanged from one commit to another. If you stored deltas only, we could come up with a reverse example where the file is tiny but the delta is a huge thing to store. (Example: you delete all lines from your big file but the first one.) – Romain Valeri Dec 01 '21 at 07:13
  • That reminds me of 2011: https://stackoverflow.com/a/8198276/6309 – VonC Dec 01 '21 at 07:43

1 Answer


Git does indeed store full copies of each object. But Git also compresses each object in one of several ways:

  • All objects get de-duplicated (via hashing). So committing a large file N > 1 times produces just one copy of the large file. (The hashing scheme is sketched after this list.)

  • Loose objects, which are those stored in .git/objects/ab/cdef0123... and the like, are zlib-compressed. This is pretty well-hidden in Git, although opening and reading one of these loose objects is easy enough, and running it through a zlib decompressor reveals the secret (also sketched after this list).

  • Packed objects, which are those stored in .git/objects/pack/, are delta-encoded. This is not the same as a diff, and the algorithm that chooses which objects to delta-compress against which other objects is considerably better-hidden. There are some technical documents covering the packing heuristics and format (and a separate article on multi-pack index files, which are somewhat newer).
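
Here's a minimal sketch tying the first two bullets together, assuming a classic SHA-1 repository and that it runs at the top of a repo where words.txt has been committed but not yet repacked (otherwise the loose object file won't exist). It computes the blob's object ID the way Git does, then reads the matching loose object back through zlib:

    # Sketch: blob ID = SHA-1 of "blob <size>\0" + content; the loose object
    # file is just that byte string, zlib-compressed.
    import hashlib
    import os
    import zlib

    def blob_id(data: bytes) -> str:
        # Same scheme as `git hash-object --stdin` for SHA-1 repositories.
        return hashlib.sha1(b"blob %d\0" % len(data) + data).hexdigest()

    with open("words.txt", "rb") as f:
        content = f.read()

    oid = blob_id(content)            # identical content -> identical ID,
    print("object id:", oid)          # which is why de-duplication is free

    loose = os.path.join(".git", "objects", oid[:2], oid[2:])
    with open(loose, "rb") as f:
        raw = zlib.decompress(f.read())      # loose objects are zlib-compressed

    header, _, body = raw.partition(b"\0")
    print(header.decode())            # e.g. "blob 54321"
    print(body == content)            # the full file content, unchanged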

In your example, we'll have two loose objects for some time, each of which is moderately large (zlib compression of human language texts is efficient but not a miracle). But eventually Git will pack the two loose objects into a pack file, and here, with any luck at all, Git will store just the larger file in the pack file, with the smaller earlier variant encoded as "take the larger object and remove some bytes".1
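
A sketch of how you could watch that happen, assuming it runs in the same throwaway repository as the earlier sketch: force a repack, then ask git verify-pack -v how each object was stored. Delta-encoded entries show an extra depth column and the object they were encoded against:

    # Sketch: pack the loose objects, then inspect how they were stored.
    import glob
    import subprocess

    subprocess.run(["git", "gc", "-q"], check=True)   # packs the loose objects

    for idx in glob.glob(".git/objects/pack/pack-*.idx"):
        out = subprocess.run(["git", "verify-pack", "-v", idx],
                             check=True, capture_output=True, text=True).stdout
        # Each line: <oid> <type> <size> <size-in-pack> <offset> [<depth> <base-oid>]
        for line in out.splitlines():
            if " blob " in line:
                print(line)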

It's still a good idea to avoid storing large incompressible binary data in Git (e.g., rather than a gzipped tarball, store all the files themselves). The packing and compression system deals poorly with large incompressible binary data.


1Note that since pack files are normally updated incrementally, how objects get delta-encoded can depend on the order in which the various packs were generated. You will in theory get better packing with relatively infrequent repack operations, since they'll have a "later" view of the file, at least in this particular case.

torek