
Git's blob object file format is blob <size string>\0<data>. The blob-identifying SHA-1 hash is calculated not from the blob contents alone, but from this header-prefixed data.
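For illustration, here is a minimal Python sketch of that computation; the helper name git_blob_sha1 is mine, but the digest it produces should agree with what git hash-object reports for the same bytes:

    import hashlib

    def git_blob_sha1(data: bytes) -> str:
        # Git hashes "blob <size>\0" followed by the raw contents,
        # not the contents alone.
        header = b"blob %d\x00" % len(data)
        return hashlib.sha1(header + data).hexdigest()

    # Should match: printf 'hello\n' | git hash-object --stdin
    print(git_blob_sha1(b"hello\n"))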

As a purist, I do not like that design: it mixes a universal property of the data (its SHA-1 hash) with a Git-specific header.

Another advantage of pure-data blob storage is that files could be added to the index using copy-on-write instead of copying the whole file. The space needed for newly added files could be roughly halved (one shared copy instead of a working-tree copy plus an object-store copy), and some operations could become faster.

So, why did Git developers choose to use the header-based format instead of the pure data format?

P.S. AFAIK in the early days of Git the SHA-1 hash was based on the compressed data.

Ark-kun
    I can only really guess as to "why" but I suspect it's so that git can read the first block of the object, decompress, and see how much memory to malloc() for the full decompressed object. – torek Dec 22 '15 at 22:28

1 Answer


AFAIK in the early days of Git the SHA-1 hash was based on the compressed data.

Yes, and that led to all kinds of "optimizations" like commit 65c2e0c, git 0.99, June 2005:

Find size of SHA1 object without inflating everything.

But that new format, illustrated in "How does git compute file hashes?", can be traced back to:

Each time, the length of the data is needed to do anything with the data itself.
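That header is also what makes it cheap to recover an object's type and size without inflating the whole thing: decompress just the start of the loose object file and stop at the NUL. A rough Python sketch, assuming an unpacked (loose) object under .git/objects/; the helper name loose_object_header is mine:

    import zlib

    def loose_object_header(path):
        # Decompress only the first bytes of a loose object file
        # (.git/objects/xx/yyyy...), enough to read "<type> <size>\0",
        # instead of inflating the entire object.
        d = zlib.decompressobj()
        with open(path, "rb") as f:
            head = d.decompress(f.read(512), 64)
        header = head.split(b"\x00", 1)[0]
        obj_type, size = header.split(b" ")
        return obj_type.decode(), int(size)

On the command line, git cat-file -s <sha1> exposes the same size information.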

VonC