
Git's blob object file format is blob <size string>\0<data>. The blob-identifying SHA-1 hash is calculated not from the blob contents alone, but from this header-prefixed data.
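For illustration, here is a minimal Python sketch of that computation; the helper name git_blob_sha1 is mine, but the digest it produces should agree with what git hash-object reports for the same bytes:

    import hashlib

    def git_blob_sha1(data: bytes) -> str:
        # Git hashes "blob <size>\0" followed by the raw contents,
        # not the contents alone.
        header = b"blob %d\x00" % len(data)
        return hashlib.sha1(header + data).hexdigest()

    # Should match: printf 'hello\n' | git hash-object --stdin
    print(git_blob_sha1(b"hello\n"))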

As a purist, I do not like that design: it mixes a universal property of the data (its SHA-1 hash) with a Git-specific header.

Another advantage of pure-data blob storage is that files could be added to the index using copy-on-write instead of copying the whole file. The space needed for newly added files could be roughly halved (one shared copy instead of a working-tree copy plus an object-store copy), and some operations could become faster.

So, why did Git developers choose to use the header-based format instead of the pure data format?

P.S. AFAIK in the early days of Git the SHA-1 hash was based on the compressed data.

Ark-kun
    I can only really guess as to "why" but I suspect it's so that git can read the first block of the object, decompress, and see how much memory to malloc() for the full decompressed object. – torek Dec 22 '15 at 22:28

1 Answer


AFAIK in the early days of Git the SHA-1 hash was based on the compressed data.

Yes, and that led to all kinds of "optimizations" like commit 65c2e0c, git 0.99, June 2005:

Find size of SHA1 object without inflating everything.

But that new format, illustrated in "How does git compute file hashes?", can be traced back to:

Each time, the length of the data is needed to do anything with the data itself.
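That header is also what makes it cheap to recover an object's type and size without inflating the whole thing: decompress just the start of the loose object file and stop at the NUL. A rough Python sketch, assuming an unpacked (loose) object under .git/objects/; the helper name loose_object_header is mine:

    import zlib

    def loose_object_header(path):
        # Decompress only the first bytes of a loose object file
        # (.git/objects/xx/yyyy...), enough to read "<type> <size>\0",
        # instead of inflating the entire object.
        d = zlib.decompressobj()
        with open(path, "rb") as f:
            head = d.decompress(f.read(512), 64)
        header = head.split(b"\x00", 1)[0]
        obj_type, size = header.split(b" ")
        return obj_type.decode(), int(size)

On the command line, git cat-file -s <sha1> exposes the same size information.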

VonC