Why does git make new blobs between 'git add' commands?

Question

So I've recently discovered the tool git cat-file and I've been playing around with it. I know that git uses blobs to store the actual content. But why does it seem to create a new blob every time I git add a change to a file, i.e. as opposed to editing the existing blob, or making a new blob and deleting the old one?

e.g.

touch hello.txt
printf 'hello' > hello.txt        # hello.txt now contains 'hello'
git add hello.txt                 # creates a blob abc123 containing: 'hello'

printf 'hello world' > hello.txt  # change hello.txt to 'hello world'
git add hello.txt                 # creates a blob cba321 containing: 'hello world'

git commit                        # creates a commit with tree pointing at blob cba321

So the purpose of the blob containing my intermediate, staged change i.e. blob abc123 containing "hello" is not obvious.

In terms of commits, hello.txt went from "" directly to "hello world", and I can't even get back my intermediate change abc123 without digging around in git blobs.

score 3 · Answer 1 · answered Mar 29 '19 at 22:13

But why does it seem to create a new blob every time I git add a change to a file, i.e. as opposed to editing the existing blob, or making a new blob and deleting the old one?

No blob can ever be changed. This is the same as the rule about commits: no commit can ever be changed.

The reason is that the hash ID of each Git object (blobs and commits are two of the four types of internal Git object) is just a cryptographic checksum of the contents stored as that object. In the case of a file ("blob"), the data that gets hashed are the five ASCII characters b, l, o, b, space, then the size of the blob in bytes written as a decimal number in ASCII, then an ASCII NUL byte, and then the stored data. For instance hello is stored as what Python might represent as b"blob 5\0hello". Changing even one byte of the data would change the hash ID, so an "edited" blob is by definition a different object.

You can calculate this hash using a SHA-1 hasher, or by using git hash-object:

$ echo -n hello | git hash-object --stdin
b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0

or:

$ python3
[snip]
>>> import hashlib
>>> hashlib.sha1(b"blob 5\0hello").hexdigest()
'b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0'

So any blob with hash ID b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0 is necessarily the file hello, or—if it's not—you can't store a file containing hello (without a newline) in this Git repository. Finding a doppelgänger for some file (an evil twin that prevents storage of some other file) is nontrivial: see How does the newly found SHA-1 collision affect Git? for details.

So, when you git add a file, Git creates a new blob, or re-uses an existing blob, depending on whether that file's data already exist as a blob in the repository. If you then git commit, Git saves the contents permanently, associated with the new commit object. If you never commit that blob and no other commit or other entity refers to it either, Git eventually expires the blob through its garbage collection process (see git gc).
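The create-or-reuse behaviour can be sketched with a toy content-addressed store. This is only an illustration, not Git's actual code: ToyObjectStore and its add method are hypothetical names, and the store lives in memory rather than under .git/objects. It reuses the blob hashing shown earlier.

```python
import hashlib
from typing import Dict, Tuple


def blob_id(data: bytes) -> str:
    """Return the SHA-1 object ID for a blob holding `data`."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


class ToyObjectStore:
    """Hypothetical in-memory stand-in for the object database."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def add(self, data: bytes) -> Tuple[str, bool]:
        """Store `data`; return (object ID, True if a new object was created)."""
        oid = blob_id(data)
        created = oid not in self.objects
        if created:
            # Written once under its own hash and never modified afterwards.
            self.objects[oid] = data
        return oid, created


store = ToyObjectStore()
first, created_first = store.add(b"hello")           # first "git add"
second, created_second = store.add(b"hello world")   # changed content: new blob
again, created_again = store.add(b"hello")           # same content: blob reused

print(first, created_first)
print(second, created_second)
print(again, created_again)
```

Because the key is derived from the content, there is nothing to "edit in place": different content simply lands under a different key, and the old object stays until something like garbage collection removes it.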

(Note that these Git objects, when first written as individual "loose" objects, are also zlib-deflated; this is the penultimate storage form for all four Git object types. However, after some time, existing objects may be packed into a pack file, where they're delta-compressed against other objects before being zlib-deflated. The pack file is the ultimate storage form. Packed objects can be unpacked if necessary, though in normal operation Git just extracts the decompressed object data on the fly from the pack file while expanding the delta compression.)
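As a rough sketch of the loose (not yet packed) form, the following deflates the same header-plus-data bytes with zlib and derives the usual path under .git/objects, where the first two hex digits of the hash ID name the directory. The helper names loose_object and read_loose_object are made up for this example; pack files and delta compression are not modelled.

```python
import hashlib
import zlib


def loose_object(data: bytes):
    """Return (path relative to .git, deflated bytes) for a blob holding `data`."""
    raw = b"blob " + str(len(data)).encode("ascii") + b"\0" + data
    oid = hashlib.sha1(raw).hexdigest()
    # First two hex digits become the directory, the remaining 38 the file name.
    path = "objects/" + oid[:2] + "/" + oid[2:]
    return path, zlib.compress(raw)


def read_loose_object(compressed: bytes) -> bytes:
    """Inflate a loose blob, check its header, and return the stored data."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\0")
    kind, size = header.split(b" ")
    if kind != b"blob" or int(size) != len(body):
        raise ValueError("not a well-formed blob")
    return body


path, compressed = loose_object(b"hello")
print(path)
print(read_loose_object(compressed))
```

Note that the hash is computed over the uncompressed bytes, so the compression settings do not affect the object ID.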

(For completeness, the other two Git object types are tree and annotated tag. The tree objects store the file names, mapping from name to blob hash ID, along with the executable bit for the file. A commit object refers by hash ID to the tree that represents the snapshot. An annotated tag object is a special case data structure that contains the hash ID of another Git object, plus a data payload; in this data payload, you can store a GPG signature or some other digital signature, along with anything else you like. You can then point a lightweight tag to the annotated tag object, to get an annotated tag.)
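To make the name-to-blob mapping concrete, here is a minimal sketch of how a tree's payload can be built and hashed. It assumes a flat tree containing only regular files (subdirectories, symlinks and Git's special sort order for directory names are left out), and object_id and tree_payload are illustrative helper names, not Git functions. Each entry is the mode in ASCII, a space, the file name, a NUL byte, and then the 20 raw bytes of the blob's hash ID.

```python
import hashlib

# Blob ID of the five bytes "hello", as computed with git hash-object earlier.
HELLO_BLOB = "b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0"


def object_id(kind: str, payload: bytes) -> str:
    """Hash the bytes: kind, space, decimal size, NUL, payload."""
    header = kind.encode("ascii") + b" " + str(len(payload)).encode("ascii") + b"\0"
    return hashlib.sha1(header + payload).hexdigest()


def tree_payload(entries) -> bytes:
    """Build a tree payload from (mode, name, blob hex ID) tuples.

    Assumption: regular files only, so sorting by name bytes is enough.
    """
    out = b""
    for mode, name, oid in sorted(entries, key=lambda e: e[1].encode("utf-8")):
        out += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        out += bytes.fromhex(oid)  # 20 raw bytes, not 40 hex characters
    return out


plain = tree_payload([("100644", "hello.txt", HELLO_BLOB)])       # regular file
executable = tree_payload([("100755", "hello.txt", HELLO_BLOB)])  # executable bit set

plain_tree_id = object_id("tree", plain)
executable_tree_id = object_id("tree", executable)
print(plain_tree_id)
print(executable_tree_id)
```

The blob ID is the same in both trees; only the tree ID changes when the file name or the executable bit changes, because those live in the tree and not in the blob.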

Romain Valeri · Answer 2 · 2019-03-29T22:18:08.237

git add indeed creates blobs, since the index (or staging area, it has many names...) has the very purpose of preparing the snapshot which will constitute the next commit.

Also, you talk about editing or deleting a blob, but that would be contrary to the principles of the tool, since a snapshot has to be consistently reproducible, with all the blobs it references untouched. In a way, you never modify anything, you just add more things and relations.

And to answer your last point, no, you can't "even" get back to a state you did not consider worth saving.
