But why does it seem to create a new blob every time I git add a change to a file, i.e. as opposed to editing the existing blob, or making a new blob and deleting the old one?
No blob can ever be changed. This is the same as the rule about commits: no commit can ever be changed.
The reason is that the hash ID of each Git object (blobs and commits are two of the four types of internal Git object) is just a cryptographic checksum of the contents stored as that object. In the case of a file ("blob"), the actual contents are the five ASCII characters b, l, o, b, space, then the size of the blob in decimal, also stored in ASCII, then an ASCII NUL byte, and then the stored data. For instance, hello is stored as what Python might represent as b"blob 5\0hello".
You can calculate this hash using a SHA-1 hasher, or by using git hash-object:
$ echo -n hello | git hash-object --stdin
b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0
or:
$ python3
[snip]
>>> import hashlib
>>> hashlib.sha1(b"blob 5\0hello").hexdigest()
'b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0'
So any blob with hash ID b6fc4c620b67d95f953a5c1c1230aaab5db5a1b0 is necessarily the file hello, or, if it's not, you can't store a file containing hello (without a newline) in this Git repository. Finding a doppelgänger for some file (an evil twin that prevents storage of some other file) is nontrivial: see "How does the newly found SHA-1 collision affect Git?" for details.
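The header-plus-data rule generalizes to any file contents. Here is a minimal sketch, assuming a repository that uses the default SHA-1 object format; the function name git_blob_hash is made up for illustration and is not part of Git:

```python
import hashlib


def git_blob_hash(data: bytes) -> str:
    """Return the hash ID a SHA-1 Git repository would give a blob holding `data`.

    Hypothetical helper: it rebuilds the "blob <size>\\0" header described
    above and hashes header + data together.
    """
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()


if __name__ == "__main__":
    # Same contents as the `echo -n hello` example above.
    print(git_blob_hash(b"hello"))
    # One extra byte (the newline) changes the size field and the data,
    # so the hash ID is entirely different.
    print(git_blob_hash(b"hello\n"))
```

The first call prints the same b6fc4c62... ID shown earlier. Because the ID is derived from the contents, "editing a blob in place" has no meaning: different contents simply are a different blob.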
So, when you git add a file, Git creates a new blob, or re-uses an existing blob, depending on whether that file's data already exist as a blob in the repository. If you then git commit, the new commit refers to that blob (through its tree), so Git keeps the contents for as long as the commit itself exists. If you never commit that blob and no other commit or other entity refers to it either, Git eventually expires the blob through its garbage collection process (see git gc).
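The "create a new blob or re-use an existing one" behavior falls out of keying objects by their hash. The following is a toy model only, not how Git is implemented: ToyBlobStore and its add method are hypothetical names, and the store is just an in-memory dict:

```python
import hashlib


class ToyBlobStore:
    """Toy content-addressed store (an illustration, not Git's real code)."""

    def __init__(self):
        self.objects = {}  # hash ID -> stored bytes ("blob <size>\0<data>")

    def add(self, data: bytes) -> str:
        payload = b"blob " + str(len(data)).encode("ascii") + b"\0" + data
        hash_id = hashlib.sha1(payload).hexdigest()
        # If the key already exists, the existing object is re-used;
        # nothing is ever overwritten or edited.
        self.objects.setdefault(hash_id, payload)
        return hash_id


store = ToyBlobStore()
first = store.add(b"hello")          # new blob
again = store.add(b"hello")          # same contents: same ID, no new object
edited = store.add(b"hello, world")  # changed contents: a second, separate blob

print(first == again, first == edited, len(store.objects))
```

In this model the old blob is still present after the "edit". Removing unreferenced objects is a separate step, which in real Git is the job of garbage collection.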
(Note that these Git objects are also zlib-deflated before being written out, and this form is the penultimate storage form for all four Git object types. However, after some time, existing objects may be packed into a pack file, where they're delta-compressed against other objects before being zlib-deflated. The pack file is the ultimate storage form. Packed objects can be unpacked if necessary, though in normal operation Git just extracts the object data on the fly from the pack file, undoing the delta compression as it goes.)
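For the non-packed ("loose") form, the deflate step can be sketched as below. This assumes the usual on-disk layout, where the first two hex digits of the hash ID name a directory under .git/objects and the rest name the file. The helper names loose_object and read_loose are hypothetical, nothing is written to disk, and pack files are not modeled at all:

```python
import hashlib
import zlib


def loose_object(data: bytes):
    """Return (relative path, deflated bytes) for a blob holding `data`."""
    payload = b"blob " + str(len(data)).encode("ascii") + b"\0" + data
    # The hash is taken over the uncompressed payload, not the deflated bytes.
    hash_id = hashlib.sha1(payload).hexdigest()
    path = f".git/objects/{hash_id[:2]}/{hash_id[2:]}"
    return path, zlib.compress(payload)


def read_loose(compressed: bytes):
    """Inflate a loose object and split it into (type, data)."""
    payload = zlib.decompress(compressed)
    header, _, body = payload.partition(b"\0")
    kind, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size field does not match the stored data")
    return kind.decode("ascii"), body


path, compressed = loose_object(b"hello")
print(path)
print(read_loose(compressed))
```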
(For completeness, the other two Git object types are tree and annotated tag. The tree objects store the file names, mapping from name to blob hash ID, along with the executable bit for the file. A commit object refers by hash ID to the tree that represents the snapshot. An annotated tag object is a special case data structure that contains the hash ID of another Git object, plus a data payload; in this data payload, you can store a GPG signature or some other digital signature, along with anything else you like. You can then point a lightweight tag to the annotated tag object, to get an annotated tag.)
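To make the name-to-blob mapping concrete, here is a rough sketch of how a tree's hash ID could be computed for a flat directory. It assumes each entry is stored as the mode in ASCII, a space, the file name, a NUL byte, and then the 20 raw bytes of the blob's SHA-1, with entries ordered by name. Subdirectories, which Git sorts slightly differently, are deliberately left out, and the helper names and file names are hypothetical:

```python
import hashlib


def git_object_hash(kind: str, body: bytes) -> str:
    """Hash "<kind> <size>\\0<body>" the same way the blob example does."""
    header = kind.encode("ascii") + b" " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries):
    """Build a tree body from (mode, name, blob hash ID) tuples.

    Simplification: regular files only, sorted by the bytes of the name.
    """
    body = b""
    for mode, name, hash_hex in sorted(entries, key=lambda e: e[1].encode("utf-8")):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        body += bytes.fromhex(hash_hex)  # raw 20-byte hash, not hex text
    return body


blob_id = git_object_hash("blob", b"hello")
body = tree_body([
    ("100755", "run.sh", blob_id),        # executable bit set
    ("100644", "greeting.txt", blob_id),  # ordinary file, same blob re-used
])
tree_id = git_object_hash("tree", body)
print(tree_id)
```

Two names pointing at the same blob ID is exactly how identical files share storage: the tree holds the names and modes, while the contents live once in the blob.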