Git does indeed store full copies of each object. But Git also compresses each object in one of several ways:
All objects get de-duplicated (via hashing). So committing a large file N > 1 times produces just one copy of the large file.
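To see why the de-duplication simply falls out of the design: an object's name is a hash of its content, so the same bytes always produce the same object ID and therefore the same single entry in the object database. Here is a small sketch of the idea in Python (this mirrors how blob IDs are computed in a SHA-1 repository):

```
import hashlib

def blob_id(data: bytes) -> str:
    # Git names a blob by hashing "blob <size>\0" followed by the content.
    return hashlib.sha1(b"blob %d\x00" % len(data) + data).hexdigest()

big_file = b"lots of text..." * 100_000   # stand-in for a large file

# Hashing the same content twice yields the same object ID, so the
# repository only ever stores one copy of it.
print(blob_id(big_file))
print(blob_id(big_file))
```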
Loose objects, which are those stored in .git/objects/ab/cdef0123... and the like, are zlib-compressed. This is pretty well hidden in Git, although opening and reading one of these loose objects is easy enough, and running it through a zlib decompressor reveals the secret.
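For example, here is one way to do that decompression yourself; the object path below is just a placeholder, to be replaced with a real path from your repository:

```
import zlib

# Hypothetical loose-object path: a 2-character directory plus the
# remaining hex digits of the object ID.
path = ".git/objects/ab/cdef0123456789abcdef0123456789abcdef01"

with open(path, "rb") as f:
    raw = zlib.decompress(f.read())

# The decompressed data is "<type> <size>\0" followed by the object payload.
header, _, body = raw.partition(b"\x00")
print(header)       # e.g. b'blob 1234'
print(body[:80])    # the first bytes of the stored content
```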
Packed objects, which are those stored in .git/objects/pack/, are delta-encoded. This is not the same as a diff, and the algorithm that chooses which objects to delta-compress against which other objects is considerably better hidden. There are some technical documents covering the packing heuristics and the pack format (and a separate article on multi-pack index files, which are somewhat newer).
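To make the "not the same as a diff" point a little more concrete: a delta is essentially a list of copy-from-base and insert-literal instructions. The following is only a toy sketch of that idea in Python, not Git's actual on-disk pack delta format:

```
# The target object is rebuilt from "copy bytes from the base" and
# "insert literal bytes" instructions, so a near-identical object
# costs very little extra space.
base   = b"A long chapter of mostly unchanged text. " * 1000
target = base + b"One new closing sentence."

delta = [
    ("copy", 0, len(base)),                  # reuse the entire base object
    ("insert", b"One new closing sentence."),
]

def apply_delta(base: bytes, ops) -> bytes:
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

assert apply_delta(base, delta) == target
print(len(delta[1][1]), "literal bytes stored instead of", len(target))
```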
In your example, we'll have two loose objects for some time, each of which is moderately large (zlib compression of human-language text is efficient, but not a miracle). But eventually Git will pack the two loose objects into a pack file, and here, with any luck at all, Git will store just the larger file in the pack file, with the smaller earlier variant encoded as "take the larger object and remove some bytes".[1]
It's still a good idea to avoid storing large incompressible binary data in Git (e.g., rather than a gzipped tarball, store all the files themselves). The packing and compressing system deals poorly with large incompressible binary data.
[1] Note that since packing is normally done incrementally, the packed objects can depend on the order in which the various packs were generated. In theory you will get better packing with relatively infrequent repack operations, since they'll have a "later" view of the file, at least in this particular case.