git content tracking and unit of contents

Question

Git is said to track content rather than files, and my confusion comes from the following scenario: if I add two files A and B to one git repository in one commit, and A and B have overlapping (but different) contents, will git compare the two new files A and B? For revisions of A or B, I guess only incremental diffs are stored, but for two new files in one commit, can git detect the common contents? If it is content based, what is the unit of a blob in the object folder? I thought it was one file per blob, at least for new files?

I think that if A and B have some parts that are the same, but each of the entire files is different, then git won't split those parts out as a single blob. However, if they are exactly the same contents, then they can share a blob. — Code-Apprentice, Dec 06 '19 at 22:57
*"For revisions of A or B, I guess only incremental diff are stored"* -- that's how Subversion works, internally. Git keeps the entire content of each file for each revision. It computes diffs when it needs to process merges, cherry-picks, rebases etc. — axiac, Dec 11 '19 at 15:18

score 2 · Answer 1 · answered Dec 07 '19 at 00:20

If I add into one git repository two files A,B in one commit, A and B have overlapped contents ...

It's not clear to me what you mean by "overlapped contents" here. Perhaps you mean identical contents?

will git compare the two new file A and B?

Only if and when you tell it to do so—but see below for more details on a blob object.

For revisions of A or B, I guess only incremental diff are stored ...

That is not the case.

Let's look closely at what's stored in a commit. Here is commit 083378cc35c4dbcc607e4cdd24a5fca440163d17 in the Git repository for Git (though I've replaced @ with a space to maybe, perhaps, cut down on spam delivered to Junio Hamano):

$ git cat-file -p HEAD | sed 's/@/ /'
tree 79674d33d6f9f2c9ff29258f8c748aa785de8dc3
parent 88bd37a2d0f9ed504ac49fcecf6371d9fafc2a67
author Junio C Hamano <gitster pobox.com> 1575578639 -0800
committer Junio C Hamano <gitster pobox.com> 1575579169 -0800

The third batch

Signed-off-by: Junio C Hamano <gitster pobox.com>

That's actually the contents of the commit object. Note the tree line at the front: we can now look at the tree that this commit refers to, using git cat-file -p 79674d33d6f9f2c9ff29258f8c748aa785de8dc3 or git ls-tree 79674d33d6f9f2c9ff29258f8c748aa785de8dc3. The output is the same in this case, except that if we use git ls-tree, we can have it recurse into any sub-trees within the trees.

We'd like the recursion, because that shows every file stored in the commit. So we'll use git ls-tree -r on this. I won't quote the result, though, as it's over 3000 lines:

$ git ls-tree -r 79674d33d6f9f2c9ff29258f8c748aa785de8dc3 | wc -l
    3680

So this commit in Git mentions 3680 stored files, symlinks, and submodule hashes. We can group them by their stored mode, which is the first field of each line in the output:

$ git ls-tree -r 79674d33d6f9f2c9ff29258f8c748aa785de8dc3 | cut -f1 -d' ' | sort -u
100644
100755
120000
160000

If it is content based, what is the unit of blob in object folder?

A blob, or more precisely, an object of type blob, is one that holds some data. The 100644, 100755, and 120000 mode objects above identify blobs. (The 160000 object is a gitlink for a submodule and is not very interesting here.) Let's look at the actual symlinks, as there is in fact only one:

$ git ls-tree -r 79674d33d6f9f2c9ff29258f8c748aa785de8dc3 | grep '^120000 '
120000 blob 091dd024b349d6bc908371eddb7c594059c4fd70    RelNotes

Now let's see what's in this blob object 091dd024b349d6bc908371eddb7c594059c4fd70:

$ git cat-file -p 091dd024b349d6bc908371eddb7c594059c4fd70
Documentation/RelNotes/2.25.0.txt$

(note the lack of a final newline). This blob holds the target of the symlink named RelNotes.

Compare with, for instance:

$ git rev-parse HEAD:GIT-VERSION-GEN
22e8d83d98551298b769022f6fdd606225c34be5
$ git cat-file -p 22e8d83d98551298b769022f6fdd606225c34be5 | head -4
#!/bin/sh

GVF=GIT-VERSION-FILE
DEF_VER=v2.24.GIT

So for a file (mode 100644 or mode 100755), the blob object holds the file's data.

The name of the blob object is its hash ID, just as the name of any Git object is its hash ID. The hash ID is computed based on the object's type and content:

$ python3
...
>>> import hashlib
>>> h = hashlib.sha1()
>>> data = open("GIT-VERSION-GEN", "rb").read()
>>> len(data)
754
>>> h.update(b'blob 754\0')
>>> h.update(data)
>>> h.hexdigest()
'22e8d83d98551298b769022f6fdd606225c34be5'

That content is why the hash ID of GIT-VERSION-GEN is 22e8d83d98551298b769022f6fdd606225c34be5: it's the result of doing the SHA-1 checksum algorithm on the literal string blob 754 (where 754 is the number of bytes of data), followed by an ASCII NUL, followed by the data bytes themselves.

Hence, if you know in advance that a file will contain this data—any file—the hash ID of the blob for that file will be 22e8d83d98551298b769022f6fdd606225c34be5.
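
The interactive session above can be wrapped in a small function. This is only a sketch: git_blob_id is a name made up for this example (it is not a Git API), the contents of A and B are hypothetical, and it assumes a repository that uses SHA-1 object names.

```python
import hashlib

def git_blob_id(data: bytes) -> str:
    """Return the SHA-1 hash ID Git would give a blob holding these bytes."""
    header = b"blob " + str(len(data)).encode("ascii") + b"\0"
    return hashlib.sha1(header + data).hexdigest()

# Hypothetical contents for two files A and B that share their first line.
a = b"shared line\nonly in A\n"
b = b"shared line\nonly in B\n"

print(git_blob_id(b""))                  # hash ID of the empty blob
print(git_blob_id(a) == git_blob_id(a))  # same bytes, same ID
print(git_blob_id(a) == git_blob_id(b))  # partly shared bytes, different ID
```

Sharing most of the bytes does not help: any difference at all gives a different hash ID, and so a different blob object.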

After all this, we can go back to your original comment and question: if files A and B in your commit have the same content, their tree entries have the same hash ID. If they have different content, their tree entries have different hash IDs.

It's the tree entries that supply the names (A or B) and mode strings (100644 = not-executable, 100755 = executable) for these two files. Any commit that you make that stores files A and B will store two tree entries for them. The hash IDs in those tree entries will be those of the blob object (repeated twice) or objects (each different) that hold the contents of A and B (which are either the same, or different).

Git did not compare the contents of A and B to get here. Git simply said: I need a blob object to hold the contents of A, computed the checksum, and discovered whether there was already such a blob object (which then gets reused) or not (in which case the temporary object to hold the content "goes live", as it were, once the commit happens). Then Git did the same thing for file B. If the content in B is the same as that in A, then by the time Git finishes calculating the checksum, the object definitely exists already, and Git just re-uses it.¹
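
To illustrate the "look up by checksum, reuse if present" idea, here is a toy content-addressed store. It is a sketch, not Git's implementation: ToyObjectStore and put_blob are invented names, the objects live in a plain dictionary, and the file contents are made up.

```python
import hashlib

class ToyObjectStore:
    """Toy content-addressed store: maps hash ID -> stored bytes."""

    def __init__(self):
        self.objects = {}

    def put_blob(self, data: bytes) -> str:
        oid = hashlib.sha1((b"blob %d\0" % len(data)) + data).hexdigest()
        # No file is compared against any other file here:
        # the only question asked is "does this hash ID already exist?"
        if oid not in self.objects:
            self.objects[oid] = data
        return oid

store = ToyObjectStore()
# The "tree" is just name -> hash ID, like the tree entries described above.
tree = {
    "A": store.put_blob(b"same bytes\n"),
    "B": store.put_blob(b"same bytes\n"),
    "C": store.put_blob(b"same bytes, almost\n"),
}
print(tree["A"] == tree["B"])  # True: A and B share one object
print(len(store.objects))      # 2: three names, two stored objects
```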

Once that object's hash ID is in a tree whose hash ID is in a commit whose hash ID is reachable in the repository, that object will remain in the Git repository. That is, Git's garbage collector, git gc, runs occasionally and does the following:

find all tag objects reachable from any tag name or other reference
find all commits reachable directly by branch name, tag name, or any other reference, or by any reachable tag object
find all commits reachable, recursively, by reachable commits
find all trees reachable from any reachable commit or tag, or, recursively, any reachable tree
find all blobs reachable from any tree or tag, or from any index entry

(all the "or tag" items above are because both lightweight and annotated tags can point directly to any of the various object types, though of course a lightweight tag that points to an annotated tag object is just called an annotated tag).

All of these objects are reachable. (Note that there are per-worktree references, including the per-worktree HEAD, and per-worktree index files; git gc failed to scan these from the time added worktrees were introduced in Git 2.5 until this bug was fixed in Git 2.15.) Reachable objects are retained. Unreachable objects can be deleted, provided other criteria are met (prune time and various packing issues).
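
The reachability walk described above amounts to a graph traversal starting from the references. A minimal sketch, using a made-up object graph with readable names in place of hash IDs (none of this is how git gc is actually coded):

```python
from collections import deque

# Hypothetical object graph: each object maps to the objects it refers to.
objects = {
    "commit2": ["tree2", "commit1"],
    "commit1": ["tree1"],
    "tree2": ["blobA", "blobB"],
    "tree1": ["blobA"],
    "blobA": [],
    "blobB": [],
    "blobOld": [],  # nothing refers to this one any more
}
refs = {"refs/heads/main": "commit2"}

def reachable(objects, roots):
    """Return every object that can be reached from the given roots."""
    seen = set()
    queue = deque(roots)
    while queue:
        oid = queue.popleft()
        if oid in seen:
            continue
        seen.add(oid)
        queue.extend(objects[oid])
    return seen

live = reachable(objects, refs.values())
print(sorted(set(objects) - live))  # ['blobOld'], a candidate for pruning
```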

Each new commit stores a full and complete snapshot. The snapshot is produced by writing out the index as a series of Git tree objects, with the top-level tree holding the objects whose content will go into the top level of the resulting work-tree if the commit is checked out. (The actual git checkout process works by first reading the tree into an index representation, which is what expands out the various path names, in the case of trees within the top level tree. In this one sense, Git sort of does store directories, but they're not annotated with permissions, and internally, Git flattens them all out into the index first, so that it only has to deal with files.)
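
The relationship between the flat index and the nested trees can be sketched with plain dictionaries. This is an illustration only: build_trees and flatten are invented names, the blob IDs are placeholders, and real tree objects also carry a mode for each entry.

```python
def build_trees(index):
    """Turn {'dir/file': blob_id} into nested dicts, one dict per tree."""
    root = {}
    for path, blob_id in index.items():
        *dirs, name = path.split("/")
        tree = root
        for d in dirs:
            tree = tree.setdefault(d, {})
        tree[name] = blob_id
    return root

def flatten(tree, prefix=""):
    """The reverse direction: expand nested trees into full path names."""
    flat = {}
    for name, value in tree.items():
        if isinstance(value, dict):
            flat.update(flatten(value, prefix + name + "/"))
        else:
            flat[prefix + name] = value
    return flat

# Hypothetical index contents; the IDs are placeholders, not real hashes.
index = {
    "GIT-VERSION-GEN": "id-1",
    "Documentation/RelNotes/2.25.0.txt": "id-2",
    "Documentation/git.txt": "id-3",
}
trees = build_trees(index)
print(trees)
print(flatten(trees) == index)  # True: nothing is lost either way
```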

¹What if two files hash to the same blob hash ID? The answer is: Git can't store both files. Git just assumes that this never happens—and so far, that's worked. See also How does the newly found SHA-1 collision affect Git?

Objects aren't necessarily stored separately

If you take a big file (many megabytes, for instance) and make a small change to it and store the result in a new commit, you initially get two separate blob objects in what Git calls loose object format. These two objects, as stored in the .git/objects directory, are zlib-compressed, but they will probably still be fairly large.
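
As a rough illustration of the loose format: a loose object is the same "blob <size>", NUL, data byte string that gets hashed, zlib-compressed and stored under a path derived from the hash ID. The sketch below uses invented function names and random bytes standing in for a big file that does not compress well. It does not write anything to disk.

```python
import hashlib
import os
import zlib

def loose_object(data: bytes):
    """Return (path under .git, zlib-compressed bytes) for a loose blob."""
    store = (b"blob %d\0" % len(data)) + data
    oid = hashlib.sha1(store).hexdigest()
    # Loose objects live in a directory named after the first two hex digits.
    path = "objects/" + oid[:2] + "/" + oid[2:]
    return path, zlib.compress(store)

def read_loose(compressed: bytes) -> bytes:
    """Undo loose_object: decompress, then drop the header up to the NUL."""
    store = zlib.decompress(compressed)
    _header, _, data = store.partition(b"\0")
    return data

big = os.urandom(1 << 20)  # hypothetical 1 MiB file
changed = big[:1000] + b"small edit" + big[1000:]
for content in (big, changed):
    path, compressed = loose_object(content)
    print(path, len(compressed))
```

The two versions differ by a few bytes, yet they get unrelated paths, and each one is compressed on its own with no reference to the other.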

After objects have been in the repository for a while, though, Git's garbage collector runs git repack.² This collects up the individual object files and compresses them further. It uses a form of delta encoding that does not depend on text-file format: binary files can be delta-compressed here. Once some object is packed, parts of it may be shared with other objects that use it as a base object. Describing this process accurately is very hard.³ In general, though, those large blobs will be delta-compressed in pack files.

The result, though, is that at the object level, two large objects are completely distinct. At the pack level, they might have parts that are co-mingled (or "overlap" as you put it above). But no object can ever be changed: its identity is its hash ID, which is determined entirely by its content. So it's safe to do this, as long as the base object is never removed from the pack. (No pack can ever be changed either, so that's not a problem. Packs can get too big, and that is a problem.)
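
To give a feel for what "delta against a base object" means, here is a toy delta made of copy-from-base and insert-literal instructions. It is emphatically not Git's pack format or its heuristics: it leans on Python's difflib to find the common parts, and make_delta and apply_delta are names invented for this sketch.

```python
from difflib import SequenceMatcher

def make_delta(base: bytes, target: bytes):
    """Describe target as copy-from-base and insert-literal instructions."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # offset and length in base
        elif tag in ("replace", "insert"):
            ops.append(("insert", target[j1:j2]))  # bytes not found in base
        # "delete" needs no instruction: those base bytes are just not copied
    return ops

def apply_delta(base: bytes, ops) -> bytes:
    """Rebuild the target from the base object plus the instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)

# Hypothetical file contents: 200 short lines, with one line changed.
base = b"".join(b"line %d\n" % i for i in range(200))
target = base.replace(b"line 100\n", b"line one hundred\n")

delta = make_delta(base, target)
literal = sum(len(op[1]) for op in delta if op[0] == "insert")
print(len(target), "bytes in target,", literal, "stored as literal bytes")
print(apply_delta(base, delta) == target)  # True
```

Both objects keep their own hash IDs. Only the stored representation of the second one depends on the first, which is why the base has to stay available.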

²This can be tuned, or even disabled; see the documentation.

³Solving the problem of packing perfectly is too hard, so Git uses some heuristics, which are documented here.

Thx. I was thinking A and B are not identical but for most parts they are the same. I guess the pack part is more relevant. — ahala, Dec 09 '19 at 21:15

score 0 · Answer 2 · answered Dec 11 '19 at 15:06

The direct answer is that the unit of a blob object is the file. (git add -p lets you stage selected hunks, but the result is still stored as one blob holding a whole version of the file.) If A and B are not identical, even if they share common parts, they are stored separately in git by default, as what are called loose objects. But git may use pack files to save space, as described in the second part of the answer above.
