Git tree object and git terminology

Question

I've been learning about git and I'm quite confused by the terminology.

Do I understand it properly that a "tree object" is really something like a "folder object"? It keeps information of things inside it (blobs) and other trees (sub-folders). It keeps information about the "actual data" of the project we are working on.

At the same time, the history of commits/versions has a tree-like structure (really a directed acyclic graph, because of merges, but that's just a detail), and a path to a leaf in this tree could be called a branch. "Branches" in git, however, are actually just pointers to commits.

Do I understand this right? Is it just me, or is "tree object" a pretty misleading name, given the already existing tree structure of the version history? Even if you wanted to use the word tree, it would make more sense to call it "tree node object" or something, since a tree object in git doesn't seem to contain a whole tree, just some blobs and pointers to other trees. The name "branches" also seems misleading, for similar reasons.

Incidentally, for more about the poor overworked word "tree" in VCSes and in computing / informatics in general, see the side note currently on p. 21 [here](http://web.torek.net/torek/tmp/book.pdf). — torek, Dec 02 '19 at 02:14
Might as well complain about the use of the word "set". There's lots of words that need context to disambiguate. Once you've learned to be comfortable with `K` meaning variously potassium, kelvin, kilo (both 1000 and 1024) generally, kilometers specifically, and vastly more, all the context-dependent uses of "tree" won't be hard to get on top of. — jthill, Dec 02 '19 at 02:37
@jthill I partly just wanted to check whether I get the basic structure of git right, and I'm not complaining or implying it should be done some other way. If some programming language decides to use the word "linked list" for the node of linked list, I think it's ok to wonder whether that isn't misleading, ask about it, and see if maybe you're not missing some reason why it makes perfect sense to call it "linked list". That's really what I'm doing - I want to make sure I understand it right. — John P, Dec 02 '19 at 09:12

score 1 · Accepted Answer · answered Dec 02 '19 at 02:07

Except for the user-facing documentation's insistence on using the word tree-ish (if that even is a word), the term tree is internal to Git, so it shouldn't matter what they call it: tree, or marplot, or gripsack, or whatever you like.

That said, a tree object, inside Git, is simply one of the four object types. What it contains is a series of entries, with each entry holding three items:

- a mode: an octal number, terminated with ASCII space, with no leading zeros, that describes the type of the entry and gives the x bit for regular files;
- a name: a byte-sequence terminated with an ASCII NUL ('\0' in C, b'\0' in Python); and
- a raw hash ID: 20 unencoded bytes.¹
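To make that layout concrete, here is a minimal Python sketch that splits the body of a tree object into its entries. It assumes you already have the decompressed body with the object header removed; the names `TreeEntry`, `parse_tree`, and `pack_entry` are made up for this illustration and are not part of Git.

```python
from typing import Iterator, NamedTuple


class TreeEntry(NamedTuple):
    mode: str       # octal digits as ASCII text, e.g. "100644"
    name: bytes     # a single name component, as raw bytes
    hash_id: bytes  # raw (unencoded) hash bytes


def parse_tree(body: bytes, hash_len: int = 20) -> Iterator[TreeEntry]:
    """Yield the entries stored in a tree object's body."""
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)       # the mode ends at the first space
        nul = body.index(b"\0", space + 1)  # the name ends at the next NUL
        hash_id = body[nul + 1:nul + 1 + hash_len]
        if len(hash_id) != hash_len:
            raise ValueError("truncated tree entry")
        yield TreeEntry(body[pos:space].decode("ascii"),
                        body[space + 1:nul], hash_id)
        pos = nul + 1 + hash_len            # the hash has a fixed length


def pack_entry(mode: str, name: bytes, hash_id: bytes) -> bytes:
    """Inverse of parse_tree for one entry (handy for experiments)."""
    return mode.encode("ascii") + b" " + name + b"\0" + hash_id


# Two hand-made entries with dummy hash bytes, not real object IDs.
body = (pack_entry("100644", b"README", bytes(20))
        + pack_entry("40000", b"src", bytes(range(20))))
for entry in parse_tree(body):
    print(entry.mode, entry.name, entry.hash_id.hex())
```

Because the hash is stored raw, it may itself contain space or NUL bytes, which is why the parser skips it by length instead of searching for a delimiter.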

The name in a tree object is really just a name component. If the mode entry is 40000, the hash ID must be that of another tree object. If the mode is 120000, 100644, or 100755, the hash ID must be that of a blob object. If the mode is 160000, the hash ID is expected to be that of a commit object as stored in some other Git repository, i.e., a gitlink. Other modes are generally not allowed, though git fsck allows 100664 as this mode appears in some existing (very old) repositories.
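As a rough illustration of those rules, the mode string alone is enough to tell what kind of object the hash ID should name. The table and the function names `describe_mode` and `is_executable` below are hypothetical helpers for this sketch, not anything Git exposes.

```python
# Modes as they appear in a tree entry (octal text, no leading zeros).
KNOWN_MODES = {
    "40000": "tree",
    "100644": "blob (regular file)",
    "100755": "blob (executable file)",
    "120000": "blob (symbolic link)",
    "160000": "gitlink (commit in another repository)",
}


def describe_mode(mode: str) -> str:
    """Say what kind of object the entry's hash ID should refer to."""
    if mode in KNOWN_MODES:
        return KNOWN_MODES[mode]
    if mode == "100664":
        # Not written by current Git, but tolerated by git fsck.
        return "blob (regular file, legacy mode)"
    raise ValueError(f"unexpected tree entry mode: {mode}")


def is_executable(mode: str) -> bool:
    """True for a regular file whose owner x bit is set."""
    return mode.startswith("100") and bool(int(mode, 8) & 0o100)


for mode in ("40000", "100644", "100755", "120000", "160000"):
    print(mode, describe_mode(mode), is_executable(mode))
```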

The file name of a blob or (mode 120000) symbolic link is constructed by stringing together the name components of the tree objects that led to the blob, with slashes appended, and then adding the last component in the final tree object. That is, if the top-level tree object for some commit is T₀, and the blob or symlink appears directly in T₀, then the entry gives the name of the file that will hold the blob or symlink.

But if T₀ has an entry foo with mode 40000 and hash T₁, Git will go on to read tree object T₁. If that has an entry bar with mode 100xxx or 120000, the blob object will be a file or symlink whose name is foo/bar. Hence the file's path name is produced by traversing tree objects until reaching a leaf.
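The traversal just described can be sketched in a few lines of Python. This is only a model: the trees live in an in-memory dictionary keyed by made-up IDs ("T0", "T1", "B0", "B1") instead of real hash IDs, and `list_paths` is a hypothetical name.

```python
def list_paths(trees, tree_id, prefix=""):
    """Yield (path, mode, target) for every non-tree entry under tree_id."""
    for mode, name, target in trees[tree_id]:
        if mode == "40000":
            # Subtree: append its name plus a slash, then keep descending.
            yield from list_paths(trees, target, prefix + name + "/")
        else:
            # Leaf (blob, symlink, or gitlink): the path is now complete.
            yield prefix + name, mode, target


# T0 is the top-level tree; its entry "foo" points at tree T1.
trees = {
    "T0": [("100644", "README", "B0"), ("40000", "foo", "T1")],
    "T1": [("100644", "bar", "B1")],
}
for path, mode, target in list_paths(trees, "T0"):
    print(mode, target, path)
```

Note that no tree stores a full path. The name foo/bar exists only as the result of the walk.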

For a gitlink (tree entry with mode 160000), the constructed path name gives the submodule path that Git will check for in .gitmodules, if we must clone the submodule, and the hash ID is the commit we'll git checkout as a detached HEAD in that other Git repository. For all other entries, the hash ID should be that of an object in this Git repository, otherwise the tree object is incorrect or the repository is inconsistent (or both).

As someone using Git, you do not have to care about any of this: just put files in the index as usual, and use git write-tree to write everything. Use git read-tree to grab a tree by the hash ID in a commit, to fill the index² from that tree. Use git show or git cat-file to obtain a single file's contents using either a hash ID (blob hash) or a path name (commit-hash:path, which git rev-parse can translate, and for a long time now, git cat-file can handle as well).

¹This is kind of a mistake, because when Git goes to using longer hash IDs in the future, either the tree objects may have to store truncated hashes, or we'll need a new flavor of tree object. Note that Mercurial's internal tree data structures left more room. Git probably should have used an ASCII-ized hex digest terminated by another NUL. But there are enough other thorny issues to be resolved here that this one is kind of minor.

²If you set GIT_INDEX_FILE, git read-tree will read the tree into the alternate index whose path name you provided.

Tree-ish is using a convention from I think the Algol 68 report, maybe even older, the `-ish` usage on the end meaning anything that can be unambiguously converted to one. So commits and most tags are also tree-ish. — jthill, Dec 02 '19 at 05:53
@jthill: sure, but the point is that this introduces the term *tree* (or tree-ish) to the end user. Fortunately the gitglossary has a definition (for tree, tree object, and tree-ish). Unfortunately not enough Git pages have cross-references to gitglossary. — torek, Dec 02 '19 at 06:03

score 1 · Answer 2 · answered Dec 02 '19 at 06:02

The very first commit of the Git repository (Apr. 8th, 2005) referenced a "tree object" as

a list of permission/name/blob data, sorted by name.
In other words the tree object is uniquely determined by the set contents, and so two separate but identical trees will always share the exact same object.
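That "same contents, same object" property is easy to check with a small Python sketch. It assumes the object format that current SHA-1 repositories use (a `tree <size>` header, a NUL, then the sorted entries), which is not necessarily what that first commit did; `tree_body` and `tree_id` are hypothetical helper names.

```python
import hashlib


def tree_body(entries):
    """Serialize (mode, name, raw_hash) entries in Git's canonical order."""
    def sort_key(entry):
        mode, name, _ = entry
        # Git compares subtree names as if they ended in "/".
        return name + b"/" if mode == "40000" else name

    return b"".join(
        mode.encode("ascii") + b" " + name + b"\0" + raw_hash
        for mode, name, raw_hash in sorted(entries, key=sort_key)
    )


def tree_id(entries):
    """Hash ID of the tree object that holds these entries."""
    body = tree_body(entries)
    header = b"tree " + str(len(body)).encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Dummy hash bytes stand in for real blob and tree IDs.
entries = [
    ("100644", b"a.txt", bytes(20)),
    ("40000", b"lib", bytes(range(20))),
]
# The order in which entries are supplied does not matter.
print(tree_id(entries) == tree_id(list(reversed(entries))))
# The empty tree: 4b825dc642cb6eb9a060e54bf8d69288fbee4904
print(tree_id([]))
```

Sorting is what makes this work: two trees with the same set of entries serialize to the same bytes, so they hash to the same ID and are stored once.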

10 days later:

A "tree" object is an object that ties one or more "blob" objects into a directory structure. In addition, a tree object can refer to other tree objects, thus creating a directory hierarchy.

Last month I mentioned an upcoming Git 2.25 tutorial on object enumeration (commit e0479fa), which makes use of trees.

The object walk is a key concept in Git - this is the process that underpins operations like object transfer and fsck.
Beginning from a given commit, the list of objects is found by walking parent relationships between commits (commit X based on commit W) and containment relationships between objects (tree Y is contained within commit X, and blob Z is located within tree Y, giving our working tree for commit X something like y/z.txt).
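A toy version of that walk, as a hedged Python sketch: the objects are plain dictionaries with made-up single-letter IDs, and `walk_objects` is a hypothetical name. Git's own walk lives in its revision-walking C code and is considerably more involved.

```python
def walk_objects(commits, trees, start):
    """Collect every object reachable from the commit `start`, each once."""
    seen = set()
    found = []
    pending = [("commit", start)]
    while pending:
        kind, oid = pending.pop()
        if oid in seen:
            continue  # shared objects are visited only once
        seen.add(oid)
        found.append((kind, oid))
        if kind == "commit":
            # Parent relationship plus containment of the top-level tree.
            pending.append(("tree", commits[oid]["tree"]))
            pending.extend(("commit", p) for p in commits[oid]["parents"])
        elif kind == "tree":
            # Containment: subtrees and blobs listed in this tree.
            pending.extend((child_kind, child)
                           for child_kind, _name, child in trees[oid])
    return found


# Commit X is based on commit W; both top-level trees contain tree Y (as "y"),
# and Y contains blob Z (as "z.txt"), giving the path y/z.txt.
commits = {
    "X": {"parents": ["W"], "tree": "RX"},
    "W": {"parents": [], "tree": "RW"},
}
trees = {
    "RX": [("tree", "y", "Y"), ("blob", "new.txt", "N")],
    "RW": [("tree", "y", "Y")],
    "Y": [("blob", "z.txt", "Z")],
}
for kind, oid in walk_objects(commits, trees, "X"):
    print(kind, oid)
```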
