How is a monotree organized with git?

Question

I recently came across an article by Greg Kroah-Hartman on why the Linux kernel does not have a stable API and how the kernel repository is organized as a monotree. When I discussed the article with a friend, it became clear that we had different understandings of what the term tree refers to:

1. tree refers to the different sub-folders of a project.
2. tree refers to the different forks of the git master branch.

In the first case, contributors would not check out the complete project, e.g. the Linux kernel, but only a sub-folder. These sub-folders could then be combined with e.g. git-subtree.

In the second case, contributors would have to check out the complete project and basically create a fork of a monorepo.

So what does tree in monotree refer to and how can a project be organized as a monotree with git?

The first case is closer to correct, but the root folder of the repository is also tracked as a tree. Run `git log -1 --pretty=raw` and you can see the root tree of a commit. It is named by a hash, like a commit. Run `git ls-tree ` and you can see what that tree contains. It can contain blobs, trees and commits. Blobs with file names represent files. Trees with folder names represent folders. Commits represent submodules. These are the things you can see in the root folder. To see what a subfolder contains, run `git ls-tree ` again. You could add `-r -t` to see the whole picture. — ElpieKay, Jan 23 '18 at 16:04
Thanks. This cleared things a bit. I should look into the architecture of git a little more. — Karsten, Jan 24 '18 at 09:51

score 3 · Answer 1 · answered Jan 23 '18 at 18:22

Let's make a few notes here:

The phrase monotree, or even the partial word mono, never appears in the referenced article.
The article has seven occurrences of the word tree.
In six of these seven occurrences, the entire phrase here is the main kernel tree. The one reference that does not use this full phrase just says the tree but clearly has the same intent as the other six.
You have tagged this with git linux monorepo (in case the tags change).

Your question amounts to either: What does the author mean by the phrase "the main kernel tree"? or What do people in general mean when they refer to a tree? These are valid questions but not particularly relevant to Git.

Tree in computer science tends to refer to the data structure, which is also pretty loosely defined; see the Wikipedia entry. We have some collection of nodes and edges (mathematically, a graph G defined by its set of vertices V and edges E, where each vertex connects by edges to other vertices), and there are constraints on the graph so that it is minimally connected, or equivalently, maximally acyclic. (See https://en.wikiversity.org/wiki/Introduction_to_graph_theory/Proof_of_Theorem_4 and the answers to What's the difference between the data structure Tree and Graph?)
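As a rough illustration of "minimally connected", here is a small Python sketch that checks whether an undirected graph is a tree. It relies on the standard fact that a connected graph on n vertices is a tree exactly when it has n - 1 edges. The function name `is_tree` and the choice to treat the empty graph as "not a tree" are assumptions made for this example, not anything from the article or from Git.

```python
from collections import defaultdict


def is_tree(vertices, edges):
    """Return True if the undirected graph (vertices, edges) is a tree."""
    vertices = set(vertices)
    if not vertices:
        return False  # convention chosen for this sketch only

    # A tree on n vertices has exactly n - 1 edges: one fewer and it
    # falls apart, one more and it contains a cycle.
    if len(edges) != len(vertices) - 1:
        return False

    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    # Depth-first search from an arbitrary vertex to test connectivity.
    start = next(iter(vertices))
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen == vertices


# Three edges connecting four vertices: a tree.
print(is_tree("abcd", [("a", "b"), ("a", "c"), ("c", "d")]))
# One extra edge closes a cycle, so this is no longer a tree.
print(is_tree("abcd", [("a", "b"), ("a", "c"), ("c", "d"), ("d", "a")]))
```

Removing any edge from the first example disconnects it, and adding any edge creates a cycle, which is what "minimally connected" and "maximally acyclic" describe.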

A tree object in Git specifically refers to the stored Git object of Git-type "tree" (one of four Git object types that are stored in the repository database—the other three are commit, blob, and annotated tag). Such an object stores <mode, name, hash-ID> triples, where the mode and hash-ID identify additional Git objects to associate with the name, which is an arbitrary¹ string of bytes excluding NUL and slash (codes 0 and 0x2f or 47 respectively). A commit object stored in Git includes the hash ID of a single tree object. Reading the tree object and locating the sub-objects it lists, then recursively reading their own sub-objects if those objects are trees, results in constructing the minimally-connected graph that is a CS-style tree.

¹There's a length limit due to the cache entry ce_namelen field, which has a 32-bit integer type. So no name component can exceed about 4 GB in length. Practically speaking, none should probably exceed 255 bytes, but tree objects in Git don't enforce any particular limit, as far as I know.
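The tree-object description above can be sketched in Python. This is a toy model, not Git's implementation: the dictionary `store` and the helpers `put`, `write_tree`, `read_tree` and `walk` are hypothetical names invented for this example. It does assume the SHA-1 object format (a header of the form `<type> <size>` followed by a NUL byte, then the body), which is why the IDs for the empty blob and empty tree should match the ones real Git reports.

```python
import hashlib

store = {}  # hypothetical in-memory stand-in for the repository database


def put(kind, body):
    """Store an object and return its hash ID (SHA-1 object format)."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    oid = hashlib.sha1(header + body).hexdigest()
    store[oid] = (kind, body)
    return oid


def write_tree(entries):
    """Store a tree from (mode, name, hash-ID) triples; names are bytes."""
    for _, name, _ in entries:
        # Names are arbitrary bytes, except NUL and slash.
        if not name or b"\0" in name or b"/" in name:
            raise ValueError(f"invalid name component: {name!r}")

    # Entries are sorted by name, with sub-trees compared as if their
    # name ended in a slash.
    def sort_key(entry):
        mode, name, _ = entry
        return name + (b"/" if mode == "40000" else b"")

    body = b""
    for mode, name, oid in sorted(entries, key=sort_key):
        body += mode.encode() + b" " + name + b"\0" + bytes.fromhex(oid)
    return put("tree", body)


def read_tree(oid):
    """Parse a stored tree back into (mode, name, hash-ID) triples."""
    kind, body = store[oid]
    if kind != "tree":
        raise TypeError(f"{oid} is a {kind}, not a tree")
    entries, pos = [], 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul]
        child = body[nul + 1:nul + 21].hex()  # 20 raw hash bytes
        entries.append((mode, name, child))
        pos = nul + 21
    return entries


def walk(oid, prefix=b""):
    """Recursively list a tree, descending only into sub-trees."""
    for mode, name, child in read_tree(oid):
        path = prefix + name
        yield mode, path.decode("utf-8", "replace"), child
        if mode == "40000":
            yield from walk(child, path + b"/")


# Build a two-level tree and list it, similar in spirit to `git ls-tree -r -t`.
readme = put("blob", b"hello\n")
main_c = put("blob", b"int main(void) { return 0; }\n")
src = write_tree([("100644", b"main.c", main_c)])
root = write_tree([("40000", b"src", src), ("100644", b"README", readme)])
for mode, path, oid in walk(root):
    print(mode, oid, path)
```

A commit object would then carry the hash ID of `root`, and following the triples downward from it never revisits a path, which is what makes the result a CS-style tree.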

A file system tree in Linux is really just a string identifying an entity within the file system, though naming anything other than a directory results in a degenerate tree with just one node in it. By naming a directory, though, you can imply that anyone interpreting this string should read the directory's contents, which are names that (by being concatenated with the string identifying the directory itself) name another Linux file system tree, possibly a degenerate one with a single file or device node or whatever. This kind of recursive enumeration leads to building up a minimally-connected graph, just as with the Git tree object. (Perhaps unsurprisingly, the Linux directory objects have essentially the same constraints on names as the Git tree objects, though they usually have a much smaller maximum component name length, typically 255 bytes or fewer.)
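The recursive enumeration described above can be sketched in Python as well. The function name `enumerate_tree` and the demo file names are made up for this example. One design choice is an assumption on my part: symbolic links are not followed, since a link pointing back up the hierarchy would introduce a cycle and the result would no longer be a tree.

```python
import os
import tempfile


def enumerate_tree(path):
    """Yield path itself and, if it names a directory, everything below it."""
    yield path
    # Naming anything other than a directory gives a one-node tree.
    if os.path.isdir(path) and not os.path.islink(path):
        for name in sorted(os.listdir(path)):
            # Concatenating the directory's path with each name it holds
            # names another (possibly degenerate) file system tree.
            yield from enumerate_tree(os.path.join(path, name))


# Demo on a throwaway directory so the sketch does not depend on any
# particular layout on disk.
with tempfile.TemporaryDirectory() as demo_root:
    os.mkdir(os.path.join(demo_root, "src"))
    for relative in ("README", os.path.join("src", "main.c")):
        with open(os.path.join(demo_root, relative), "w") as handle:
            handle.write("placeholder\n")
    for entry in enumerate_tree(demo_root):
        print(os.path.relpath(entry, demo_root))
```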

Finally, the phrase the main kernel tree, as used in the article, refers to the Linux kernel repository (Linus Torvalds's Git repository for the Linux kernel) and the entire ecosystem around it. There is a lot of room for arguments about the details. Here, I will just include a link to this particular InfoWorld article, which seems like a reasonable summary of the state of affairs as of the time it was written (August 2016).

Thanks for this elaborate answer. I was not clear enough with my question. I do know what trees are in computer science. However, when I read about the kernel repository being organized as a monotree, I wondered what they meant by that. @ElpieKay's comment made it clearer to me. — Karsten, Jan 24 '18 at 09:49
