Let's start by noting that a submodule is a Git repository in its own right. As a repository, it has everything a repository has: commits, branch names, tag names, an index, a work-tree, and so on. At the same time, it's a submodule, surrounded in some way by another repository.
Git probably should have direct support for "subrepos".1 It doesn't. Subrepos is a term I made up here to tell these (unspecified) things apart from Git's submodules. I am not the first to make up this term; searching for "subrepos" will find several actual implementations (all different).
Submodules are designed to be detached
Git thinks of[1] submodules as external dependencies: something outside your project and outside your control. In your project (which Git now calls the superproject, to distinguish it from the submodule Git repository) you will record:
- The URL for the repository to clone: the argument you would give to git clone.
- The place in your superproject where the clone should go.
These two items go into a file named .gitmodules in the top level of your repository. In case you want more than one submodule, each one has a name:
    $ cat .gitmodules
    [submodule "path/to/sub1"]
        path = path/to/sub1
        url = ssh://one.example.com/repo1.git
    [submodule "path2/sub2"]
        path = path2/sub2
        url = ssh://two.example.com/repo2.git
The path is somewhat redundant here, since by default the submodule's name is its path, but that's how these are laid out. In any case, Git will, in effect, make an empty directory at the path, cd there, and git clone the URL into that initially-empty directory.[2]
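As a rough illustration of the above, the following shell sketch builds a throwaway library repository and a superproject in a temporary directory, then lets git submodule add write the .gitmodules entry. The repository names (lib, super), the file lib.txt, and the use of a local path as the URL are assumptions made for this example. The -c protocol.file.allow=always setting is there only because recent Git versions refuse to clone submodules from local paths by default.

```shell
set -eu
# Throwaway identity so commits work on a machine with no Git config.
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)

# Hypothetical stand-in for the external repository.
git init -q "$work/lib"
git -C "$work/lib" symbolic-ref HEAD refs/heads/master
echo 'hello' > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" commit -q -m 'library: initial commit'

# The superproject: "git submodule add" clones the URL into the path
# and records both in .gitmodules.
git init -q "$work/super"
cd "$work/super"
git -c protocol.file.allow=always submodule --quiet add "$work/lib" path/to/sub1

# Read the two recorded items back out of .gitmodules.
recorded_path=$(git config -f .gitmodules submodule.path/to/sub1.path)
recorded_url=$(git config -f .gitmodules submodule.path/to/sub1.url)
cat .gitmodules
```

In a real project the URL would normally be something like the ssh:// URLs shown earlier rather than a local directory.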
There is something missing here. When you clone a repository, you also git checkout some particular commit, typically by a branch name like master. You do this so that the work-tree for that repository is populated with files from that commit. (Git actually copies the files to the index for that work-tree first, but the index entries are tiny, and if you never do any work in the sub-repository, you won't care about this.)
So: what commit should Git check out, in the work-tree for this other repository?
[1] Don't anthropomorphize computers: they hate that!
[2] In older versions of Git, Git literally does that, and the submodule repository is quite indistinguishable from any other repository. That method still works, but modern Git now puts the .git directory for the submodule inside the .git directory for the superproject. In the submodule's work-tree, instead of a .git directory, it writes a .git file that contains the path of the submodule's .git container. This makes it possible to discover, from the submodule, that the submodule is in fact a submodule as well as an independent Git repository.
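The layout described in footnote 2 can be checked with a short sketch like the one below. The repository names and the local-path URL are assumptions for this example, and the exact relative path printed from the .git file may differ between Git versions.

```shell
set -eu
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)

# Hypothetical library repository with a single commit.
git init -q "$work/lib"
git -C "$work/lib" symbolic-ref HEAD refs/heads/master
echo 'hello' > "$work/lib/lib.txt"
git -C "$work/lib" add lib.txt
git -C "$work/lib" commit -q -m 'library: initial commit'

git init -q "$work/super"
cd "$work/super"
git -c protocol.file.allow=always submodule --quiet add "$work/lib" path/to/sub1

# In modern Git the submodule's .git is a one-line file, not a directory.
cat path/to/sub1/.git

# The actual repository data lives under the superproject's .git directory.
ls -d .git/modules/path/to/sub1
```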
Names of commits
Git has many ways to name a commit. For a complete list, see the gitrevisions documentation. You can always translate any of these names to a raw commit hash ID. In fact, the git rev-parse command is designed to do just that:[3]
    $ git rev-parse master
    0afbf6caa5b16dcfa3074982e5b48e27d452dbbb
We already noted above that git clone ends with a git checkout, usually of a branch name like master. However, you can run git clone -b tag-name, in which case git clone runs git checkout tag-name. If you do this, you will find that you are in "detached HEAD" mode.
If you run git clone --no-checkout, the clone step doesn't run git checkout at all, and now you can use git checkout hash-id to check out a specific commit. Once again, you will be in "detached HEAD" mode.
Hence, using a name to check out one specific commit is, with one important difference, basically the same as using git checkout hash-id to check out the commit to which the name points: the hash ID you would get from using git rev-parse on the name.
The one important difference is, of course, that if the name is a branch name like master, Git will put you "on the branch", rather than in "detached HEAD" mode. (You can force detached HEAD mode anyway using git checkout --detach name, if you want.)
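The four cases just described (branch name, tag name, raw hash ID, and --detach) can be compared side by side with a sketch like this one. The upstream repository, the clone directory names, and the head_state helper are all made up for the example.

```shell
set -eu
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)

# Hypothetical upstream repository with one commit and one tag.
git init -q "$work/upstream"
cd "$work/upstream"
git symbolic-ref HEAD refs/heads/master
echo one > file.txt
git add file.txt
git commit -q -m 'first commit'
git tag v1.1
hash=$(git rev-parse v1.1)
cd "$work"

# Print the current branch name, or "detached" when HEAD is not on a branch.
head_state() {
    git -C "$1" symbolic-ref -q --short HEAD || echo detached
}

git clone -q "$work/upstream" by-branch            # default: ends on a branch
git -c advice.detachedHead=false clone -q -b v1.1 "$work/upstream" by-tag
git clone -q --no-checkout "$work/upstream" by-hash
git -C by-hash -c advice.detachedHead=false checkout -q "$hash"

on_branch=$(head_state by-branch)
on_tag=$(head_state by-tag)
on_hash=$(head_state by-hash)

# A branch name can still be checked out detached on request.
git -C by-branch checkout -q --detach master
forced=$(head_state by-branch)

printf '%s\n' "branch: $on_branch" "tag: $on_tag" "hash: $on_hash" "--detach: $forced"
```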
The important thing to note about these many names for a commit is that in the end, the names just translate to a hash ID. It's the hash ID, not the name, that truly identifies that particular commit. In fact, the point of a branch name (as opposed to, say, a tag name[4]) is that the branch name changes over time: it may point to 0afbf6caa... today, but to 468165c1d... tomorrow.
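A minimal sketch of that behavior, using a scratch repository whose name, file, and tag (v1.0) are invented for the example: the branch name resolves to a different hash ID after a new commit, while the tag keeps resolving to the original one.

```shell
set -eu
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)
git init -q "$work/repo"
cd "$work/repo"
git symbolic-ref HEAD refs/heads/master

echo one > file.txt
git add file.txt
git commit -q -m 'first commit'
git tag v1.0                        # lightweight tag on the first commit
before=$(git rev-parse master)

echo two > file.txt
git commit -q -am 'second commit'
after=$(git rev-parse master)       # the branch name now names a new commit
tagged=$(git rev-parse v1.0)        # the tag still names the first commit

echo "master before: $before"
echo "master after:  $after"
echo "v1.0:          $tagged"
```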
So a submodule is a Git repository that we don't control, for (say) a library of various functions. Suppose our superproject uses that library, and works with one particular version of that library: v1.1, which is a123456.... What we want to record, then, is:
- the URL of the repository;
- the path where the repository is to be checked out; and
- v1.1 or a123456...: the specific hash ID to check out, as a detached HEAD.
This way, when we git checkout master in the superproject, whatever our master hash ID turns out to be, we want Git to git clone the submodule if necessary and then check out the one specific commit a123456.... This will be a detached HEAD in the submodule!
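To see this happen, the sketch below pins a library commit in a superproject, lets the library move on, and then clones the superproject fresh. The repository names, version.txt, and the local-path URL are assumptions for the example. The submodule in the fresh clone should end up on the pinned commit, detached, and not on the library's newer master.

```shell
set -eu
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)

# Hypothetical library, tagged v1.1.
git init -q "$work/lib"
git -C "$work/lib" symbolic-ref HEAD refs/heads/master
echo 1.1 > "$work/lib/version.txt"
git -C "$work/lib" add version.txt
git -C "$work/lib" commit -q -m 'library v1.1'
git -C "$work/lib" tag v1.1
pinned=$(git -C "$work/lib" rev-parse v1.1)

# Superproject that records the library (currently at v1.1) as a submodule.
git init -q "$work/super"
cd "$work/super"
git symbolic-ref HEAD refs/heads/master
git -c protocol.file.allow=always submodule --quiet add "$work/lib" lib
git commit -q -m 'add library as submodule'

# The library moves on after the superproject has pinned v1.1.
echo 1.2 > "$work/lib/version.txt"
git -C "$work/lib" commit -q -am 'library v1.2'

# A fresh clone of the superproject, then populate its submodule.
git clone -q "$work/super" "$work/fresh"
cd "$work/fresh"
git -c protocol.file.allow=always submodule --quiet update --init

sub_head=$(git -C lib rev-parse HEAD)
sub_branch=$(git -C lib symbolic-ref -q --short HEAD || echo detached)
echo "submodule HEAD: $sub_head ($sub_branch)"
```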
Over time, we will work on our own superproject. We may need a new feature from v1.2 of the library, which is commit b789abc... rather than a123456.... So we'll go into the submodule, git checkout v1.2 to get a new detached HEAD commit, go back into our superproject, make everything work with the updated submodule, and then git add (to the index for the work-tree for the superproject) a record saying use b789abc.... We will then git commit to save the new pairing: the new commit uses b789abc....
It doesn't matter how we got b789abc... checked out in the submodule. What matters is that the superproject now works with that commit. We want superproject commits from here on to say use that specific commit from the submodule.
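The update cycle described above might look like the following sketch. The library, its v1.1 and v1.2 tags, and the lib path are stand-ins created just for this example. git rev-parse HEAD:lib is used here to read back the submodule hash ID recorded in each superproject commit.

```shell
set -eu
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)

# Hypothetical library with two tagged versions.
git init -q "$work/lib"
cd "$work/lib"
git symbolic-ref HEAD refs/heads/master
echo 1.1 > version.txt
git add version.txt
git commit -q -m 'library v1.1'
git tag v1.1
echo 1.2 > version.txt
git commit -q -am 'library v1.2'
git tag v1.2

# Superproject that starts out using v1.1.
git init -q "$work/super"
cd "$work/super"
git -c protocol.file.allow=always submodule --quiet add "$work/lib" lib
git -C lib -c advice.detachedHead=false checkout -q v1.1
git add lib
git commit -q -m 'use library v1.1'
old_pin=$(git rev-parse HEAD:lib)

# Later: move the submodule to v1.2, then record the new pairing.
git -C lib -c advice.detachedHead=false checkout -q v1.2
git add lib
git commit -q -m 'use library v1.2'
new_pin=$(git rev-parse HEAD:lib)

echo "old pin: $old_pin"
echo "new pin: $new_pin"
```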
[3] The rev-parse command can do a lot more than just turn a name into a hash ID, but that's one of its main functions.
[4] Tag names, unlike branch names, are not supposed to change. Tag names also have the advantage of being something we choose: v2.17.0 is probably more meaningful than 468165c1d.... It might be nice if Git submodules could be specified by tag names, but they can't (at least today).
Conclusion (or is it?)
Specific commits are detached HEADs. Superprojects use specific commits out of their submodules. Therefore, superprojects call for detached HEADs in submodules. That's it—that's why Git is designed to work this way.
The main thing that this doesn't address is the updating process. We mentioned above that at some point, we want to update the freeze-point within the submodule. To do that, we will cd into the submodule work-tree and start doing Gitty things like git checkout branchname, maybe with a git fetch first. At this point names (branch and/or tag names) may become important. It might be nice to record them in the superproject somewhere.
That's what adding branch = name to a superproject's .gitmodules entry for a submodule does. Newer versions of Git have some support for this: in particular, git submodule update can (depending on options) use it. The documentation for this is rather difficult to read. It helps a lot to keep in mind that each submodule Git repository is a Git repository all on its own, as well as being a submodule. Since it is a repository, it can have its own origin remote as well as its own branches, so that running git fetch updates origin/master, and so on.
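One option that uses the branch setting is git submodule update --remote. The sketch below is only an illustration under assumed names (lib, super, a local-path URL): it records branch = master in .gitmodules, lets the library gain a commit, and then runs the update. The submodule work-tree should move to the tip of origin/master, while the hash ID stored in the superproject's index stays where it was until we git add the submodule again.

```shell
set -eu
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)

# Hypothetical library repository.
git init -q "$work/lib"
git -C "$work/lib" symbolic-ref HEAD refs/heads/master
echo one > "$work/lib/file.txt"
git -C "$work/lib" add file.txt
git -C "$work/lib" commit -q -m 'library: first commit'

git init -q "$work/super"
cd "$work/super"
git -c protocol.file.allow=always submodule --quiet add "$work/lib" lib
# Record the branch to follow in the submodule's .gitmodules entry.
git config -f .gitmodules submodule.lib.branch master
git add .gitmodules
git commit -q -m 'add library, tracking master'
pinned=$(git rev-parse HEAD:lib)

# Upstream moves on.
echo two > "$work/lib/file.txt"
git -C "$work/lib" commit -q -am 'library: second commit'
upstream_tip=$(git -C "$work/lib" rev-parse master)

# --remote fetches in the submodule and checks out the tip of the
# recorded branch rather than the commit pinned in the superproject.
git -c protocol.file.allow=always submodule --quiet update --remote lib

sub_head=$(git -C lib rev-parse HEAD)   # what the submodule has checked out
index_pin=$(git rev-parse :lib)         # what the superproject index records
echo "submodule HEAD: $sub_head"
echo "index entry:    $index_pin"
```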
As far as the superproject's Git is concerned, though, the submodule actually has one specific commit checked out. This is true whether or not the submodule Git is in "detached HEAD" mode. The commits you make in the superproject record one specific hash ID, every time you make a commit. That one hash ID is the one stored in the index for the work-tree for the superproject. ("The one stored in the index for the work-tree for the superproject" is kind of a mouthful, and worse, some of the usual tools for inspecting the index, of which there are not that many, don't really work very well for submodules.)
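Two commands that do show this entry are git ls-files --stage, which reads the index, and git rev-parse HEAD:path, which reads the committed tree. The sketch below uses an assumed lib submodule in a scratch superproject. The entry appears with mode 160000 (a "gitlink") followed by the recorded commit hash ID.

```shell
set -eu
export GIT_AUTHOR_NAME=Example GIT_AUTHOR_EMAIL=example@example.com
export GIT_COMMITTER_NAME=Example GIT_COMMITTER_EMAIL=example@example.com
work=$(mktemp -d)

# Hypothetical library repository.
git init -q "$work/lib"
git -C "$work/lib" symbolic-ref HEAD refs/heads/master
echo one > "$work/lib/file.txt"
git -C "$work/lib" add file.txt
git -C "$work/lib" commit -q -m 'library: first commit'

git init -q "$work/super"
cd "$work/super"
git -c protocol.file.allow=always submodule --quiet add "$work/lib" lib
git commit -q -m 'add library as submodule'

# The index entry for the submodule: mode, hash ID, stage number, path.
git ls-files --stage lib
entry_mode=$(git ls-files --stage lib | cut -d' ' -f1)
entry_hash=$(git ls-files --stage lib | cut -d' ' -f2)

# The same hash ID is stored in the superproject commit's tree,
# and it matches what the submodule currently has checked out.
committed_hash=$(git rev-parse HEAD:lib)
sub_head=$(git -C lib rev-parse HEAD)
```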