5

I have added a git submodule to my git repository and it works fine.

In my "parent" repository I have created a feature branch: myfeature that requires some changes to the sub module. But I don't want to affect other teams using the same sub module. Therefore I have created a corresponding feature branch on the submodule repository submodule-feature with some changes. I have then added/committet the change from the submodule directory followed by the same in the root of the parent repository.

But when I switch back to master on my "parent" repository the submodule is still on the submodule-feature feature branch. That is not what I expect. Because now when I run my tests on master they fail because I have introduced some breaking changes in the submodule on the submodule-feature branch.

Is it not possible to lock the branch of a submodule to the parent repository branch?

EDIT: Based on: How can I specify a branch/tag when adding a Git submodule?

Looks like I can specify a branch for the sub module repository in .gitmodules

[submodule "mysubmodule"]
    path = mysubmodule
    url = https://bla.git
    branch = submodule-feature

And adding the following additional git behavior in jenkins:

enter image description here

and:

enter image description here

it clones/checkout the submodule-feature branch when running a build on the parent myfeature branch.

But that will of course require some manual steps when working locally. But from CI side its very easy to accomplish.

u123
  • 15,603
  • 58
  • 186
  • 303
  • I have never seen a provision for locking the branches as such. though, one of the ways I did it in one of my past projects was by writing a bash script customized to cater my directory structure. – AppleCiderGuy Mar 02 '19 at 20:44
  • Ok thats odd, so what if I want to do some experiment changes on the submodule that breaks stuff, then all other developers will be affect by this when they update their repo including the submodule? Except if I start hacking on init scripts that puts all in the right order. – u123 Mar 02 '19 at 20:50
  • Ideally your master from your super module should always be pointing to masters of submodule. you can just follow that as a practice and deny all PRs that violate it, unless, that's what you want to do (run a test). it Highly depends on your GitFlow though! something like `forking` would allow you to keep it restricted to your repo. – AppleCiderGuy Mar 02 '19 at 20:59

1 Answers1

12

The short answer is just mostly "no". While Jenkins has a checkbox, it may not be a good idea to use it here—whether it is or not, depends on who controls the name-to-ID-mapping. Other CI systems may or may not have similar checkboxes. To see what I'm getting at here, read on.

The philosophy of submodules is that the superproject is in control of its submodules. This part, I think, is neither surprising nor objectionable to anyone. But the key lies in the way the superproject controls each submodule. This part does surprise people, and the reason is fairly simple. It's a basic misunderstanding of Git repositories in general.

People think that what matters in Git repositories are branches, or more precisely, branch names like master and develop. That's simply not true. These branches, for the most part, are virtually irrelevant here. For humans, these branch names serve an enormous, overriding purpose. For Git, they serve a mostly-trivial point that's equally well covered by any other name, such as a tag name, or a remote-tracking name, or refs/stash, or HEAD@{17}.1

In Git, the commit, not the branch name (nor tag name, nor any other name), is the central, essential thing. Commits are Git's raison d'être. Without commits, Git has no function. With commits, Git is useful. Commits are actually identified by their hash IDs, whose true names are those big ugly strings like b5101f929789889c2e536d915698f58d5c5c6b7a. Silly things like readable names, like master or develop, are for the weak, the biologicals ... the humans.

Of course, we, being weak humans, like our names. So we use them in our repositories. But when we have a repository that, like a superproject, is controlling another repository, such as a submodule—well, in this case, there are no humans involved. So Git uses the commit ID to control which commit hash ID is extracted in each submodule.

So this is where the surprise comes in—except, once you understand where Git is coming from, it's not surprising at all. When you let the superproject choose the submodule commit, the superproject chooses the submodule commit by hash ID. Any branch names are irrelevant. The hash ID is precise and is always correct. Branch names are sloppy—they move, on purpose, from commit to commit, over time. One commit hash ID can have zero or more branch names that either point directly to it, or can reach it through the commit graph.2

Each and every commit you make in the superproject records the exact submodule hash ID that the submodule is expected to have checked out. Hence, when you git checkout some commit in the superproject, you should generally immediately have each submodule do its own separate git checkout by the hash ID specified in the superproject.3

Remember that each submodule is a Git repository of its own, so it has its own HEAD, index and work-tree. The index in the submodule records the files that are checked out into the work-tree of the submodule, and the HEAD in each submodule is in detached HEAD mode, recording the hash ID of the currently-checked-out commit. It's the superproject's Git that chooses this hash ID—by storing it in a commit in the superproject—and it's the submodule's Git's responsibility to have this particular commit checked out. Nowhere in this process is there any mention of a branch name. Branch names are irrelevant!


1The in-Git function of names, besides of course providing a crutch for weak humans, is to protect objects from being garbage collected. An object is vulnerable to collection if it is not reachable from some name. Since most commits are mostly chained together, one name tends to protect most of the commits in a repository. See also footnote 2.

2For more about reachability, see Think Like (a) Git.

3This doesn't actually happen automatically by default. You have to use git checkout --recurse-submodules or set submodule.recurse in your configuration. Depending on what you're doing—especially, if you're trying to update the submodules—having it happen automatically is either convenient, or extremely annoying.


Why, then, can you set a branch name in the first place?

As you noted, the .gitmodules file can record a branch name. You can also copy this into .git/config (the .git/config setting overrides the .gitmodules setting, if both are set.) But typically, the submodule isn't on a branch at all; it's put into detached HEAD mode, as described above. So what good is this branch name?

The first, but somewhat unsatisfactory, answer is: It's no good at all. Most operations just don't use it.

The second, more satisfactory answer is: A few special-purpose operations do use it. Specifically, if you are in the process of updating the superproject, and want to make a new superproject commit that records a new submodule hash ID, you need some way to pick out the new submodule commit hash ID. There are multiple ways to go about this, and the name is designed for use in one of those ways.

Suppose, for instance, that the submodule is a public repository (perhaps on GitHub) that you don't control. You just use it. Perhaps twice a year, or perhaps 50000 times a day, someone updates the GitHub repository. The new commit(s) they put on their master or their develop or whatever, break(s) a bunch of stuff you use, but that's not a problem, because your superproject doesn't say "get me their latest master or develop commit", your superproject says "get me commit a123456...", and a123456... is always the same commit, forever, until the heat death of the universe, or we stop using Git, whichever occurs first. But, while breaking a bunch of your own sh— err, software, they've introduced a cool new feature that you must have.

What you would like to do at this point is have your Git, the one that's controlling your submodule too, tell your submodule Git: Go get me their latest master or develop or whatever name I recorded earlier. Since you did record that name, you can direct your Git to direct your submodule to do that using:

git submodule update --remote

(to which you can add some extra flags like --checkout or --rebase or --merge, but I'm not going to get into these details—I'm going to assume for now that you just use their latest directly). Your Git has your submodule Git run git fetch and then updates your submodule repository to their latest commit as directed by your submodule's copy of their branch name. (There are now at least three Gits involved in all this—your superproject, your submodule, and the Git repository on GitHub—so it's a little complicated. They, whoever they are, probably have one or more Git repositories they use to control the GitHub one, but at least you don't have to deal with that. Well, not yet.)

Now that your submodule is updated, you must fix up your own code, both to use the the new feature and to deal with all the breaking changes they made to stuff you were already using. So you do all of that, building and testing your software—all on your local machine: there's no CI involved here, not yet—and get it all working. Now you can git add your changes and git add the name of the submodule. Your superproject's index and work-tree now all match up and you are ready to make a new commit in your superproject.

Note that git add submodule-path merely told your Git to record, in your index, the hash ID of the commit that is currently checked-out in the your submodule Git repository. Once again, the branch name, if any, is irrelevant. It doesn't matter if your submodule repository is on branch master or develop, or has a detached HEAD; all that matters is the raw commit hash ID.

You now run git commit to make a new commit. The hash ID from your index, which controls which commit is going to be considered the "right" commit for the submodule, is the commit hash ID you recorded by running git add submodule-path. In this case, that commit ID got selected by the fact that you ran git submodule update --remote earlier. But the only thing that matters is the hash ID in your index, which goes into the new commit.

Now you can git push this commit, that you have made in your superproject Git repository, to some other system, such as your CI system. It can git checkout this commit, and this commit records the right submodule hash ID.

How can I combine this with a CI system so that the CI system picks the hash ID?

This is, obviously, a lot harder may be harder, depending on whether your CI system offers it as a feature.

Now that you know how this is all constructed, though, you have the tools you need. You must have your CI system update (or obtain) its clone of the superproject. That superproject contains, in its .gitmodules file, the URL and path for any submodules that the CI system must clone as well. It may or may not contain some branch name(s) for those submodules.

The CI system must now direct some Git—the superproject Git or the submodule Git—to have the submodule Git git checkout some commit other than the one already recorded as the correct commit, so that the superproject is no longer using the commit that the CI system checked out. In other words, you're no longer building what you submitted to the CI system. You are having the CI system build a new Frankenstein's-monster out of body parts: the main body from your commit, but a limb taken from some other commit that you didn't specify directly: instead, you're allowing someone else to specify which commit goes there. You gave your CI system a name and told it to ask them, whoever they are, what hash ID that name goes to.

Your CI system can now attempt to build and use this Frankenstein's-monster. If it all works well, your CI system will need to make a new commit, that's a lot like your commit except that it records the hash ID it got from them—whoever they are, again—for the submodule in question. Your CI system probably now also needs permission to push this commit somewhere, unless your CI system is also your main repository source-of-truth.

torek
  • 448,244
  • 59
  • 642
  • 775
  • Actually its very simple to get it to work with CI (in this case jenkins) with specifying a few additional options to git behavior, see my updated post. And that will indeed respect the branch set in the .gitmodules file. – u123 Mar 03 '19 at 11:07
  • @u123: Ah, in this case your CI system has a checkbox for its "extract commit from SCM" that tells it to use `--remote`. But note that allowing someone else to dictate which commit of theirs you use is not always a good plan. In particular, suppose Jenkins says that the build is good, but the name-to-hash-ID translation for the submodule has *changed* since then? I believe Jenkins won't automatically make a *new* commit recording the hash ID of the successful build, and that's a key step here. – torek Mar 03 '19 at 16:25
  • Regarding the latter part - assuming I understand you correctly - the second checkbox makes sure you are always on the tip on the branch specified in the sub module. – u123 Mar 04 '19 at 13:41
  • @u123: The problem is, it cannot do any such thing. It can change the commit *it* checks out. It cannot change the commit *you* check out. The whole point of having a superproject save the commit hash ID of the submodule is to pick an **exact** commit: this one and no other. Using a branch name is nearly the opposite of that: *pick any old commit, as directed by some random site off the net.* Sure, yesterday (or an hour ago) the person controlling some upstream URL said branch X is commit `a123456`, but what does that person say *now?* [continued] – torek Mar 04 '19 at 14:09
  • The CI system tested the build yesterday, or an hour ago, or a minute ago. If all went well, your system has been tested with commit `a123456` (for instance). But *now* branch X means commit `b789abc`. The only way this whole scheme can possibly work is if the branch-name-to-hash-ID mapping is something *you* control, e.g., perhaps you own both the superproject *and* the submodule, and you tell everyone *don't touch* while you let your CI system work. – torek Mar 04 '19 at 14:11
  • To put it another way: who is the *you* in the "you are always on the tip of the specified branch"? At the same time, who is the one who says what the tip of that branch is? The *you* needs to mean *both you-a-human **and** the CI system*, and the second person—the one in charge of the submodule—must also be well-controlled or well-disciplined so as not to have you-the-human and you-the-CI-system get out of sync. – torek Mar 04 '19 at 14:15
  • Nice answer but it would be even better when there were examples supporting the description (which are sometimes hard to grasp imho) – ViRuSTriNiTy May 13 '20 at 07:32
  • @ViRuSTriNiTy: the problem is, each CI system has its own quirks. When using Jenkins, those checkboxes do nothing at all if you've written your own Jenkinsfile, sometimes. It's ... annoying and there is virtually zero documentation; we've had to try a lot of different things. – torek May 13 '20 at 07:48
  • @torek I wasn't referring to Jenkins or a CI. The description of how git works internally is top notch (that hash stuff you described) but it is hard to grasp for a beginner imho. Examples with actual commands and results would help. – ViRuSTriNiTy May 13 '20 at 13:19