13

Is there a way I can only pull the latest commit in a git submodule? I was trying to put boost as a git submodule in some projects but since the boost repo with everything included is really heavyweight I wanted to only update the submodules to the latest commit and not pull all commits. Is this possible?

For example, when I do

git submodule update --init --recursive

All the boost submodules get pulled with all their commits. Can I only ask the submodules to mirror the latest commit instead of pulling all changes?

Note Shallow clones with the --depth flag do not work because that only pulls the latest commit, and the latest commit has only the changes made in that commit, so the repository is not in the right state.

Note git archive (as suggested in an answer below) does not seem to work when I try the following sequence of commands

mkdir temp-git-test
cd temp-git-test
git init
git submodule add --depth 1 https://github.com/boostorg/boost
cd boost
git archive --format=tar.gz --output ../boost.tar.gz HEAD
cd ..
tar -xzvf boost.tar.gz

The output of the unzipped repo is the same as the submodule. Am I doing something wrong?

Curious
  • "the latest commit has only the changes made in that commit, so the repository is not in the right state." — this is false. Commits are complete snapshots. – jthill Dec 12 '16 at 04:12
  • I didn't know how else to describe it, my git vocabulary is a bit weak, could you suggest an edit to the question and I will accept it? – Curious Dec 13 '16 at 00:55
  • The thing is, `--depth=1` is exactly what you want. – jthill Dec 13 '16 at 00:57

2 Answers

8

The short answer is no. The long answer is maybe, but consider another way.

Shallow clones and shallow submodules

The long answer, which lets you get partway to what you want, starts with a technical note: you're not pulling, in Git terms. In Git, "pull" means "fetch, then merge-or-rebase" and you are not going to merge-or-rebase here. In fact, when you're "init"-ing you are generally going to make the initial clones.

Each submodule is actually its own repository.1 Git is, sooner or later, going to do a git checkout within each of those repositories, asking it to check out, not a branch, but rather one specific commit, which is quite often not the latest commit. Given the nature of Git repositories and software development, and the idea that a submodule is, in the first place, a reference to a third-party repository, i.e., one you specifically do not and cannot control, the best you can do is say: "I know that my software works with one specific version of their software, and that version is <fill in the blank>." Thus, your repository lists the specific version you want from their repository.

Now we get to the heart of the problem. When you git clone a repository, or use git fetch to update an existing clone, you do so by asking for specific branch and/or tag names, rather than specific commit IDs. There is some (very limited) support for fetching specific IDs, but it must be enabled in that other repository, the one we just said that you do not and cannot control. Enabling fetch-by-ID is computationally expensive for them—whoever "they" are, the ones controlling the other repository—and not something you can do on your side, nor demand, nor is it enabled by default. This means that in general it's just not available.
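As a rough sketch of what that server-side opt-in looks like, the following builds a throwaway "upstream" repository locally, enables `uploadpack.allowReachableSHA1InWant` on it, and then fetches one commit by hash at depth 1. The directory names (`upstream`, `client`) and the `demo` identity are made up for illustration, and the `file://` URL stands in for a real server that you would normally not control.

```shell
set -e
work=$(mktemp -d)

# Hypothetical "upstream" repository with three commits.
git init -q "$work/upstream"
for i in 1 2 3; do
  echo "content $i" > "$work/upstream/file$i.txt"
  git -C "$work/upstream" add .
  git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
      commit -q -m "commit $i"
done

# Server-side opt-in: only whoever controls the upstream repository can set this.
git -C "$work/upstream" config uploadpack.allowReachableSHA1InWant true

# Pick a commit that no branch or tag points at directly.
wanted=$(git -C "$work/upstream" rev-parse HEAD~1)

# Client side: fetch exactly that commit, with no history behind it.
git init -q "$work/client"
git -C "$work/client" fetch -q --depth 1 "file://$work/upstream" "$wanted"
git -C "$work/client" checkout -q FETCH_HEAD

fetched=$(git -C "$work/client" rev-parse HEAD)
commit_count=$(git -C "$work/client" rev-list --count HEAD)
echo "HEAD=$fetched commits=$commit_count"
```

Whether the fetch succeeds against a real host depends entirely on how that host is configured, which is the point made above.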

In any case, git clone only works with names: you may git clone -b branch url, for instance, to make your new clone start by checking out that specific branch, or git clone -b tag url to make your new clone start by checking out (as a detached HEAD) that specific tag. Despite this "check out a specific branch or tag", though, the clone defaults to cloning all the names offered by the remote, and making a full-depth (i.e., non-shallow) clone.

All of this does mean something important. First, shallow clones exist. A shallow clone is one made with a --depth argument. It can be deepened by a git fetch with another --depth. The "depth" is the number of commits fetched "beyond" the commit(s) identified by the name(s) used during the clone or fetch, with some fairly complicated rules. (The details of these rules don't matter much here.)
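A small local experiment can make the depth idea concrete. This is only a sketch: the `upstream` and `shallow` directories and the `demo` identity are invented, and the `file://` URL is used because Git ignores `--depth` when cloning from a plain local path.

```shell
set -e
work=$(mktemp -d)

# Hypothetical upstream with three commits, each adding one file.
git init -q "$work/upstream"
for i in 1 2 3; do
  echo "content $i" > "$work/upstream/file$i.txt"
  git -C "$work/upstream" add .
  git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
      commit -q -m "commit $i"
done

# Shallow clone: only the tip commit is transferred.
git clone -q --depth 1 "file://$work/upstream" "$work/shallow"

shallow_commits=$(git -C "$work/shallow" rev-list --count HEAD)
shallow_files=$(git -C "$work/shallow" ls-files | wc -l | tr -d ' ')
echo "commits=$shallow_commits files=$shallow_files"

# Deepen the existing clone by fetching with a larger depth.
git -C "$work/shallow" fetch -q --depth 2
deeper_commits=$(git -C "$work/shallow" rev-list --count HEAD)
echo "commits after deepening=$deeper_commits"
```

The depth-1 clone holds a single commit yet tracks all three files, because a commit is a complete snapshot rather than a set of changes.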

Second, because shallow clones exist, shallow submodules also exist. A shallow submodule is simply a submodule that is cloned with --depth. But there is a problem: there is no easy or obvious way to determine what depth is needed. You can pass a --depth argument to git submodule add or git submodule update, but it's not obvious how deep you should go.

Here's the problem: your submodule will be cloned, perhaps by a branch or tag name, but then your submodule will be told to check out one particular commit (by its raw hash ID). Will that commit be in the clone? What depth guarantees that it will? If you are cloning by tag name, and the tag always names the correct commit, you can use --depth 1 (and hence you can use --shallow-submodules during the initial git clone as well), but that only works if, well, see above.
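The favorable case (the recorded commit is still the tip of what gets cloned) can be sketched with local repositories. Everything named here (`lib`, `super`, `fresh`, the `commit` helper, the `demo` identity) is hypothetical, and `-c protocol.file.allow=always` is only needed because recent Git versions refuse `file://` submodule clones by default.

```shell
set -e
work=$(mktemp -d)

# Helper (illustrative only): commit with a throwaway identity.
commit() {
  git -C "$1" -c user.name=demo -c user.email=demo@example.com commit -q -m "$2"
}

# Hypothetical library with some history.
git init -q "$work/lib"
for i in 1 2 3; do
  echo "lib $i" > "$work/lib/lib$i.txt"
  git -C "$work/lib" add .
  commit "$work/lib" "lib commit $i"
done

# Superproject that records the library's current tip as a submodule.
git init -q "$work/super"
git -C "$work/super" -c protocol.file.allow=always \
    submodule --quiet add "file://$work/lib" lib
commit "$work/super" "add lib submodule"

# Fresh clone of the superproject, with each submodule cloned at depth 1.
git -c protocol.file.allow=always clone -q \
    --recurse-submodules --shallow-submodules \
    "file://$work/super" "$work/fresh"

sub_commits=$(git -C "$work/fresh/lib" rev-list --count HEAD)
sub_head=$(git -C "$work/fresh/lib" rev-parse HEAD)
recorded=$(git -C "$work/fresh" rev-parse HEAD:lib)
echo "submodule commits=$sub_commits head=$sub_head recorded=$recorded"
```

For an already-cloned superproject, the corresponding command is `git submodule update --init --recursive --depth 1`. If the library gains commits after the superproject recorded its hash, the recorded commit may no longer be within depth 1 of any name, and the checkout can fail as described above.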


1 What's special about these sub-repositories is that they are:

  1. listed in the outer repository (in a .gitmodules file);
  2. generally kept in "detached HEAD" mode;
  3. and detached at a commit whose ID is stored in the outer repository.

The modules file lists the names and URLs for the various submodules. "Initializing" a submodule amounts to copying stuff from .gitmodules to the configuration file for the containing superproject, and "updating" a submodule usually amounts to cloning or fetching. The commit at which the submodule is to be detached is recorded in the superproject's repository as a "gitlink" entry in a tree object.

Submodule support has grown rather complex in modern versions of Git though, so now there are more things you can do when doing the update step.
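To see the two pieces of bookkeeping mentioned in this footnote, here is a minimal sketch using throwaway repositories. The names `lib` and `super`, the `commit` helper, and the `demo` identity are assumptions made for the example.

```shell
set -e
work=$(mktemp -d)

# Helper (illustrative only): commit with a throwaway identity.
commit() {
  git -C "$1" -c user.name=demo -c user.email=demo@example.com commit -q -m "$2"
}

# Hypothetical library and a superproject that uses it as a submodule.
git init -q "$work/lib"
echo "lib" > "$work/lib/lib.txt"
git -C "$work/lib" add .
commit "$work/lib" "initial lib commit"

git init -q "$work/super"
git -C "$work/super" -c protocol.file.allow=always \
    submodule --quiet add "file://$work/lib" lib
commit "$work/super" "add lib submodule"

# The URL lives in .gitmodules ...
url=$(git -C "$work/super" config -f .gitmodules submodule.lib.url)

# ... while the commit to check out is a "gitlink" tree entry (mode 160000).
entry=$(git -C "$work/super" ls-tree HEAD lib)
mode=${entry%% *}
gitlink=$(git -C "$work/super" rev-parse HEAD:lib)
lib_tip=$(git -C "$work/lib" rev-parse HEAD)
echo "url=$url"
echo "tree entry: $entry"
```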


Reference clones

There is a much better, more general solution for many cases. Instead of fussing with shallow clones, you can point Git at a reference clone. The reference clone is any clone of the repository you're trying to clone.2 Ideally, it's a recent and reasonably up-to-date clone of the repository you are cloning, but any clone will do.

What Git does with a reference clone is a bit complicated (see the documentation for details), but the short version is that when cloning some repository, instead of getting all the objects over the network from some distant server (which may be slow and/or rate-limited), your Git will ask the distant server what objects and such it needs, then look at your local3 reference clone to see if it already has those objects. If so, it will "borrow" them from the reference clone.

This lets you obtain a full, complete, up-to-date clone while using very little network and storage resources, since you no longer need to bring (most or all of) the data over, nor (unless you use --dissociate) store it yourself. That in turn means you need not worry about your shallow clone being too shallow: you just make one slow full clone, then reference the heck out of it for all other clones, which go fast. Using reference clones can cut the time to clone a few big GitHub repositories from an hour-plus down to tens of seconds, for instance.
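A minimal sketch of the idea, using only local throwaway repositories: `upstream` stands in for the distant server, `reference` for the one slow full clone, and `borrowing` and `standalone` for later clones. All of these names, and the `demo` identity, are invented for the example.

```shell
set -e
work=$(mktemp -d)

# Hypothetical upstream with three commits.
git init -q "$work/upstream"
for i in 1 2 3; do
  echo "content $i" > "$work/upstream/file$i.txt"
  git -C "$work/upstream" add .
  git -C "$work/upstream" -c user.name=demo -c user.email=demo@example.com \
      commit -q -m "commit $i"
done

# The one "slow" full clone, kept around as the reference.
git clone -q "file://$work/upstream" "$work/reference"

# Later clones borrow objects from the reference instead of transferring them again.
git clone -q --reference "$work/reference" \
    "file://$work/upstream" "$work/borrowing"
alternates="$work/borrowing/.git/objects/info/alternates"
echo "borrowing objects from: $(cat "$alternates")"

# --dissociate copies the borrowed objects, so this clone survives
# even if the reference is later deleted.
git clone -q --reference "$work/reference" --dissociate \
    "file://$work/upstream" "$work/standalone"
standalone_commits=$(git -C "$work/standalone" rev-list --count HEAD)
echo "standalone commits=$standalone_commits"
```

For submodules, the same option is passed through `git submodule update --init --reference <repository>`, which applies it to the clone that creates each submodule.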


2 Technically, the reference could be any repository at all. A repository not actually related to the one you are cloning is going to make a lousy reference, though: it will have none of the objects you need, and will provide no speedup at all. (It could even have the wrong data under the object's name, although the chances of this are vanishingly small. This cannot happen if the reference is correct since object names cannot be reused this way.)

3 The reference should be "as local as possible" for speed, but does not really have to be on your machine, just accessible. If the reference will not always be present, you will probably want to add --dissociate, so that the objects get copied from the reference clone into the new clone. This uses more disk space, of course.

torek
  • Sorry for replying to this so late but the idea was that I wanted to incorporate several existing libraries into a project as git submodules but that is not scaling because the time that it takes to clone all those repositories is significant. Shallow clones don't work because they don't have what the repository would look like when the latest commit would be checked out. And reference clones would not work because I want to be able to do this on a machine that does not already have a clone in place – Curious Dec 06 '16 at 05:15
  • How do I add the reference to a submodule? – jasongregori Apr 25 '17 at 16:32
  • @jasongregori: I'm not sure precisely what you're asking: do you already have the submodule in place, and want to change the associated *commit hash*, or do you not have a submodule, and want to add one? In one case, `git add` the submodule to change the associated hash (be sure to *avoid* a trailing slash) and in the other, `git submodule add` (see `git submodule`) or, if you have a very new Git, consider `git submodule absorbgitdirs`. – torek Apr 25 '17 at 16:37
  • I have a repo with submodules already and I want to add a reference to one of the submodules since I already have a copy of the repo some where else on disk. – jasongregori Apr 25 '17 at 17:42
  • Ah, I think I see: you mean you want to add `--reference` to the `git clone` that creates the submodule, when you `git submodule init` and `git submodule update`. Read the bit of the docs for `git submodule update --init --reference`. (Annoyingly, you cannot *add* a `--reference` to an existing repository.) – torek Apr 25 '17 at 18:40
5

Note Shallow clones with the --depth flag do not work because that only pulls the latest commit, and the latest commit has only the changes made in that commit, so the repository is not in the right state.

Then combine a git archive of the boost repo with a shallow clone setting for your submodule:

  • your submodule is still shallow
  • but then you overwrite its (supposedly incomplete) content with the complete content of a git archive image of the same repo, making the working tree an exact replica of the remote repo at that SHA1.

From there, each (shallow) refresh builds on content that was already complete, so the working tree stays up to date.

git archive is done in a local clone of the repo:

git archive --format=tar HEAD

If you don't have a local clone, but the boost repo is on GitHub (like, for instance, boostorg/boost), then you can get a compressed image of the current HEAD with a simple curl (no need for git archive then).


As the comments show, extracting an archive adds nothing: it contains exactly the same content as the commit that is already checked out.

However, this seems incomplete:

git submodule add --depth 1 https://github.com/boostorg/boost

For git submodule update --remote to work (i.e., to fetch the latest commit instead of staying at the initially recorded SHA1), you would need:

git submodule add -b master --depth 1 https://github.com/boostorg/boost

Then a git submodule update --init --recursive --remote would fetch the last commit.
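The same sequence can be tried offline as a sketch. Here a local `lib` repository plays the role of boost, `super` is the project using it, and the branch name is read from `lib` rather than assumed to be master. These names, the `commit` helper, and the `demo` identity are all hypothetical, and `-c protocol.file.allow=always` is only there so recent Git versions accept a `file://` submodule.

```shell
set -e
work=$(mktemp -d)

# Helper (illustrative only): commit with a throwaway identity.
commit() {
  git -C "$1" -c user.name=demo -c user.email=demo@example.com commit -q -m "$2"
}

# Hypothetical library standing in for the boost repository.
git init -q "$work/lib"
echo one > "$work/lib/a.txt"
git -C "$work/lib" add .
commit "$work/lib" "first"
branch=$(git -C "$work/lib" symbolic-ref --short HEAD)

# Add it as a shallow submodule that follows a branch.
git init -q "$work/super"
git -C "$work/super" -c protocol.file.allow=always \
    submodule --quiet add -b "$branch" --depth 1 "file://$work/lib" lib
commit "$work/super" "add lib, tracking $branch"

# Upstream moves on after the superproject recorded its commit.
echo two > "$work/lib/b.txt"
git -C "$work/lib" add .
commit "$work/lib" "second"
latest=$(git -C "$work/lib" rev-parse HEAD)

# --remote checks out the branch tip instead of the recorded commit.
git -C "$work/super" -c protocol.file.allow=always \
    submodule --quiet update --init --recursive --remote --depth 1

sub_head=$(git -C "$work/super/lib" rev-parse HEAD)
echo "submodule is now at $sub_head"
```

After this, the superproject sees `lib` as modified. The new commit only becomes the recorded one once you `git add lib` and commit in the superproject.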

See "Git submodules: Specify a branch/tag".

Community
VonC
  • Could you please give an example of how to work with `git archive` in your answer as well? I have never used it before.. – Curious Dec 06 '16 at 05:43
  • @Curious I have added a link to git archive man page, an example, and an alternative. – VonC Dec 06 '16 at 05:49
  • This does not seem to work, it only creates a tarball of the current state of the submodule.. – Curious Dec 07 '16 at 06:55
  • @Cur Yes that is what git archive does. That will complete the incomplete content of a shallow depth 1 clone of your submodule. – VonC Dec 07 '16 at 07:08
  • But it did not complete the incomplete content.. it only tarred the current state of the submodule.. what am I doing wrong? I went into the submodule and typed the command you had with the `--output` flag – Curious Dec 07 '16 at 16:54
  • @Curious you still need to uncompress that archive where your shallow clone of your submodule is done. – VonC Dec 07 '16 at 17:01
  • I did do that.. the tarball contains the same thing as the shallow clone of the submodule.. – Curious Dec 07 '16 at 22:51
  • @Curious Then what elements are missing? – VonC Dec 07 '16 at 22:52
  • @Curious "Note Shallow clones with the --depth flag do not work because that only pulls the latest commit, and the latest commit has only the changes made in that commit,". I would dispute the last part of that sentence: the last commit represents the *whole* tree (a complete set of files) – VonC Dec 08 '16 at 05:21
  • Could you maybe post a sequence of commands that you used to achieve this for the boost repo for example? I am not able to get it to work – Curious Dec 08 '16 at 05:36
  • @Curious if the tarball already represents the same content as the one you are seeing, that means your submodule is already initialized to its right content. – VonC Dec 08 '16 at 05:40
  • I updated my question to include the exact steps that I am following. Could you let me know what I am doing wrong? – Curious Dec 08 '16 at 05:45
  • You said "As seen in the comment, adding the content of an archive is of no use, as it represents the same content of the commit." could you explain what you mean by this? do you mean that git archive actually just archives the latest commit and does not restore the repo to the right state? – Curious Dec 08 '16 at 05:51
  • @Curious git archive represents the last commit. And your submodule is checked out at the last commit. Hence the same content. – VonC Dec 08 '16 at 05:51
  • Ah so there is no other alternative than pulling the whole file with the `git submodule update --init --recursive`? – Curious Dec 08 '16 at 05:53
  • @Curious I have updated my answer: if you want the *last* commit, you need to make your submodule follow a branch. – VonC Dec 08 '16 at 05:55