I've recently been introduced to --depth 1 for git clone. Apparently this doesn't get all the history and is much faster. I used:

git clone --depth 1 -b develop https://github.MyCompany.com/CoolProduct/CoolProduct.git

This allowed me to play, modify and branch off the develop branch.

However, now I want to look at another branch, "BillsFeature". I tried git checkout BillsFeature and got:

error: pathspec 'BillsFeature' did not match any file(s) known to git

This makes some sense to me. Presumably because I used --depth 1, I didn't pull down the branch names. How do I get another branch? I don't need the history with BillsFeature either. I should say that I tried: git fetch --depth 1 origin BillsFeature and something seemed to happen. However, when I did git status I got:

On branch develop
Your branch is up to date with 'origin/develop'.

nothing to commit, working tree clean

Thanks, Dave

Dave
  • It's not immediately obvious that this is a duplicate, but in fact it is, because `--depth` implies `--single-branch` unless you also add `--no-single-branch`. – torek Nov 02 '20 at 17:40
  • I don't see how it's a duplicate. Given that I used --depth (same as --single-branch), how do I get other branches? – Dave Nov 02 '20 at 17:44
  • Also, it seems like the accepted answer is just getting the entire history and all branches? Can I just get one branch at a time with no history? – Dave Nov 02 '20 at 17:48
  • Because `--depth 1` implies `--single-branch`, you got only one commit for one branch. Use the `--no-single-branch` option during cloning to get one commit for *each* branch (and hence the right number of remote-tracking names), or use the accepted answer to update your existing clone, then run `git fetch --depth 1` to update things so that you have 1 commit for each remote-tracking name. – torek Nov 02 '20 at 17:49
  • (If you'd like, I'll re-open this for direct answers, rather than the sort of indirect realization that `--depth` implies `--single-branch` so therefore `--no-single-branch` is required to defeat it.) – torek Nov 02 '20 at 17:51
  • You can always do `git fetch --unshallow`, please see the https://stackoverflow.com/a/6802238/2443502 – Marcin Kłopotek Nov 02 '20 at 18:11
  • @MarcinKłopotek: `--unshallow` still leaves one with a single-branch clone; the primary issue here is making it non-single-branch. (After that, the secondary issue is what depth the clone should have.) – torek Nov 02 '20 at 23:38
  • Personally @torek, I'd prefer you put the part about "use the accepted answer" and "git fetch --depth 1" in an answer and let me accept it. I think it would be useful for folks. If I were rewriting the question today, I'd write "I have a large GIT repository. I'd like to get our main branch called develop as quickly as possible, but I'd also like to be able to get other branches later. How do I do this?" I don't see any questions/answers that directly answer this question. Thanks! – Dave Nov 04 '20 at 17:37
  • OK, I've re-opened it and will put in an answer that mostly refers to the other answer. – torek Nov 05 '20 at 04:50

1 Answer

The root of the problem here is that git clone's --depth option turns on --single-branch as well. To defeat that at clone time, use --no-single-branch. To defeat it afterwards, see the accepted answer to How do I "undo" a --single-branch clone?

Note that after de-single-branch-ing your existing clone, you will have to run git fetch --depth 1 again. This retrieves the rest of the branch names from the repository you cloned (all of them become remote-tracking names; see the details below) and allows you to run git checkout on each such name to create a local branch with the same name. You can also use git remote set-branches --add to add individual names to an existing remote; again, you'll need another git fetch --depth 1 afterwards.
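As a rough sketch of that recipe, the script below builds a throwaway stand-in for the real repository (the temporary directory, the empty commits, and the demo identity are all assumptions made for illustration; only the branch names develop and BillsFeature come from the question), makes the same kind of shallow clone, and then widens it. It assumes the remote is named origin.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical upstream standing in for the real server-side repository.
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/develop
commit() {
    git -C "$work/upstream" -c user.name=Demo -c user.email=demo@example.com \
        -c commit.gpgsign=false commit -q --allow-empty -m "$1"
}
commit "develop 1"; commit "develop 2"
git -C "$work/upstream" checkout -q -b BillsFeature
commit "feature 1"; commit "feature 2"
git -C "$work/upstream" checkout -q develop

# The clone from the question: depth 1, which implies --single-branch.
git clone -q --depth 1 -b develop "file://$work/upstream" "$work/clone"
cd "$work/clone"

# Undo the single-branch restriction for every branch ...
git remote set-branches origin '*'
# ... or add just one name instead:
#   git remote set-branches --add origin BillsFeature

# Fetch again, repeating the depth limit, then check out the new branch.
git fetch -q --depth 1 origin
git checkout -q BillsFeature

git rev-list --count BillsFeature   # prints 1: still no history
```

The earlier git fetch --depth 1 origin BillsFeature from the question did download the commit, but because the remote's configured fetch setting named only develop, no origin/BillsFeature name was created for git checkout to find.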

Optional Reading: Details, or, why the above works

A Git repository—technically, a non-bare repository—really consists of the following three parts:

  • a pair of databases, as described below;
  • an index, by which Git knows what files to commit, i.e., which files to track, although there is much more to the index than just a list of files; and
  • a working tree or work-tree in which you are able to use and modify your files. These files are literally yours, and are not actually in Git at all. The files inside Git, in the main database, are all read-only and in a special compressed and de-duplicated form that only Git itself can use.

When you run git clone, you have your Git copy the main database—the one holding all the commits and files and such—more or less wholesale, but have it read the other database, parse through it and understand it and write, to your clone, a different database.

The --depth flag affects the main database, so that you don't copy it wholesale. The --single-branch flag—which, as we noted, --depth turns on automatically—affects the secondary database. Before we go on, let's give the two databases names, so that we don't keep referring to some awkward phrase like "the party of the first part":

  • The thing I've been calling the "main database" is Git's object store. This is a simple key-value database in which the keys are hash IDs, and the values are Git's commits and other internal objects.[1] Usually this is the largest part of a Git repository.[2]

  • The second database is also a simple key-value store, with the keys being names (branch and tag names included, but also almost all of Git's other names[3]) and the values being hash IDs. Each name stores just one hash ID, as that's all that is required.

So, to recap, git clone will (without the --single-branch and --depth flags, anyway) call up some other Git and have it list out all of its branch and tag and other names. It will then use these names to find all the commits and other Git objects in the original repository, and have the other Git send over all of those objects. The result is a full copy of the object database.[4] You now have all of the commits from the other Git repository.

At the same time, though, your own Git takes all of their names and picks and chooses which names to take, and what to do with them. In general, your Git takes all of their branch names (whose full spellings are things like refs/heads/master, refs/heads/topic, and so on) and renames them to become your own remote-tracking names instead: refs/remotes/origin/master, refs/remotes/origin/topic, and so on. Your Git then creates its own independent name-to-hash-ID database, with no branch names in it.[5]

The end result is that immediately after this step of git clone, you have all the commits and none of the branches! This situation is quickly rectified by the last step of git clone, though. Provided you didn't say --no-checkout, the last step of git clone is to run git checkout, and this step actually creates one branch. The branch name your Git creates is the one you supplied with the -b option. If you did not supply a -b option, your Git asks the other Git which branch it recommends, and if all else fails, your Git assumes your own default initial branch name.[6]
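To see the renaming for yourself, here is a small sketch that makes an ordinary (full) clone of a throwaway repository and lists every name in the clone's name database. The upstream repository, its empty commits, and the demo identity are stand-ins invented for the example; the branch names are borrowed from the question.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical upstream with two branches.
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/develop
commit() {
    git -C "$work/upstream" -c user.name=Demo -c user.email=demo@example.com \
        -c commit.gpgsign=false commit -q --allow-empty -m "$1"
}
commit "develop 1"
git -C "$work/upstream" checkout -q -b BillsFeature
commit "feature 1"
git -C "$work/upstream" checkout -q develop

# A normal clone: no --depth, no --single-branch, no -b.
git clone -q "file://$work/upstream" "$work/full"

# Every name in the clone. Their branches show up under refs/remotes/origin/;
# the only name under refs/heads/ is the one the final checkout step created.
git -C "$work/full" for-each-ref --format='%(refname)'

# Local branch names only.
git -C "$work/full" for-each-ref --format='%(refname)' refs/heads
```

The second listing shows a single local branch, develop, because that is what the upstream's HEAD recommends; BillsFeature exists only as refs/remotes/origin/BillsFeature until you check it out.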


[1] Each commit object refers to a (single) tree object, which holds the snapshot for that commit, and has metadata. Each tree object holds an array of partial file names—name components, that will be strung together as needed—and another hash ID. That hash ID identifies either another tree, or a blob object that stores some file's content. Git builds up the files' full names by reading all the sub-trees as needed, and stores the full file names in its index, and then extracts the files using the names and blob hash IDs as seen in the index. This isn't a complete description, but is why Git can't store empty directories: there's no way to put one into Git's index.

The object database can also contain annotated tag objects, each of which holds a hash ID, usually that of a commit. These are how Git provides its annotated tags.

[2] There are exceptions: old repositories that for some reason keep accumulating new names, e.g., new branch and tag names, but hardly ever get any new commits. But in general the object database is where most of the space is used, and most of the time for an initial clone.

[3] The other names include things like notes, in-progress bisection, names needed during some interactive rebases, and so on. Basically any name that will store a single hash ID goes into this database. Names that don't do that, such as the names of remotes like origin, don't go in here. Those generally go in the config file in the .git directory.

This database is currently implemented rather poorly. Sometimes the names are stored as directory-and-file-names in the file system, which means that on case-insensitive file systems such as the default ones on Windows and macOS systems, branch names become case-insensitive. Sometimes the names are stored in a plain-text file named packed-refs, which makes them all case-sensitive as Git always intended. A few special names, such as HEAD, never go into the packed-refs file at all and are instead always stored as individual files within the .git directory. There is work going on right now to provide a proper database, to solve a bunch of issues here.

[4] Technically, the result can and usually will omit any objects that cannot be found by using the names. We'll ignore this fine distinction here, though.

[5] Your Git will normally omit all of their non-branch non-tag names too. How it handles their tag names is complicated, but in a normal (not single-branch, not depth-limited) clone you normally wind up copying all their tag names.

[6] This used to be hard-coded as master; since Git 2.28 it is configurable through the init.defaultBranch setting.


How --single-branch affects this

With the --single-branch option, your Git doesn't use all of their names. Instead, your Git uses only the one branch name from your -b option, with the same default: if you don't supply -b, your Git asks their Git what they recommend, or falls back on yet another default. Your Git then transforms that one branch name into one remote-tracking name. It makes sure to ask their Git only for commits that are on that branch, in that other Git repository.

The end result is that you get one remote-tracking name, and some subset of all of their commits. The final git checkout step then creates one local branch name: the same name your Git used when selecting the subset of commits to obtain.
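The restriction is recorded in the clone's configuration as the remote's fetch setting, and comparing the two kinds of clone makes it visible. The sketch below uses a throwaway upstream repository (the paths, empty commits, and demo identity are assumptions for illustration) and assumes the default remote name origin.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical upstream with two branches.
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/develop
commit() {
    git -C "$work/upstream" -c user.name=Demo -c user.email=demo@example.com \
        -c commit.gpgsign=false commit -q --allow-empty -m "$1"
}
commit "develop 1"
git -C "$work/upstream" checkout -q -b BillsFeature
commit "feature 1"
git -C "$work/upstream" checkout -q develop

git clone -q --single-branch -b develop "file://$work/upstream" "$work/single"
git clone -q "file://$work/upstream" "$work/full"

# Single-branch clone: only one name is mapped.
git -C "$work/single" config --get-all remote.origin.fetch
# prints +refs/heads/develop:refs/remotes/origin/develop

# Normal clone: a wildcard maps every branch name.
git -C "$work/full" config --get-all remote.origin.fetch
# prints +refs/heads/*:refs/remotes/origin/*
```

This setting is what git remote set-branches rewrites, which is why changing it and then fetching again brings in the other names.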

How --depth affects this

Aside from automatically turning on --single-branch—but note that you can turn this off with --no-single-branch—what --depth does is to create a shallow clone. To understand shallow clones completely, we have to get into graph theory. (We won't go very far with this here, though.)

In Git, each branch name identifies exactly one commit. But a branch in Git—if we ignore the question of What exactly do we mean by "branch"? (we shouldn't ignore it, but we will here)—usually has a bunch of commits. How does this work?

The answer is that each commit in Git contains the hash ID of some earlier commit. In the usual simple case, we end up with a long string of commits, each of which points backwards to one earlier commit. The last commit in this chain is the tip of the branch, or tip commit.

Let's draw a simple chain where we use one uppercase letter to stand in for the real hash ID of each commit. Hash H will be the last one in the chain, and we'll say that this is branch br1:

... <-F <-G <-H   <-- br1

The name br1 holds the hash ID of the last commit H. That's how we can have Git fish it out of the object database (which, remember, is a simple key-value store: the hash ID is the key). But inside the body of commit H, Git has stored the hash ID of earlier commit G. So from H we can get G's ID, and have Git look up commit G in the key-value store. Meanwhile commit G has F's ID, so we can walk backwards from G to F.

This is how Git works: backwards. A name, like a branch or tag or remote-tracking name, stores one hash ID. That's the commit we want, and then, if we want all the commits, Git walks backwards from that commit to the previous commit, and then keeps walking. The name lets us get started; the commits themselves provide the rest of the path.

The path we traverse, and all the commits we collect as we walk this path, are the reachable commits on that branch.[7] When two branches diverge, they have some sequence that's common to both:

             I--J   <-- br1
            /
...--F--G--H   <-- shared
            \
             K--L   <-- br2

Here, commits up through H are on all three branches, and the last two commits on each of the br* branches are unique to their branch.

This reachability idea is at the heart of Git. It's also how --depth works. If we say --depth 1, we are telling our Git: When you obtain commits from the other Git, only go one step. If we use --depth 1 here, we get:

             i--J   <-- br1

        g--H   <-- shared

             k--L   <-- br2

If we use --depth 2, we tell our Git: When you obtain commits from the other Git, go two steps. This time we get:

             I--J   <-- br1
            /
     f--G--H   <-- shared
            \
             K--L   <-- br2

Note that if br2 had more commits unique to it, we wouldn't have the connection from br2 back to shared.

The lowercase commit letters here denote the fact that Git knows there's a parent, but that these parents are marked as "missing on purpose". More precisely, Git records the hash IDs of the commits at the shallow boundary, the ones whose parents were cut off, in a file called shallow in the .git directory. Git knows not to try to load the parents of these commits from the object database, and that it's not a bug that they're missing. Normally, that would be a bug.

Since they're missing-on-purpose, git log can't and won't show these commits, and it will be as if the shallow-grafted commits have no parents at all. That's misleading in a way, but also what you should expect. In most cases, it's harmless enough.
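A quick way to watch the depth limit at work is to clone the same repository at two depths and count what git log would be able to walk. This is only a sketch: the three-commit upstream, its paths, and the demo identity are made up for the example.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical upstream: a single chain of three commits on develop.
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/develop
commit() {
    git -C "$work/upstream" -c user.name=Demo -c user.email=demo@example.com \
        -c commit.gpgsign=false commit -q --allow-empty -m "$1"
}
commit "develop 1"; commit "develop 2"; commit "develop 3"

git clone -q --depth 1 -b develop "file://$work/upstream" "$work/d1"
git clone -q --depth 2 -b develop "file://$work/upstream" "$work/d2"

git -C "$work/d1" rev-list --count HEAD   # prints 1
git -C "$work/d2" rev-list --count HEAD   # prints 2

# Git knows the clone is shallow ...
git -C "$work/d1" rev-parse --is-shallow-repository   # prints true

# ... because of this file. In the depth-1 clone it holds the hash ID of the
# one commit we have, whose parent was cut off.
cat "$work/d1/.git/shallow"
git -C "$work/upstream" rev-parse develop   # same hash ID
```

The file:// URL matters in this experiment: with a plain local path, git clone warns that --depth is ignored and copies everything.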


[7] This assumes the name we used was a branch name. If we used a tag name, these are the commits reachable from the tag; if we used a remote-tracking name, these are the commits reachable from the remote-tracking name. Since all names use the same system, each name provides some way to reach some set of commits.


It's the git fetch operation that gets commits

When we use git clone, we're really running the equivalent of a six-command sequence, five of which are Git commands:

  1. mkdir, to create a new empty directory / folder;
  2. git init, to create a new empty repository in the directory made in step 1;
  3. git remote add, to add the name origin, or some other name of our choice, and a URL and a fetch configuration (that's the one we change to defeat the single-branch-ness);
  4. git config, if needed, to add configuration options specified at the git clone command;
  5. git fetch, to obtain commits and make remote-tracking names for the branch or branches chosen in step 3; and
  6. git checkout, to create one local branch name and fill in Git's index and our working tree.

The --depth option is passed to the git fetch at step 5. So if we have to adjust our origin remote configuration, to de-single-branch the clone because step 3 added the remote with one particular branch only (see the git remote documentation), we have to run a new git fetch. This new git fetch needs the same --depth option.
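The six steps can be approximated by hand, which shows where each clone option ends up. Treat this as a sketch rather than what git clone literally runs: the upstream repository, the manual directory name, and the demo identity are invented for the example, and step 4 is skipped because no extra configuration was requested.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical upstream with two branches.
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/develop
commit() {
    git -C "$work/upstream" -c user.name=Demo -c user.email=demo@example.com \
        -c commit.gpgsign=false commit -q --allow-empty -m "$1"
}
commit "develop 1"; commit "develop 2"
git -C "$work/upstream" checkout -q -b BillsFeature
commit "feature 1"
git -C "$work/upstream" checkout -q develop

# Roughly: git clone --depth 1 -b develop <url> manual
mkdir "$work/manual"                                  # step 1
cd "$work/manual"
git init -q                                           # step 2
# Step 3: -t develop is where the single-branch restriction is recorded.
git remote add -t develop origin "file://$work/upstream"
# Step 4 (git config) is not needed here.
git fetch -q --depth 1 origin                         # step 5: --depth lands here
git checkout -q -b develop --track origin/develop     # step 6

git config --get-all remote.origin.fetch
# prints +refs/heads/develop:refs/remotes/origin/develop
git rev-list --count HEAD                             # prints 1
```

Because the depth limit belongs to step 5 and is not stored with the remote in step 3, any later git fetch has to be told the depth again.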

Conclusion

The --depth option to git clone does two things: it turns on --single-branch, which limits the set of names (and thus commits) obtained from the other Git repository, and it passes --depth to the fetch step, which limits the depth of the commit graph obtained from the other Git repository. Using --no-single-branch at clone time inhibits the name-restricting while keeping the depth-restricting. If you need to undo the name-restricting, or if you use git remote to update the set of restricted branch names, you must run git fetch again. If you want that git fetch to have a depth restriction, you must pass --depth again.
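For a fresh clone, the clone-time variant looks like the sketch below: one commit per branch, with every branch name available as a remote-tracking name from the start. The upstream repository, its paths, and the demo identity are stand-ins for illustration; the branch names come from the question.

```shell
set -eu
work=$(mktemp -d)

# Hypothetical upstream with two branches of two commits each.
git init -q "$work/upstream"
git -C "$work/upstream" symbolic-ref HEAD refs/heads/develop
commit() {
    git -C "$work/upstream" -c user.name=Demo -c user.email=demo@example.com \
        -c commit.gpgsign=false commit -q --allow-empty -m "$1"
}
commit "develop 1"; commit "develop 2"
git -C "$work/upstream" checkout -q -b BillsFeature
commit "feature 1"; commit "feature 2"
git -C "$work/upstream" checkout -q develop

# Depth-limited, but not name-limited.
git clone -q --depth 1 --no-single-branch -b develop \
    "file://$work/upstream" "$work/clone"
cd "$work/clone"

git config --get-all remote.origin.fetch    # prints +refs/heads/*:refs/remotes/origin/*
git rev-list --count origin/develop         # prints 1
git rev-list --count origin/BillsFeature    # prints 1

# The other branch is one checkout away, with no further fetch needed.
git checkout -q BillsFeature
```

On a repository with many branches this downloads one commit's worth of files per branch tip, so it is slower than the single-branch form; whether that trade is worthwhile depends on how many branches you expect to visit.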

Note that git fetch does respect existing shallow graft points, so in some cases, omitting the --depth is somewhat harmless. For instance, if you have a single-branch clone of a repository that looks like this:

...--V--W--X   <-- main
            \
             Y--Z   <-- topic

and your single-branch clone is depth 1 on main, so that commit W is marked as a shallow graft point:

        w--X   <-- main

then adding topic without a --depth gets you:

        w--X   <-- main
            \
             Y--Z   <-- topic

That is, main didn't get any deeper this time. But if the graph were:

...--V--W--X   <-- main
      \
       Y--Z   <-- topic

and you added topic and fetched without a new --depth, you would get:

...--V  w--X   <-- main
      \
       Y--Z   <-- topic

in your clone, which means you'd have to get commit V and everything earlier. Note that commit W remains marked-and-missing: since it's missing, your Git can't see that w would connect back to V and your own Git will show you this as:

           X   <-- main

...--V--Y--Z   <-- topic

—which isn't wrong, technically, it's just misleading.

torek
  • Thank you torek, both for the answer and the background information. I suspect this will be useful to many people in the future. Git is widely used, and I believe the question of "I want to get up and running quickly, but then get other branches later" is a common one. Thanks! – Dave Nov 06 '20 at 17:03