
I am very new to Git and I am learning about branches and how to pull/push. Here is my current workflow:

On my laptop:

  • Folder 1: I have one folder containing my "original" project (contains various scripts)
  • Folder 2: I have a second folder which represents my "fake" collaborator (don't have anyone to practise with!)
  • On GitHub, I have a shared central repository.

To be clear, the chronological order of what I did is:

  • I created Folder 1 and made some commits
  • I created this empty repository on GitHub
  • I pushed everything from Folder 1 to this central repository
  • I then cloned this repository into Folder 2
  • I then pushed and pulled commits back and forth between my Folders and the GitHub repository.

At this stage, all seemed well. However, I then started playing around with the concept of branches. In chronological order:

  • I created a local branch in Folder 1, made some commits and then merged this branch into my master branch.
  • I then pushed my master branch to the GitHub central repository
  • I then tried to pull the master branch from the GitHub central repository to Folder 2.

This pull seems to have worked (git log showed that the two Folders had the same history of commits - all commits are there) but I have noticed a FETCH_HEAD file in my Folder 2. This file is empty, and it was never there when I previously pushed and pulled from Folder 2.

Am I missing something here? I can't quite figure out whether I did something wrong, or whether this is maybe to do with the fact that I am using 2 folders on the same laptop (i.e. my collaborator is using the same Git password etc). Am I meant to be seeing some FETCH_HEAD file?

From what I understand, if you make a local branch on your laptop, you can push it and your collaborator can pull it with git fetch.. right? I'm just confused here, because this seemed like a routine push and pull using master branches only.

Apologies if my question is very basic. If it helps, here is the output from Git when I pulled into Folder 2:

# Output:
# remote: Counting objects: 11, done.
# remote: Compressing objects: 100% (4/4), done.
# remote: Total 11 (delta 6), reused 11 (delta 6), pack-reused 0
# Unpacking objects: 100% (11/11), done.
# From github.username/VC-exercise
#  * branch            master     -> FETCH_HEAD
#    4fadbae..d99886d  master     -> origin/master
# Updating 4fadbae..d99886d
# Fast-forward
#  README.md        | 2 ++
#  data/adapters.fa | 0
#  2 files changed, 2 insertions(+)
#  create mode 100644 data/adapters.fa

Thanks.

UPDATE

I was not precise enough. When I talk about FETCH_HEAD in my question, I am not talking about .git/FETCH_HEAD. This file is present in my Folder 2 but on top of that, I have an empty file called FETCH_HEAD directly in my Folder 2, next to all my scripts etc. That is what's bothering me. Surely this is not normal.

Additionally, when I type git branch --all in Folder 1, I get this, which looks normal to me:

*master
branch-I-made
remotes/origin/master

When I type git branch --all in Folder 2, however, I get:

*master
remotes/origin/master -> origin/master
remotes/origin/master

What does "remotes/origin/master -> origin/master" mean, is that normal?

mf94
  • It seems to me that you have several questions here, but by focusing specifically on the internal-to-Git file `.git/FETCH_HEAD`, you've made a duplicate of https://stackoverflow.com/q/9237348/1256452. I'm more concerned that you are headed towards a bogus mental model of what Git does here, though. – torek Jul 23 '18 at 15:47
  • I modified my question above to include more specifics, as I was not detailed enough. Indeed I am not talking about the .git/FETCH_HEAD, but rather an extra FETCH_HEAD, which is empty, and is in my Folder 2 directory, next to all my files (i.e. not in .git). I understand what you mean about the bogus mental model. Indeed, I wasn't too sure about whether making a "fake" collaborator folder was a good idea.. – mf94 Jul 23 '18 at 16:58

1 Answer

I'm not sure where this particular FETCH_HEAD file came from. As I noted in comments and the link, the .git/FETCH_HEAD file is how git fetch leaves tracks for git pull to run its second Git command (typically git merge but you can select git rebase instead). But that file is hidden away in .git—it should not appear in your work-tree.
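
To see where FETCH_HEAD normally lives, and to check whether a stray copy in the work-tree means anything to Git, a sketch along these lines may help. It uses a throwaway local bare repository in place of GitHub; the directory names (central.git, folder1, folder2) and the "Demo" identity are made up for illustration.

```shell
# Scratch setup: a bare "central" repository standing in for GitHub (assumption).
tmp=$(mktemp -d)
git init -q --bare "$tmp/central.git"
git --git-dir="$tmp/central.git" symbolic-ref HEAD refs/heads/master

# "Folder 1": make one commit and push it.
git init -q "$tmp/folder1"
cd "$tmp/folder1"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.com"
echo "first" > README.md
git add README.md
git commit -q -m "first commit"
git remote add origin "$tmp/central.git"
git push -q origin master

# "Folder 2": clone, then let Folder 1 push one more commit.
git clone -q "$tmp/central.git" "$tmp/folder2"
echo "second" >> README.md
git commit -q -am "second commit"
git push -q origin master

# In Folder 2, run the two halves of `git pull` by hand.
cd "$tmp/folder2"
git fetch -q origin master          # writes .git/FETCH_HEAD
cat .git/FETCH_HEAD                 # hash ID, branch name and URL of what was fetched
git merge -q --ff-only FETCH_HEAD   # the second step that `git pull` would run

# A FETCH_HEAD file next to your scripts is just an ordinary file to Git:
ls FETCH_HEAD 2>/dev/null || echo "no FETCH_HEAD in the work-tree"
git status --short                  # a stray copy would be listed as "?? FETCH_HEAD"
```

If a stray FETCH_HEAD shows up as untracked in git status, Git did not need it there, and deleting it should not affect the repository.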

(I'm afraid I had little time to work on this, so it's very long.)

A repository is mostly a collection (or database) of commits

If we put that aside, though, let's look at what goes in a Git repository. Remember, each repository is (at least in theory[1]) a complete, stand-alone copy of everything. Well, almost everything (we'll take a look at what's not shared in a bit), but each repository has a full copy of all of the history of the project. To define this properly, let's also note that in Git, the history is the commits, and the commits are the history. Commits are what Git keeps: a repository is made up of commits.

Each commit is, itself, a logically-complete snapshot of all of its files. That is, once we somehow name a commit, Git can extract the exact version of every file that we had Git save at the time we ran git commit. Each commit also has some metadata associated with it: the name and email address of the commit's author, for instance. Almost all commits—usually all but one commit, in fact—store, as part of this metadata, the name of their parent commit as well. This brings us to a key point.


[1] When you make a local clone (as opposed to one going over https:// or ssh:// or similar), Git will use various trickery to share the underlying repository storage. Normally it does even this in a way that's invisible: if you delete one of the two clones, the other remains intact. For power users or web providers like GitHub, Git allows even fancier sharing; in such cases, you need to know what you're doing, since sharing that much underlying storage means that there are some repositories that are more significant than others.


The name of a commit, or in fact any Git object, is a hash ID

When you run git log you will see commit hash IDs:

$ git log
commit e3331758f12da22f4103eec7efe1b5304a9be5e9 (HEAD -> master)
Author: Junio C Hamano ...

For a commit object, this hash ID is guaranteed to be unique to that particular commit. That hash ID is, in effect, the true name of the commit. It's the key that Git uses to look up the commit's data, in the database of Git objects. This database is essentially just a key-value store, with the keys being hash IDs, and the values being the object contents.

There are four types of Git objects: commits, which we've just seen, plus trees, blobs, and annotated tag objects. Two of these are not necessarily unique, but all four are identified by their hash IDs. The hash IDs appear random, but are actually cryptographic checksums of the raw object contents, including the object's type field. Since every commit is unique, Git guarantees that every hash ID will also be unique.[2] Git can also verify data integrity by comparing the computed checksum of any retrieved object to the hash ID key used to retrieve it: these must match, otherwise some data have been corrupted.

Because the key is a checksum of the contents, it's physically impossible to change any Git object once it is stored in the database. Changing anything, even a single bit, changes the checksum, resulting in a new and different key-value pair. This means that every commit and file stored inside the repository is entirely read-only: nothing inside them can ever be changed.
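
As a small illustration of the key being a checksum of the contents, git hash-object can compute the hash ID a blob would get without storing anything. The sample strings below are arbitrary.

```shell
# Identical content always yields the identical hash ID...
id1=$(printf 'hello world\n' | git hash-object --stdin)
id2=$(printf 'hello world\n' | git hash-object --stdin)
# ...while changing a single character yields a completely different one.
id3=$(printf 'hello World\n' | git hash-object --stdin)
echo "$id1"
echo "$id2"
echo "$id3"
```

The first two lines printed should match each other, and the third should share nothing obvious with them, which is why an object cannot be edited in place without its key changing.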


[2] If you know a lot about hashing, you know that this guarantee is mathematically impossible due to the Pigeonhole Principle. What Git really does here is make sure that collisions are ridiculously improbable, and then refuse to let you make an object that has a hash collision. See also How does the newly found SHA-1 collision affect Git?


A commit's contents are mostly redirected elsewhere

The object contents of a commit are actually remarkably simple. Here are the contents of e3331758f12da22f4103eec7efe1b5304a9be5e9, for instance:

$ git cat-file -p e3331758f12da22f4103eec7efe1b5304a9be5e9 | sed 's/@/ /'
tree 313f70847d0dab2718d19201b5be3af52061c4da
parent 085d2abf57be3e424cad0b7dc8c27fe41921258e
author Junio C Hamano <gitster pobox.com> 1530215747 -0700
committer Junio C Hamano <gitster pobox.com> 1530215747 -0700

Second batch for 2.19 cycle

Signed-off-by: Junio C Hamano <gitster pobox.com>

Once again, we see the commit's metadata—author name and so on—plus the parent line, which tells us the hash ID of the commit that goes before this commit. The snapshot itself is hidden away in a sub-object, via the tree line that lets Git find the commit's associated tree object.

The tree's contents are much more complicated, but we don't need to go into any details. It suffices to know that this is how Git stores the snapshot that goes with this commit. The tree names all the files, using recursion as appropriate, and gives Git the ability to retrieve each file's snapshot through a blob object. This means that, given either the commit hash ID, or the top level tree hash ID, Git can extract the complete snapshot.

The commit itself just gives us all the metadata: who made the commit, when; the log message they wrote for it; and the parent hash ID, if this is an ordinary, single-parent commit. The fact that each commit records its parent, though, gives us something else crucial.
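
The commit-to-tree-to-blob chain can be followed by hand in a scratch repository. This is only a sketch: the repository path, file name and "Demo" identity are invented for the example.

```shell
tmp=$(mktemp -d)
git init -q "$tmp/demo"
cd "$tmp/demo"
git config user.name "Demo"
git config user.email "demo@example.com"
echo "hello" > file.txt
git add file.txt
git commit -q -m "initial"

git cat-file -t HEAD                 # object type: commit
git cat-file -p HEAD                 # tree line, author, committer, message (no parent: a root commit)
tree=$(git rev-parse 'HEAD^{tree}')  # the hash ID on the commit's tree line
git cat-file -p "$tree"              # one entry per file: mode, type, blob hash ID, name
blob=$(git rev-parse HEAD:file.txt)  # the blob hash ID recorded for file.txt
git cat-file -p "$blob"              # the saved contents of the file
```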

Commits form chains

If we represent each commit using a single uppercase letter, instead of an apparently-random hash ID, we can draw ordinary commits pretty simply. For instance, in a small 3-commit repository, we would have this:

A  <--B  <--C

Commit C is the last commit we made. It stores the ID of commit B as its parent. Commit B stores the ID of commit A, and since commit A is the first commit we made, it has no parent at all. (Git calls this a root commit.)

Note that these chains always point backwards. Git needs to know, somehow, what the actual hash ID of commit C might be. This is where branch names enter the picture.

Branch names are really name-value pairs, acting as pointers to the last commit

To add master to our picture we just do this:

A--B--C   <-- master

The name master holds the actual hash ID of commit C. From here, Git can find B, which allows Git to find A. A has no parent, so the action stops: we have our three snapshots and we are all good.

To add a new commit, we start by having Git extract commit C somewhere. We use this to build up a new commit D, which stores C's hash ID as its parent; and then we have Git write D's hash ID into the name master:

A--B--C--D   <-- master

If we add a new branch name before we make D, our picture is basically the same:

A--B--C   <-- master, newbr

but now we need a way to remember which branch is the current branch, so we attach the word HEAD to one of these:

A--B--C   <-- master, newbr (HEAD)

Now if we make a new commit D, everything proceeds as before, but the name that Git updates is the one to which HEAD is attached, giving us:

A--B--C   <-- master
       \
        D   <-- newbr (HEAD)
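
The same picture can be reproduced in a scratch repository. The commit messages A to D mirror the letters in the drawings; the path and identity are placeholders.

```shell
tmp=$(mktemp -d)
git init -q "$tmp/demo"
cd "$tmp/demo"
git symbolic-ref HEAD refs/heads/master   # make sure the first branch is called master
git config user.name "Demo"
git config user.email "demo@example.com"
for c in A B C; do
    echo "$c" > file.txt
    git add file.txt
    git commit -q -m "$c"
done

git branch newbr                 # a new name pointing at the same commit as master
git rev-parse master newbr       # two identical hash IDs (commit C)
git checkout -q newbr            # attach HEAD to newbr
git symbolic-ref HEAD            # refs/heads/newbr

echo "D" > file.txt
git commit -q -am "D"
git rev-parse master newbr       # master still names C, newbr now names D
git rev-parse newbr~1            # D's parent, which is C
```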

A repository therefore contains two databases that fetch and push work with

The most crucial database is the one containing Git objects, especially commits. Commits are Git's lifeblood, its raison d'être. But to find the commits, Git needs a second key-value database, where the keys are names—branch and tag names, for instance—and the values are hash IDs.

These two databases are what git fetch and git push deal with. Both operations connect two Git repositories to each other. Fetch and push are very similar: both send or receive commits (and other Git objects—trees and blobs—as needed to make the commits complete), and then both update some set of names. The first obvious difference is the direction of transfer: git fetch takes commits from another Git into ours, while git push gives commits from our Git to another Git.

But there's another bit of asymmetry here. In our Git, we have both branch names, like master, and remote-tracking names, like origin/master. Where do these come from?

Branch names come from us creating them. We tell our Git: create the name newbr, pointing to commit C and it does so. We then tell our Git to make a new commit on the current (newbr) branch, and it does so. The name itself got created when we told our Git to create it. But what about master—when did we create that one? This, it turns out, is a little tricky; let's hold off on that for a moment.

Remote-tracking names, like origin/master, are things that our Git creates for us whenever it talks to another Git via the name origin. When we first run git clone url, that action—cloning some existing repository—tells our Git that, as soon as it has created a new empty repository (with no commits and no branches), it should call up another Git via the name origin and the URL we gave, and fetch from that Git all of its commits and branches and so on. Our Git then renames all of their branches: their master becomes our origin/master. If they have a newbr, their newbr becomes our origin/newbr.

These remote-tracking names are, to put it as simply as possible, our Git's way of remembering what their Git said their branches were. Specifically, they hold the hash IDs that go with the branch names on their Git, but renamed to our origin/* names. This means their branch names do not affect our branch names—at least, not yet.
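
A quick way to watch a remote-tracking name act as "memory" is to let one clone push while the other has not fetched yet. The sketch below again assumes a local bare repository standing in for GitHub, with made-up directory names.

```shell
tmp=$(mktemp -d)
git init -q --bare "$tmp/central.git"
git --git-dir="$tmp/central.git" symbolic-ref HEAD refs/heads/master
git init -q "$tmp/folder1"
cd "$tmp/folder1"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.com"
echo "one" > file.txt
git add file.txt
git commit -q -m "one"
git remote add origin "$tmp/central.git"
git push -q origin master
git clone -q "$tmp/central.git" "$tmp/folder2"

# Folder 1 pushes a second commit; Folder 2 has not been told yet.
echo "two" >> file.txt
git commit -q -am "two"
git push -q origin master

cd "$tmp/folder2"
git branch -r                          # the remote-tracking names in this clone
before=$(git rev-parse origin/master)  # still the first commit
git fetch -q origin
after=$(git rev-parse origin/master)   # now the second commit
mine=$(git rev-parse master)           # our own master is untouched by fetch
echo "before=$before after=$after mine=$mine"
```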

Push and fetch are not symmetric because push writes on their branch names

When we run git push origin newbr or git push origin master, though, we have our Git send their Git any commits that we have that they don't, and then we have our Git ask their Git to set their newbr or their master. Their repository, wherever it is, does not have a renaming scheme for incoming pushes. We just ask them to set their branches directly, based on whatever commit hash ID our master or newbr names (after we've given them those commits, and any earlier ones needed as well, of course).

When we fetch from them, we remember their branches using our remote-tracking names. That way we don't disturb our own branch names. But when we push to them, we just ask them to set their branches. Hence, while fetch and push are as close as we get to symmetric transfers, they're not the same.

Note that they can accept our request, or reject it. If they do accept our push, our Git will remember that their master or newbr has changed, by creating or updating our own origin/master or origin/newbr.
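
The effect of a push on both sides can be inspected with git ls-remote, which lists the other Git's names and hash IDs. As before, this is a sketch against a local bare repository with invented names.

```shell
tmp=$(mktemp -d)
git init -q --bare "$tmp/central.git"
git --git-dir="$tmp/central.git" symbolic-ref HEAD refs/heads/master
git init -q "$tmp/folder1"
cd "$tmp/folder1"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.com"
echo "one" > file.txt
git add file.txt
git commit -q -m "one"
git remote add origin "$tmp/central.git"
git push -q origin master

git checkout -q -b newbr
echo "two" >> file.txt
git commit -q -am "two"

git ls-remote origin                 # before: only HEAD and refs/heads/master
git push -q origin newbr
git ls-remote origin                 # after: refs/heads/newbr as well, under its own name

theirs=$(git ls-remote origin refs/heads/newbr | cut -f1)
ours=$(git rev-parse newbr)
tracking=$(git rev-parse origin/newbr)   # our memory of their newbr, updated by the push
echo "theirs=$theirs ours=$ours tracking=$tracking"
```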

There are a few more data items that are not transferred

Whenever we make any change to any of our branch names, remote-tracking names, tag names, or, well, any of our names in our name-to-hash-ID reference database, our Git keeps a log of these reference changes. This reference log or reflog of these old name-value pairs is, in effect, another database (or collection of databases) that our Git maintains. Values "fall off" the end of the log after a while, so that the reflogs don't grow without bound: by default there's a limit of 90 days for some reflog values, and 30 days for others.[3]
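
A reflog is easy to look at in a scratch repository. In this sketch (path and identity are placeholders) master is moved backwards with git reset, and the reflog still remembers where it used to point.

```shell
tmp=$(mktemp -d)
git init -q "$tmp/demo"
cd "$tmp/demo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.com"
echo "A" > file.txt
git add file.txt
git commit -q -m "A"
echo "B" >> file.txt
git commit -q -am "B"
b=$(git rev-parse master)

git reset -q --hard HEAD~1          # move master back to A; B is no longer on any branch
git reflog show master              # every value master has held, newest first
old=$(git rev-parse 'master@{1}')   # the value before the reset, i.e. commit B
echo "$old"

# The expiry times are configurable; when unset, Git's built-in defaults apply.
git config --get gc.reflogExpire || echo "gc.reflogExpire not set"
git config --get gc.reflogExpireUnreachable || echo "gc.reflogExpireUnreachable not set"
```

None of this log travels with git push or git fetch: a fresh clone starts with its own, nearly empty, reflogs.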

There are also a bunch of specially-named references,[4] such as ORIG_HEAD, MERGE_HEAD, CHERRY_PICK_HEAD, and so on, plus the special file FETCH_HEAD, all stored in the top level of the .git repository directory. None of these are transferred across fetch and push. However, we already noted the special role of the magic name HEAD (in all capitals, and yet another file in .git), in that our HEAD is "attached to" whichever branch Git considers to be the current branch.

What happens here is that during clone and fetch, the receiving Git can see what the sending Git's HEAD is set to. Git uses this on git clone to choose which branch name to hand to git checkout. The receiving Git can assume, or tell directly,[5] which branch the sending Git's HEAD names, and create a symbolic remote-tracking name origin/HEAD pointing to the correct remote-tracking name, such as origin/master. This is the arrow entry you saw in your git branch --all output; Git normally prints it as remotes/origin/HEAD -> origin/master.
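
In a fresh clone the symbolic name can be read back directly. The sketch below uses a local bare repository with invented names in place of GitHub.

```shell
tmp=$(mktemp -d)
git init -q --bare "$tmp/central.git"
git --git-dir="$tmp/central.git" symbolic-ref HEAD refs/heads/master
git init -q "$tmp/folder1"
cd "$tmp/folder1"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.com"
echo "one" > file.txt
git add file.txt
git commit -q -m "one"
git remote add origin "$tmp/central.git"
git push -q origin master

git clone -q "$tmp/central.git" "$tmp/folder2"
cd "$tmp/folder2"
git branch --all
# * master
#   remotes/origin/HEAD -> origin/master
#   remotes/origin/master
git symbolic-ref refs/remotes/origin/HEAD   # refs/remotes/origin/master
git rev-parse origin/HEAD origin/master     # the same hash ID twice
```

Folder 1 in the question was created with git init rather than git clone, which would explain why it has no such arrow entry.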


[3] The key difference between these is whether the hash ID stored in the reflog entry in question is reachable from the current value of the corresponding reference. This reachability concept is another key Git concept. For much more about this, see Think Like (a) Git.

[4] It's debatable whether these special names count as references. Except for HEAD, none of them have reflogs. Git says that a reference is any name whose fully-expanded form starts with refs/, but HEAD has a reflog and does not start with refs/, so is HEAD a reference? Git is a little conflicted on this one: some parts say yes, some parts say no.

[5] This depends on the age / version of both Git installations. Proper symbolic HEAD support has been around since Git version 1.8.4.3.


Other key items to understand: index and work-tree

All of the above has been concerned with commits and Git's object database, and the references (branch and tag names and so on) database. We also noted that files stored inside commit snapshots are in a special Git-specific object format, in which they are read-only. In this format, they get compressed (sometimes highly compressed).

In order to work on a Git repository, however, you need two more items that Git creates for you:

  • Git has a key data structure that it calls, variously, the index, the staging area, or sometimes the cache. This data structure—which is mostly just a file, .git/index—holds, indirectly, a copy of every file in the current commit. These files are in the same highly-compressed Git object format. Crucially, however, they can be overwritten with new (compressed) files.

  • In order to let you actually view and work on your files, Git has to uncompress them into your computer's ordinary format. It puts these files into your work-tree, which is where you do your work. Files in your work-tree are not in a commit, but the initial versions of whatever got into your work-tree will have come out of a commit (through the index / staging-area).
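
The three copies of a file (in the commit, in the index, and in the work-tree) can be told apart with git ls-files --stage, which prints the blob hash ID currently held in the index. The repository path and file name here are invented for the sketch.

```shell
tmp=$(mktemp -d)
git init -q "$tmp/demo"
cd "$tmp/demo"
git config user.name "Demo"
git config user.email "demo@example.com"
echo "v1" > file.txt
git add file.txt
git commit -q -m "v1"

git ls-files --stage                      # mode, blob hash ID, stage number, path
staged_before=$(git ls-files --stage file.txt | cut -d' ' -f2)

echo "v2" > file.txt                      # change only the work-tree copy
staged_mid=$(git ls-files --stage file.txt | cut -d' ' -f2)    # index copy is unchanged

git add file.txt                          # overwrite the index copy with the new contents
staged_after=$(git ls-files --stage file.txt | cut -d' ' -f2)  # a new blob hash ID

committed=$(git rev-parse HEAD:file.txt)  # the commit itself still holds the old blob
echo "index before=$staged_before after=$staged_after commit=$committed"
```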

Running git checkout commit-id tells Git to extract the given commit's files, into the index (so that the index now matches the commit), and then on into the work-tree (so that you can view and/or change the files). This results in what Git calls a detached HEAD, where the special name HEAD no longer contains the name of a branch. Instead, HEAD contains the raw commit hash ID.

This particular mode of working is fine as far as it goes, but it means that when you create a new commit, the new commit's hash ID does not get recorded into a branch name. For this reason, git checkout name works differently when name is a branch name: it writes the name into the HEAD file, then extracts the tip commit, as stored under that branch name, so that new commits update the branch.
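
The difference between the two modes is visible in the HEAD file itself. A sketch in a scratch repository (placeholder path and identity):

```shell
tmp=$(mktemp -d)
git init -q "$tmp/demo"
cd "$tmp/demo"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.com"
echo "A" > file.txt
git add file.txt
git commit -q -m "A"
echo "B" >> file.txt
git commit -q -am "B"
first=$(git rev-parse HEAD~1)

git checkout -q "$first"                       # check out by raw hash ID
git symbolic-ref -q HEAD || echo "HEAD is detached"
detached=$(cat .git/HEAD)                      # the raw commit hash ID
echo "$detached"

git checkout -q master                         # check out by branch name
attached=$(cat .git/HEAD)                      # ref: refs/heads/master
echo "$attached"
```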

When you first clone a repository from somewhere, you have no branch names at all. The last step of git clone is to run git checkout name, where name is typically master (but as we saw, comes from the other Git). But you don't have a master yet.

At this point, git checkout does a special thing: it looks through all of your remote-tracking names, to see if there's an origin/master. If there is exactly one such name—and of course there is; your Git just copied their master to your origin/master—your Git now creates a new branch name master in your own repository, pointing to the same commit that your origin/master says their master pointed to.

So that's how it is that you have a master: your Git created it as the last step of your git clone.
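
The same mechanism is what lets a collaborator pick up a branch you pushed. In this sketch (local bare repository, invented names) Folder 1 pushes newbr, and Folder 2 gets its own newbr simply by fetching and checking out that name.

```shell
tmp=$(mktemp -d)
git init -q --bare "$tmp/central.git"
git --git-dir="$tmp/central.git" symbolic-ref HEAD refs/heads/master
git init -q "$tmp/folder1"
cd "$tmp/folder1"
git symbolic-ref HEAD refs/heads/master
git config user.name "Demo"
git config user.email "demo@example.com"
echo "one" > file.txt
git add file.txt
git commit -q -m "one"
git remote add origin "$tmp/central.git"
git push -q origin master
git clone -q "$tmp/central.git" "$tmp/folder2"

# Folder 1 creates and pushes a branch.
git checkout -q -b newbr
echo "two" >> file.txt
git commit -q -am "two"
git push -q origin newbr

# Folder 2 has no newbr of its own until it checks the name out.
cd "$tmp/folder2"
git fetch -q origin                       # creates origin/newbr
git branch --list newbr                   # prints nothing: no local newbr yet
git checkout -q newbr                     # created from origin/newbr
upstream=$(git rev-parse --abbrev-ref 'newbr@{upstream}')
echo "$upstream"                          # origin/newbr
```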

torek
  • Thanks @torek, that was really useful! I think I understand most of this. One thing I thought about later: perhaps my weird extra FETCH_HEAD file is due to the fact that my authentication key, used to push and pull to my GitHub, is the same for Folder 1 and Folder 2, since they are both on my laptop. So perhaps that is what caused this weird file. In any case, I think I understand the main concepts you explained here. Thanks! – mf94 Jul 24 '18 at 16:20