How does git work on a technical level to allow a single file to exist in two states at once?

Question

I am very new to git and github and am trying to wrap my head around all the different functionalities of this program.

I am currently exploring making new branches in my local repo, pushing those branches to my remote repo, and switching between branches in my local repo. I've encountered a property that is both confusing and really interesting, and I was hoping someone could clarify how it works.

I am working with a single html file. On my local repo, I created a new branch, I checked out the branch, and I opened this file within the branch and made some edits. Then, I went back to my other branch, opened up the same file, and as expected, the edits I made were not there (since they exist on the other branch, and not the one I am currently on). I understand this on a conceptual level (you make changes to a file on one branch, obviously these changes won't be present on the other branch unless you merge them). BUT, what I am confused by is that on my machine, I only have a single copy of this file... but somehow this file simultaneously exists as two different versions on my machine. It's a property that once again, I understand on an abstract level, but I would appreciate an explanation of how files can have this property.

You might find [this article](https://maryrosecook.com/blog/post/git-from-the-inside-out), [this article](https://wyag.thb.lt/) and/or [this article](http://gitlet.maryrosecook.com/docs/gitlet.html) useful — gman, May 24 '20 at 18:08
https://stackoverflow.com/a/8198276/7976758 found in https://stackoverflow.com/search?q=%5Bgit%5D+how+does+store+files — phd, May 24 '20 at 18:12
And my favorite [how it works article](https://tom.preston-werner.com/2009/05/19/the-git-parable.html) — gman, May 24 '20 at 18:14

score 4 · Answer 1 · answered May 24 '20 at 18:56

> on my machine, I only have a single copy of this file

But you don't.

Git remembers all the versions of the file (git add literally adds a snapshot to the repository) and it puts whichever one you want to see in your filesystem at the expected place on demand.

jthill


And it's worth pointing out that git *logically* stores every version of every file, but *physically* it does a lot of optimization behind the scenes. If it actually stored full copies of every version, it would be impractical. – Keith Thompson May 24 '20 at 23:31
True enough, it starts with full snapshots and compresses in stages: cheap-n-prettygood zlib compression on add, and when it looks like there are big wins available it'll `git repack` with some industrial-strength compression, including lots of scope for finding deltas from history (not just the previous version, and not just that one file). – jthill May 25 '20 at 01:44

torek · Answer 2 · 2020-05-30T13:17:25.303

As jthill said, your working tree or work-tree has only one copy of the file. Git has, in its commits, every copy of the file: each commit has one copy of each file. The copies are de-duplicated, in a clever manner that depends on the fact that nothing in Git, once committed, can ever be changed. So the files inside commits are frozen for all time, along with the rest of the commit (there's a bit of stuff besides just the files).
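Here is a minimal Python sketch of why that de-duplication falls out almost for free. It assumes the default SHA-1 object format, in which a file's hash ID is the SHA-1 of a short `blob <size>` header plus the contents. The `ObjectStore` class and its `add` method are made up for illustration; they stand in for Git's object database and leave out compression and packing entirely.

```python
import hashlib

def blob_id(content: bytes) -> str:
    """Hash ID Git gives to file contents (assuming the SHA-1 object format)."""
    header = b"blob " + str(len(content)).encode() + b"\0"
    return hashlib.sha1(header + content).hexdigest()

class ObjectStore:
    """Hypothetical stand-in for Git's object database: hash ID -> frozen contents."""

    def __init__(self):
        self.objects = {}

    def add(self, content: bytes) -> str:
        oid = blob_id(content)
        # Same contents give the same ID, so a repeat is stored only once.
        self.objects.setdefault(oid, content)
        return oid

store = ObjectStore()
first = store.add(b"hello\n")
second = store.add(b"hello\n")         # unchanged file in a later commit
edited = store.add(b"hello world\n")   # edited file gets a different ID
print(first == second, len(store.objects))
```

Because the ID is computed from the contents, changing a stored file would change its ID, which is one way to see why committed files have to stay frozen.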

More precisely, each commit has a full snapshot of the files that you had told Git to put into that commit, at the time you made that commit. Or, if it wasn't you that made the commit, insert some other actor as the person invoking Git commands.

These committed files are in the repository, contained by right of being stored inside each commit. But the files that you see and work with, in your work-tree, are not in Git at all. I think it helps, conceptually, if you think of the work-tree files as yours: you are responsible for these files. The files in commits—the ones in each commit snapshot, made when you or whoever ran git commit—are the responsibility of Git.

Once you have this in your head—that Git is just copying one set of its files out of a commit, over top of your files—a lot of things fall into place. The remaining rather large surprise is that in an important way, branches don't matter. What matters in Git are, always, the commits. The branch names like master or develop are just one way of finding specific commits.
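To make the copying concrete, here is a toy Python model of it. Everything in it is an assumption made for illustration: the commit IDs `c1` and `c2`, the file contents, and the `checkout` function are invented, and real Git also checks for uncommitted changes and leaves untracked files alone, which this sketch ignores.

```python
# Git's side: frozen snapshots, keyed by (made-up, shortened) commit IDs.
commits = {
    "c1": {"index.html": "<h1>Hello</h1>\n"},
    "c2": {"index.html": "<h1>Hello, edited</h1>\n", "notes.txt": "todo\n"},
}
# Names are just a way of finding commits.
branches = {"master": "c1", "dev": "c2"}

def checkout(work_tree: dict, branch: str) -> None:
    """Replace the work-tree files with the snapshot that the branch name finds."""
    snapshot = commits[branches[branch]]
    work_tree.clear()
    work_tree.update(snapshot)   # copy out; the committed snapshot is untouched

work_tree = {}                   # your side: the one copy you see and edit
checkout(work_tree, "dev")
on_dev = dict(work_tree)
checkout(work_tree, "master")
on_master = dict(work_tree)
print(on_dev["index.html"], on_master["index.html"])
```

There is only ever one `index.html` in `work_tree`, but both versions remain stored in `commits` the whole time.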

When you clone a repository, or use git push or git fetch,¹ you're asking your Git to connect to some other Git. So there are multiple copies of each repository. These repositories share commits—by copying them—but they need not share their branch names at all. That's OK, because it's the commits that matter, not the branch names.
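A rough Python sketch of that sharing, under heavy simplification: each repository is modeled as a plain dictionary, and the hypothetical `fetch` function copies over whatever commits are missing, then records where the other Git's branches point under names of the form `origin/master` (the remote-tracking names discussed further down). The single-letter commit IDs and the `feature` branch are invented for the example.

```python
def fetch(local: dict, remote: dict, remote_name: str = "origin") -> None:
    """Copy commits we lack, then remember where the other Git's branches point."""
    for commit_id, commit in remote["commits"].items():
        local["commits"].setdefault(commit_id, commit)
    for branch, commit_id in remote["branches"].items():
        # Stored under a remote-tracking name; our own branch names are untouched.
        local["remote_tracking"][f"{remote_name}/{branch}"] = commit_id

theirs = {
    "commits": {"G": {"parent": None}, "H": {"parent": "G"}},
    "branches": {"master": "H"},
}
mine = {
    "commits": {"G": {"parent": None}},
    "branches": {"feature": "G"},
    "remote_tracking": {},
}
fetch(mine, theirs)
print(sorted(mine["commits"]), mine["branches"], mine["remote_tracking"])
```

After the call both repositories hold commit H, yet my only branch name is still `feature` and theirs is still `master`.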

¹Don't think of git pull as the opposite of git push, because it's not. Think of fetch and push as the two opposites. Well, OK: they're as close as Git gets to opposites here. Mercurial got this particular terminology right (in Mercurial, pull does what fetch does in Git) and Git just sort of got it backwards.

Branch names don't matter, except to humans

The real name of a commit is its hash ID. To see the hash ID of some commit, use git rev-parse, whose job is to turn a name into a hash ID:²

$ git rev-parse master
b994622632154fc3b17fb40a38819ad954a5fb88
$ git rev-parse origin/maint
af6b65d45ef179ed52087e80cb089f6b2349f4ec

These hash IDs are how Git finds commits—at least, some specific commit that we humans might care about right now. The name master is specifically a branch name, while the names origin/maint or origin/master aren't branch names. But all of these names locate some commit. Sometimes, more than one name locates the same commit:

$ git rev-parse origin/master
b994622632154fc3b17fb40a38819ad954a5fb88

This is the same hash ID that I got for my master here. That's no coincidence: the Git repository I cloned has a master branch, and the last time I talked with that Git repository—a few weeks ago at this point—they had their master set up to remember commit b994622632154fc3b17fb40a38819ad954a5fb88. So I told my Git that it should remember b994622632154fc3b17fb40a38819ad954a5fb88 under my name master, too.
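As a small sketch, you can picture the names as a lookup table from name to hash ID. The full names below follow Git's convention of keeping branch names under `refs/heads/` and remote-tracking names under `refs/remotes/`; the `rev_parse` function is a simplified stand-in written for this example, since the real command tries more places than these two.

```python
# Name -> hash ID, using the hash IDs shown above.
refs = {
    "refs/heads/master": "b994622632154fc3b17fb40a38819ad954a5fb88",
    "refs/remotes/origin/master": "b994622632154fc3b17fb40a38819ad954a5fb88",
    "refs/remotes/origin/maint": "af6b65d45ef179ed52087e80cb089f6b2349f4ec",
}

def rev_parse(name: str) -> str:
    """Resolve a short name: try branch names first, then remote-tracking names."""
    for prefix in ("refs/heads/", "refs/remotes/"):
        full = prefix + name
        if full in refs:
            return refs[full]
    raise KeyError(f"unknown revision: {name}")

print(rev_parse("master") == rev_parse("origin/master"))
```

Two different names, one commit: nothing about the commit itself records which name was used to find it.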

Whenever you use branch names in Git, you're telling Git: Remember this commit hash ID under this name. The special property of a branch name—different from a remote-tracking name like origin/master,³ for instance—is that if you use git switch or git checkout to select its commit, something special happens:

$ git switch dev
Switched to branch 'dev'
$ git switch master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.

If you pick a non-branch name, git switch complains while git checkout puts you into detached HEAD mode:

$ git switch origin/master
fatal: a branch is expected, got remote branch 'origin/master'
$ git checkout origin/master
Note: switching to 'origin/master'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
...
HEAD is now at b994622632 The eighth batch

Note that git switch, which is a more-user-friendly command, allows you to get into detached HEAD mode the same way, but only on purpose: you have to add --detach to the command. Detached HEAD mode has its uses, but everyday work is not one of them, so it's wise to get back in a branch, for your own mental health:

$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.

and we're back in the happier state in which Git will remember hash IDs for us, using our branch names. If you don't have Git remember them for you, you will have to memorize these hash IDs, and that is no fun at all.
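The difference between the two states is visible in the `HEAD` file inside the `.git` directory: when attached it holds a line such as `ref: refs/heads/master`, and when detached it holds a raw hash ID. The Python function below is only a sketch that classifies such contents; `describe_head` is a made-up name, not a Git command.

```python
def describe_head(head_contents: str) -> str:
    """Say whether HEAD is attached to a branch name or holds a raw hash ID."""
    text = head_contents.strip()
    prefix = "ref: refs/heads/"
    if text.startswith(prefix):
        return "attached to branch " + text[len(prefix):]
    return "detached at " + text

print(describe_head("ref: refs/heads/master\n"))
print(describe_head("b994622632154fc3b17fb40a38819ad954a5fb88\n"))
```

In the attached case a new commit has a name to update; in the detached case there is no name, which is why you would be left holding the hash ID yourself.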

²Well, that's one of its jobs. Git has a tendency to load too many jobs into too-few commands. That's why Git 2.23 and later have git switch and git restore, while earlier versions of Git jam both commands into git checkout.

³About these origin/* names: git switch calls origin/master a remote branch, but that is a terrible name. Git documentation calls it a remote-tracking branch name, which is slightly better. I use the phrase remote-tracking name, to try to get away from the word branch, which is way too overused in Git. The real key here is to remember that while it's a name, it's not a branch name in the sense that you can't git switch to it.

Commits remember previous commit hash IDs

The last piece to this particular puzzle is a clever (and/or sneaky) trick. If a commit has a hash ID—and it does—and if that hash ID is how Git finds the commit—and it is—then what happens if we have every new commit we make, remember the raw hash ID of the commit that comes just before it?

That is, suppose we have a string of commits like this, except that they have real hash IDs instead of single uppercase letters:

... <-F <-G <-H

Here H stands in for the real hash ID of the latest commit. Let's have Git remember the actual hash ID, using the branch name master, like this:

... <-F <-G <-H   <--master

We say that the name master points to commit H. But we told Git, when we made H, that Git should have commit H remember the hash ID of commit G! So given that we're working with commit H right now, Git can just look up the hash ID of G using commit H itself. Commit H points to earlier commit G.

Of course, earlier commit G points to even-earlier commit F, and so on, all the way back to the very first commit. That commit doesn't point backwards, because it can't, so that's where Git gets to stop and rest. Otherwise, if you start Git with the name master, Git will find H, then use that to find G and then F and E and so on all the way back to the first commit A:

A--B--C--D--E--F--G--H   <-- master (HEAD)

which is our repository with eight total commits, all in one line.
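The backwards walk can be sketched in a few lines of Python. The `parents` table plays the role of the hash ID stored inside each commit, with the single letters standing in for real hash IDs as above; `history` is a hypothetical helper, and it assumes a straight line of commits (merge commits, which have more than one parent, are left out).

```python
# Each commit remembers the commit just before it; A is the very first commit.
parents = {"A": None, "B": "A", "C": "B", "D": "C",
           "E": "D", "F": "E", "G": "F", "H": "G"}
branches = {"master": "H"}

def history(branch: str) -> list:
    """Start at the commit the name finds and follow parent links backwards."""
    chain = []
    commit = branches[branch]
    while commit is not None:
        chain.append(commit)
        commit = parents[commit]
    return chain

print(history("master"))   # ['H', 'G', 'F', 'E', 'D', 'C', 'B', 'A']
```

The only thing the name supplies is the starting point; every other commit is found through the commits themselves.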

Branches

Let's say we have this structure at the moment:

...--G--H   <-- master (HEAD)

If we now create a new branch name, but let it signify commit H too, we get:

...--G--H   <-- dev, master (HEAD)

We can now attach the special name HEAD to either branch name. It doesn't matter which name we use because both mean commit H: the files we see in our work-tree will be the same either way. But let's switch to dev, with git switch dev or git checkout dev:

...--G--H   <-- dev (HEAD), master

Now let's make a new commit, in the usual way.⁴ This new commit gets a new, unique hash ID, which is big and ugly and unpredictable;⁵ but let's just call it I.

New commit I automatically points back to existing commit H:

...--G--H
         \
          I

and now Git pulls its really-sneaky trick: git commit writes the new hash ID into the name dev, because that's the name HEAD is attached-to. So the branch name dev moves, giving us:

...--G--H   <-- master
         \
          I   <-- dev (HEAD)

Note how the name master still selects commit H, while the name dev now selects commit I. If we make another new commit here we get:

...--G--H   <-- master
         \
          I--J   <-- dev (HEAD)

Git will now find commit J using the name dev, and find commit I using commit J. Git has two ways to find commit H: the name master finds it directly, and dev finds it after traversing two hops backwards, from J to I to H.
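The trick is small enough to sketch in Python. This is a toy model, not how Git is implemented: `parents`, `branches`, `head` and the `commit` function are invented for the example, and the snapshot and metadata that a real commit carries are left out.

```python
parents = {"H": "G"}                     # earlier history omitted
branches = {"master": "H", "dev": "H"}
head = "dev"                             # HEAD is attached to this branch name

def commit(new_id: str) -> None:
    """Make a commit whose parent is the current tip, then move the attached name."""
    parents[new_id] = branches[head]     # new commit points back at the old tip
    branches[head] = new_id              # the name HEAD is attached to now finds it

commit("I")
commit("J")
print(branches)   # {'master': 'H', 'dev': 'J'}
```

Nothing touched `master`, so it still finds H, while `dev` was rewritten twice.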

In Git, the commits up through H are on both branches. Commits I and J are only on dev. If I and/or J contain files that H doesn't, switching from dev back to master will remove these files from your work-tree: you told Git: set up my work-tree based on commit H, and it does that. Switching from master to dev brings the files back, because you told Git: set up my work-tree based on commit J.
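Being "on" a branch here just means being reachable by walking backwards from the branch's name. A short Python sketch, with `commits_on` as a made-up helper and the history before G cut off for brevity:

```python
parents = {"G": None, "H": "G", "I": "H", "J": "I"}   # history before G omitted
branches = {"master": "H", "dev": "J"}

def commits_on(branch: str) -> set:
    """All commits reachable by walking backwards from the branch's tip."""
    found = set()
    commit = branches[branch]
    while commit is not None:
        found.add(commit)
        commit = parents[commit]
    return found

on_both = commits_on("master") & commits_on("dev")
only_dev = commits_on("dev") - commits_on("master")
print(sorted(on_both), sorted(only_dev))   # ['G', 'H'] ['I', 'J']
```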

If we go back to commit H and create and switch to a new name topic, we get:

...--G--H   <-- master, topic (HEAD)
         \
          I--J   <-- dev

and now we can create new commits as usual:

          I--J   <-- dev
         /
...--G--H   <-- master
         \
          K--L   <-- topic (HEAD)

⁴I've just glossed completely over the complicated way Git makes new commits, which involves Git's index. I won't go into details in this answer, though.

⁵Technically, if we know:

- what source files, exactly, will be in the snapshot (all their names and contents);
- what metadata you'll give Git—your name, email address, and so on, and the log message you will use; and
- the hash ID H and the exact date and time at which you will make new commit I;

then we could predict what the actual hash ID of commit I will be. But how will we predict all of these? So we might as well think of I as being "random".
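For the curious, here is a Python sketch of that prediction, assuming the SHA-1 object format and a plain commit with one parent and no extra headers such as signatures. The identity, timestamps and message are invented; the tree value is the well-known hash ID of an empty snapshot, used only as a placeholder, and the parent is the hash ID from earlier in this answer standing in for H.

```python
import hashlib

def commit_id(tree: str, parent: str, who: str, when: str, message: str) -> str:
    """SHA-1 of a commit object built from snapshot, parent, metadata and time."""
    body = (f"tree {tree}\n"
            f"parent {parent}\n"
            f"author {who} {when}\n"
            f"committer {who} {when}\n"
            f"\n{message}\n").encode()
    header = b"commit " + str(len(body)).encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"   # snapshot with no files
H = "b994622632154fc3b17fb40a38819ad954a5fb88"            # stand-in for commit H
WHO = "A U Thor <author@example.com>"                     # made-up identity

i_now = commit_id(EMPTY_TREE, H, WHO, "1590844645 +0000", "add page")
i_again = commit_id(EMPTY_TREE, H, WHO, "1590844645 +0000", "add page")
i_later = commit_id(EMPTY_TREE, H, WHO, "1590844646 +0000", "add page")
print(i_now == i_again, i_now == i_later)   # True False
```

Identical inputs give the identical hash ID, but making the same commit one second later gives a completely different one.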

Draw graphs!

I flipped dev to the top row just so that the "bigger letters" K and L would be on the bottom. You can draw the graph any number of ways, as long as the connections from commit to commit, the backwards links from J to I and the like, are still drawn and as long as you label the correct commits with the correct names. You can leave out some names, and some commits—like the ones before G—when they just clutter up the drawing.

Whatever you do, though, it's really good exercise to draw a bunch of graphs—on paper, on a whiteboard, or whatever. When you do this you'll notice things, like:

- Branch names find the last commit in a chain. Git calls this the tip commit of the branch.
- The arrows all go backwards. Git has to start at the end, and work backwards.
- If a chain has no name for its last commit, Git can't find any of it.
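The last point can be checked with a small Python sketch of the three-branch graph drawn earlier. As before, this is a toy model: `findable` is a hypothetical helper, the letters stand in for hash IDs, and history before H is omitted.

```python
parents = {"H": None, "I": "H", "J": "I", "K": "H", "L": "K"}  # before H omitted
names = {"master": "H", "dev": "J", "topic": "L"}

def findable(names: dict) -> set:
    """Every commit reachable by starting at some name and walking backwards."""
    found = set()
    for commit in names.values():
        while commit is not None and commit not in found:
            found.add(commit)
            commit = parents[commit]
    return found

before = findable(names)
del names["topic"]               # drop the only name for the K--L chain
lost = before - findable(names)
print(sorted(lost))              # ['K', 'L']
```

Commit H survives because `master` and `dev` still lead to it; K and L had only `topic`.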

Knowing these things leaves you in a good position for learning all the other mysteries of Git, such as how git merge and git rebase work.

Okay first question as I read through this -- you say not to think of git fetch and git push as opposites; actually, that's not what comes to mind when I think of opposites. I think of git push and git pull as being opposites. Is this accurate? Also what is the difference between fetch and pull? — Brenda Thompson, May 30 '20 at 02:06
Second question -- I see that you really don't like the term branch, and it is really interesting to learn that branches don't really matter. But, it is confusing to me, because I thought the point of a branch was to store a sequence of commits which differ from another sequence of commits. And also, isn't it possible to create an empty branch? If so, is an empty branch basically just a hash number that acts like a vacant space for future commits? — Brenda Thompson, May 30 '20 at 02:15
Third question - you say that "We can now attach the special name HEAD to either branch name. It doesn't matter which name we use because both mean commit H: the files we see in our work-tree will be the same either way." but the files we see won't be the same -- we will see the copy of the file in branch A, and if we switch to branch B, we will see the branch B file copy. Correct? Or is a new file copy only created after a commit? So File A on Branch A exists, then we create Branch B; is a new file copy created at this time? Or is Branch B just a name now associated with file A? — Brenda Thompson, May 30 '20 at 04:32
1: Fetch and push are as close as Git gets to opposites. The Mercurial command `hg pull` is spelled `git fetch` in Git, and the Mercurial command `hg push` is spelled `git push` in Git. So in Mercurial, the two words are used correctly. But in Git, `git pull` means *run `git fetch`, then run a second Git command*. There is no second Git command with `git push`. So that's where Git went wrong in assigning *meanings* to the two words, and is why `fetch`, not `pull`, is the opposite of `push`. — torek, May 30 '20 at 13:20
2: It's not so much that I don't like the term, as that the word itself, in Git, is ambiguous. Different people use the same word to mean different things, and thereby confuse each other. One person will use the same word to mean up to three different things. Imagine you're at a party and everyone's name is Bruce. Bruce tells you that Bruce went to fetch Bruce, who's with Bruce at Bruce's house, but Bruce is in the kitchen if you're looking for Bruce. Who did what? Who is where? (And no, there's no such thing as an empty branch in Git, regardless of which of the meanings you use.) — torek, May 30 '20 at 13:23
3: No, there's only one *committed* copy of the file: it is in commit H. If name A refers to H, and name B refers to H, you can use either *name* and you get the copy that is in H itself. The *names* in this case are just ways to refer to the hash ID. — torek, May 30 '20 at 13:24
