What you really need here is a good tutorial or book and a week or more of time. Without these, though, here's an overly-rapid launch into what to know about Git and how to use `git merge`.
What to know when getting started with Git
Git is really all about commits. It's not about files, though commits do contain files, and you will care a lot about your files. It's not about branches either, though branch names help you (and Git) find commits. To Git, though, almost everything is about the commits. This means you need to know what a commit is and does for you.
Each Git commit is numbered, but the numbers are weird, and at least mildly poisonous to human brains. If and when you need to use them, you'll generally want to use cut-and-paste with your mouse or some such. Still, remember that Git is using these numbers. That's how Git finds the commits: by their number. Each commit has a globally unique hash ID: a GUID (Globally Unique ID) or UUID (Universally Unique ID), whatever you'd like to call it. This number is unique to that particular commit, such that every time you make a new commit, it gets a new number that nobody else, anywhere, ever, is allowed to use.[1] That means that two different pieces of Git software, working with two different repositories, can immediately tell if they have the same commit just by comparing the number. If the numbers match, the commits are identical. If not, they're different.
This means no commit can ever change, either: not one bit. So everything inside a commit lasts forever, or at least, as long as the commit itself continues to exist. But what's in a commit?
A commit contains a full snapshot of all of your files. More precisely, it has a full snapshot of the files that Git knew about at the time you (or whoever) made the commit. These files are stored in a special, compressed, read-only, Git-only, and de-duplicated format, which keeps the repository from getting tremendously fat even though every commit stores every file: if some file is a duplicate, it's only stored once, and if an entire commit is nothing but duplicates—this can happen in various ways—the files are stored in zero bytes of storage (though the commit itself still needs a few bytes).
A commit also contains some metadata, or information about the commit itself. This includes the name and email address of whoever made the commit (actually two such). It includes a date-and-time stamp (actually two again). It includes a log message, where you get to explain to your future self why you made the commit. (Note that it's often helpful to go back and write these again later, which is one place rebase comes in. See also this XKCD.)
Now, inside the metadata, Git adds something that Git needs: a list of the raw hash IDs (the GUIDs or UUIDs) of earlier commits. Most commits store exactly one such hash ID. This results in a simple backwards chain of commits, where each commit holds the hash ID of the commit that comes before it. We say that these later commits point to the earlier ones, and if we use single uppercase letters to stand in for raw hash IDs, and call the most recent commit we just made "commit `H`", we get a picture like this one:
... <-F <-G <-H
Commit `H` contains a full snapshot of all the files Git knew about, plus the metadata saying that we made the commit, when we made it, and so on. Commit `H`'s metadata makes commit `H` point backwards to earlier commit `G`, which also contains a snapshot and metadata.
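To see the metadata and the backwards-pointing parent for yourself, here is a minimal sketch. It assumes a throwaway repository in a temporary directory, and the author name, email address, and file name are made up for illustration. `git cat-file -p` prints the raw contents of a commit.

```shell
set -eu
repo=$(mktemp -d)                        # throwaway directory (assumption of this sketch)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main    # call the first branch "main" on any Git version
git config user.name "Example Author"    # hypothetical identity
git config user.email "author@example.com"

echo "first" > file.txt
git add file.txt
git commit -q -m "commit G"
echo "second" >> file.txt
git add file.txt
git commit -q -m "commit H"

# Raw contents of the latest commit: tree (the snapshot), parent,
# author and committer (name, email, timestamp), then the log message.
git cat-file -p HEAD

# The "parent" line in H's metadata is exactly G's hash ID.
parent_in_metadata=$(git cat-file -p HEAD | sed -n 's/^parent //p')
hash_of_g=$(git rev-parse HEAD~1)
echo "parent stored in H: $parent_in_metadata"
echo "hash ID of G:       $hash_of_g"
```

The two hash IDs printed at the end should match, which is all that "H points to G" means.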
Git can now extract all the files from both `G` and `H` and compare those files. For those that are the same, Git can say nothing at all (and the de-duplication makes it really easy for Git to tell which files those are). For the files that are different in `G` vs `H`, Git can compare the contents and work out what changed, as a sort of Spot the Difference game, and then show us this difference only, rather than having to show us two whole versions of the file.
This trick of showing a diff lets `git log -p` show us commit `H` by:
- printing its raw hash ID;
- showing its metadata, to say we made it yesterday or whenever; and
- showing what changed in `H`, even though it's a full snapshot.
Then `git log` can step back one hop to commit `G`. Since `G` is a commit, it has a snapshot and metadata and it points backwards to earlier commit `F`, so `git log` can show us commit `G` the same way, by "diff"-ing the snapshots in `F` and `G`. And then `git log` will move back one hop to commit `F`, which is a commit and therefore has a snapshot and metadata, and so on.
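Here is a small sketch of that snapshot-to-diff trick, again in a throwaway repository with made-up file names and identity. Two commits each store both files in full, yet comparing them reports only the one file that differs.

```shell
set -eu
repo=$(mktemp -d)                        # throwaway directory (assumption of this sketch)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # hypothetical identity
git config user.email "author@example.com"

echo "line one" > a.txt
echo "unchanged" > b.txt
git add a.txt b.txt
git commit -q -m "commit G"
echo "line two" >> a.txt
git add a.txt
git commit -q -m "commit H"

# Compare the snapshot in G (HEAD~1) with the snapshot in H (HEAD).
# b.txt is identical in both, so only a.txt is listed.
changed=$(git diff --name-only HEAD~1 HEAD)
echo "files that differ between G and H: $changed"

# Walk backwards from H, showing hash ID, metadata, and a diff for each commit.
git --no-pager log -p
```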
What this shows is that the commits alone get us most of the way. But to get started, `git log` had to know to start with commit `H`. How will Git do that? We could memorize the hash IDs, and type them in for Git, but that's a bad idea. We could save them in files: that's a better idea, but still not great. How about: we could have Git save them for us?
[1] This is technically impossible, and provably so; see the pigeonhole principle. The hash ID is large enough that we have reason to hope that failure won't happen in any of our lifetimes.
Branch names store hash IDs
This is what branch names are, in Git: they are just a way to store a hash ID. Git stores only the hash ID of the latest commit, e.g., `H`, in the name. As with the commits themselves, we say that the branch name points to a commit, and we can draw that in now:
...--F--G--H <-- main
I've gotten lazy about drawing in the arrows from commit-to-commit. That's in part because they literally can't change: like the files inside each commit, the metadata is frozen for all time. Commit `H` points backwards to `G`, and will do so forever, or at least as long as commit `H` exists somewhere in some repository.
The names, though, do change. The name `main` currently holds `H`. Someday it might hold a different hash ID. We can also create and destroy branch names whenever we like, so we can add a new name now, such as `dev` for development:
...--F--G--H <-- dev, main
We now need a way to remember which name we're working with, because Git normally has us work with one branch name at a time. We'll run `git checkout dev` or `git switch dev` to pick the name `dev` to work with, and to remember that in our drawings, let's attach the special name `HEAD` to one of the two branch names. We start out on `main`, like this:
...--F--G--H <-- dev, main (HEAD)
We're currently on branch `main`, as `git status` will say. That means we're using commit `H`; we'll come back to this in a moment. Then we run `git switch dev` or `git checkout dev`. There's no real difference between these, except that `git switch` was new in Git 2.23. It doesn't do as much as the heavily overloaded `git checkout`, so it's better because it's less confusing (this is the "less is more" philosophy, which is inaccurate: less is less, it's just that sometimes, less is also better for humans). The result is:
...--F--G--H <-- dev (HEAD), main
We're still using commit `H`. We're just doing that now through the name `dev`.
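The following sketch shows that a branch name is nothing more than a stored hash ID, and that switching branches only moves `HEAD` when both names hold the same commit. The repository, file name, and identity are made up for illustration. I use `git checkout` here so the sketch also runs on Git versions older than 2.23; `git switch dev` does the same thing on newer ones.

```shell
set -eu
repo=$(mktemp -d)                        # throwaway directory (assumption of this sketch)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # hypothetical identity
git config user.email "author@example.com"

echo "hello" > file.txt
git add file.txt
git commit -q -m "commit H"

git branch dev                           # a second name for the same commit

main_hash=$(git rev-parse main)
dev_hash=$(git rev-parse dev)
echo "main holds $main_hash"
echo "dev  holds $dev_hash"

before=$(git symbolic-ref --short HEAD)  # the name HEAD is attached to
git checkout -q dev                      # or: git switch dev (Git 2.23 and later)
after=$(git symbolic-ref --short HEAD)
head_hash=$(git rev-parse HEAD)
echo "HEAD was attached to $before, now to $after, still using $head_hash"
```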
Git's index and your working tree
The files in a commit are read-only, and in fact, only Git can read them. (Git currently has two different ways of compressing files; depending on which one is in use, some programs could read one form pretty easily, but in general, most of the programs on your computer probably can't.) This makes them useless for getting any actual work done, because you need files that your programs can read and write. So before we begin working with commit `H`, Git has to extract the files.
The extracted files from `H` go into a work area, which Git calls your working tree or work-tree. It's important to realize that these files are not in Git at all. They came out of Git, to be sure, and they may go back into Git later, but right now they're just ordinary files, not Git files. You can now do anything you want with them. You can get work done!
Now, the really tricky bit here is that when Git extracted all the files from the commit into your working tree so that you could work on them, Git also extracted the files into what Git calls the index, or the staging area, or—rarely these days—the cache. These are three names for the same thing, and what it holds is, in short, your proposed next commit. Git keeps the files for the next commit in the compressed and de-duplicated form, and the index keeps track of those. The files that are in the index are called tracked.
If and when you do edit some file in your working tree, you will eventually have to run `git add` on it. The reason for this is simple: the copy that's in your working tree isn't in Git at all, and for the next commit, Git needs a copy that is in Git, and is compressed and de-duplicated. Running `git add file` tells Git: Read the working tree copy of `file`. Compress that file down into the internal format that you use for commits, and see if it's a duplicate. Prepare it for the next commit. This replaces the copy that's in Git's index.
What's in Git's index, then, are copies (but pre-de-duplicated) of the files that will go into the next commit. That's why I said just now that the index holds your proposed next commit. The key difference between the files in the current commit and the files in the index is that you can change out the files in the index. You can even add all-new files (`git add` of a file that's not yet in the index puts it there) or remove existing files: `git rm file` removes a file from both Git's index and your working tree, and now it won't be in the next commit.
When you run `git status`, Git runs two separate comparisons:
First, Git compares the current commit, as found by the branch name to which `HEAD` is attached, to what's in Git's index. For all the files that are the same, Git says nothing at all. For any file that is different, Git says that this file is `staged for commit`. That's where the name staging area comes from.
Then, having listed out any different staged-for-commit files, Git now compares what's in its index / staging-area to the files in your working tree. For files that are the same, Git says nothing at all, again. For files that are different, Git says that they are `not staged for commit`: you can and should run `git add` to copy the working tree copy into the index if you want.
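You can run roughly the same two comparisons by hand. In this sketch (throwaway repository, made-up file names and identity), `git diff --cached` compares the current commit to the index, and plain `git diff` compares the index to the working tree.

```shell
set -eu
repo=$(mktemp -d)                        # throwaway directory (assumption of this sketch)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # hypothetical identity
git config user.email "author@example.com"

echo "v1" > staged.txt
echo "v1" > unstaged.txt
git add staged.txt unstaged.txt
git commit -q -m "commit H"

echo "v2" >> staged.txt
git add staged.txt                       # index copy now differs from the commit
echo "v2" >> unstaged.txt                # working tree copy differs from the index
echo "new" > untracked.txt               # in the working tree only, not in the index

staged=$(git diff --cached --name-only)  # comparison 1: current commit vs index
not_staged=$(git diff --name-only)       # comparison 2: index vs working tree
echo "staged for commit:     $staged"
echo "not staged for commit: $not_staged"

git status --short                       # the same information, plus the untracked file
```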
Because you can create any file you like at any time in your working tree, you may have working tree files that are not in Git's index. Normally, Git will now complain about these files, calling them untracked. To shut up these complaints, you can list these file names, or patterns like `*.o` or `*.pyc`, in a `.gitignore` file or equivalent. This doesn't actually make the files stay un-committed: it just shuts up the `git status` complaint here. The files are untracked because they're not in Git's index. Since the index holds the proposed next commit, they won't be in the next commit, unless you add them.
If you do try, explicitly, to add an untracked-and-ignored file, Git will warn you that it didn't do that because you said to ignore it. To force Git to add such a file, you can use `git add --force`. That will override the untracked-and-ignored status, and copy the file into Git's index. Once it's in Git's index, `git add` will be happy to update it from the working tree copy, regardless of anything in any `.gitignore`. So `.gitignore` doesn't mean ignore, but rather don't complain (with `git status`) and don't add if not there (with `git add`). This also handles any en-masse "add all" operations like `git add .` or `git add --all`: files that are untracked-and-ignored are silently omitted here.
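A short sketch of that behavior, using a throwaway repository and a made-up `build.o` file: the ignored file draws no complaint, a plain `git add` refuses it, `git add --force` puts it in the index, and after that a plain `git add` updates it like any other tracked file.

```shell
set -eu
repo=$(mktemp -d)                        # throwaway directory (assumption of this sketch)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # hypothetical identity
git config user.email "author@example.com"

echo "*.o" > .gitignore
git add .gitignore
git commit -q -m "ignore object files"

echo "object code" > build.o
status_before=$(git status --porcelain)  # empty: no complaint about build.o

# A plain "git add" of an untracked-and-ignored file is refused.
if git add build.o 2>/dev/null; then plain_add=accepted; else plain_add=refused; fi
echo "plain git add was $plain_add"

git add --force build.o                  # override the ignore rule
tracked=$(git ls-files build.o)          # build.o is now in the index

# Now that it is tracked, .gitignore no longer matters for this file.
echo "rebuilt" > build.o
git add build.o
unstaged_after=$(git diff --name-only)   # empty: the index matches the working tree
```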
Making a new commit
Once you've updated your working-tree files, run `git status`, and used `git add` to get all the updates into Git's index so that all your important changes or new files or deleted files show up as "staged for commit", you simply run `git commit`. Git will now:
- collect a log message from you, to put in the metadata;
- collect the other metadata it needs: your name and email address, for instance, and the current date-and-time from the computer's clock;
- use `HEAD` and the current branch name to find the current commit hash ID;
- turn all the ready-to-go files in Git's index into a new snapshot; and
- write out a new commit with this metadata and snapshot.
The new commit (let's call this commit `I`) has a new, unique, never-used-before, never-to-be-used-again hash ID. It has, as its parent, the current commit, which is commit `H` because we had:
...--G--H <-- dev (HEAD), main
when we ran `git commit`. We now have:
          I <-- dev (HEAD)
         /
...--G--H <-- main
and this is because the very last step of `git commit` is to write the new commit's hash ID into the current branch name. Since `HEAD` is attached to `dev`, not `main`, it's `dev`, not `main`, that now points to new commit `I`. So now our branch names, which used to both point to the same commit, point to two different commits. New commit `I` is only on branch `dev`, not on branch `main`.
If you make several more commits—as in your case—they get more new hash IDs. I'm going to draw two instead of three here just to make my drawings prettier, but overall everything works out the same here:
          I--J <-- dev (HEAD)
         /
...--G--H <-- main
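This sketch reproduces the drawing in a throwaway repository (made-up file names and identity): two commits made while `HEAD` is attached to `dev` move `dev` forward and leave `main` where it was.

```shell
set -eu
repo=$(mktemp -d)                        # throwaway directory (assumption of this sketch)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # hypothetical identity
git config user.email "author@example.com"

echo "h" > h.txt; git add h.txt; git commit -q -m "commit H"

git branch dev
git checkout -q dev                      # attach HEAD to dev

echo "i" > i.txt; git add i.txt; git commit -q -m "commit I"
echo "j" > j.txt; git add j.txt; git commit -q -m "commit J"

main_hash=$(git rev-parse main)          # still H
dev_hash=$(git rev-parse dev)            # now J
two_back=$(git rev-parse dev~2)          # J's parent's parent, which is H

git --no-pager log --oneline --graph --decorate --all
```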
Clones, remotes, and remote-tracking names
The above is all about working locally on a Git repository, e.g., on your laptop. But Git is not just a version control system (VCS): it's a distributed version control system, or DVCS. There are multiple copies of each repository, on multiple computers. This "D" part of the DVCS means that other people, on these other computers, can be doing other work on other copies of the repository. You make a copy of some Git repository (one that you and they keep on GitHub, for instance) and they make copies too, and all of you do your work in your own VCS (usually Git) and eventually send your work to each other, or back to GitHub.
The way Git handles the Distributed part means that you don't have to have a central site like GitHub, but having such a site makes a lot of people more comfortable and has certain benefits. So we'll look at things with a GitHub-centric eye here. I'm also going to call your computer your "laptop", even if it's a desktop or deskside computer, just for easier reference.
You and your co-workers / colleagues view the GitHub copy as the "source of truth": what's in that repository is for real. So you start by cloning the central repository:
git clone ssh://git@github.com/org/repo.git
for instance (perhaps you prefer `https://` URLs). This clone operation makes a new, initially-totally empty repository on your laptop: such a repository has no commits and no branches. But your Git software then immediately obtains, from the GitHub Git software reading the central repo (let's call this "their Git"), all of their branch names (and any other names that matter, such as tag names) and the commit hash IDs that go with these. Your Git software is now ready to copy stuff into your Git repository.
Your Git software, running on your repo (let's call this "your Git"), starts by saving the URL under the name `origin`. (You can choose some other name when you run `git clone`, but normally nobody does that.) Then your Git asks their Git to send over those commits, by hash ID, and their parent commits, by hash ID, and the parents' parents, and so on, so that their Git ends up sending every commit. Your Git saves these commits away under these same hash IDs: they are, after all, the same commits, so they get the same hash IDs.
When they're done sending over all their commits, your Git takes all their branch names and changes them. Your Git sticks `origin/` in front of each name: their `main` becomes your `origin/main`, their `dev` (if they have one) becomes your `origin/dev`, their `feature/short` becomes your `origin/feature/short`, their `feature/tall` becomes your `origin/feature/tall`, and so on. Whatever they have, your Git sticks `origin/` in front, because that's the name of the remote. Your Git is turning their branch names into your remote-tracking names.
In the end, your Git has copied all of their commits, but replaced all their branch names with your own remote-tracking names. It's easy to convert between branch name and remote-tracking name, because we just add or remove `origin/`. The point of all this funny business, though, is this: Just before your `git clone` finishes, your Git creates one branch in your repository. The one branch your Git creates is the name you select with the `-b` option when you run `git clone`. If you don't select a name (and usually people don't), your Git asks their Git which name they recommend, and usually, they recommend `main` (in modern usage) or sometimes `master` (left over from a year or two ago, and still the default on many systems). You have an `origin/main` or `origin/master` because they have `main` or `master`, and your Git thus creates your `main` or `master` from their `main` or `master`, which in your Git, is `origin/main` or `origin/master`.
So, what we've been drawing like this:
          I--J <-- dev (HEAD)
         /
...--G--H <-- main
really looks like this:
          I--J <-- dev (HEAD)
         /
...--G--H <-- main, origin/main
(assuming they have only branch `main`: if they have more branches, there are more `origin/` names in your Git).
Now, since the time you made your clone, someone else made a clone and made new commits in their clone and then sent those new commits back to the central GitHub repository. So you had to pick up these new commits. You do this with `git fetch`. Some people run `git pull` which does run `git fetch`, but if you're new to Git, I advise starting with your own `git fetch` yourself: don't start using `git pull` until you've learned how to fetch and then either merge or rebase. When you run `git fetch` (either literally, or indirectly via `git pull`) your Git calls up the GitHub Git software and connects to the central repo again.
As before, your Git has their Git list out their branch names and hash IDs. This time, though, their `main` points to some new commit that you don't have. Your Git asks their Git for that hash ID, and that commit's parent(s), and so on, until your Git gets to a hash ID that you do already have. Their Git then packages up and sends over just the new-to-you commits, which your Git adds; and finally, your Git updates your remote-tracking names according to their branch names. So now you have, in your repository, this:[2]
          I--J <-- dev (HEAD)
         /
...--G--H <-- main
         \
          K--L <-- origin/main
Since you and they have both done work in parallel, you now need to combine the work. This is a job for `git merge`.
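The whole clone-and-fetch story can be rehearsed on one machine. In this sketch a local bare repository named `central.git` stands in for the GitHub copy, and `seed`, `laptop`, and `coworker` are made-up clone names; the `identify` helper, file names, and identity are likewise hypothetical. It ends in the state drawn above.

```shell
set -eu
work=$(mktemp -d)                        # throwaway directory (assumption of this sketch)
cd "$work"

identify() {                             # hypothetical helper: set a committer identity
    git config user.name "Example Author"
    git config user.email "author@example.com"
}

# A local bare repository stands in for the central GitHub copy.
git init -q --bare central.git
git --git-dir=central.git symbolic-ref HEAD refs/heads/main

# Seed the central repository with commit H.
git init -q seed
cd seed
git symbolic-ref HEAD refs/heads/main
identify
echo "h" > h.txt; git add h.txt; git commit -q -m "commit H"
git push -q ../central.git main
cd ..

# Your clone: their main becomes your origin/main, and your Git creates your main.
git clone -q central.git laptop

# A colleague clones, makes commits K and L, and sends them to the central copy.
git clone -q central.git coworker
cd coworker
identify
echo "k" > k.txt; git add k.txt; git commit -q -m "commit K"
echo "l" > l.txt; git add l.txt; git commit -q -m "commit L"
git push -q origin main
cd ..

# Meanwhile you make commits I and J on dev, then fetch.
cd laptop
identify
git checkout -q -b dev
echo "i" > i.txt; git add i.txt; git commit -q -m "commit I"
echo "j" > j.txt; git add j.txt; git commit -q -m "commit J"
git fetch -q origin                      # updates origin/main, leaves main and dev alone

main_hash=$(git rev-parse main)          # H
remote_hash=$(git rev-parse origin/main) # L
shared=$(git merge-base dev origin/main) # the commit both lines of work started from
git --no-pager log --oneline --graph --decorate --all
```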
[2] For posting reasons I'm using just two commits on each side. I'm also using different branch names as I think it's less confusing. Here's a drawing that is closer to your actual situation:
          I--J--K <-- working_branch
         /
...--G--H
         \
          L--M--N--O--P--...--W <-- origin/working_branch
Merging works the same way regardless of the number of commits, though, as long as there's at least one on each "side".
Merging
Merging is, as we just said, about combining work.
We know that every commit has a full snapshot of every file, and that if we move along from parent commit to child commit (e.g., from `H` to `I`, and then from `I` to `J`) we'll see what changed in that commit. But what if we just compare the snapshot in `H` directly to the snapshot in `J`? Will that work? What will we get?
It's worth thinking about this for a while, and working through some examples, but in fact it works just fine: we get a summarized recipe from Git that, if applied to the snapshot in `H`, produces the snapshot in `J`. That is, the diff output, from:
git diff --find-renames <hash-of-H> <hash-of-J>
will tell us which files need changes, and what the final changes are, to get from `H` to `J`, without having to go through the intermediate `I` version. This works no matter how many commits there are in between.[3] So a quick diff from `H` to `J` (or in footnote 2, from `H` to `K`), will show what you did on your branch. That is, such a change, applied to `H`, will make (or keep) all of your changes.
The same principle applies with their changes: a diff directly from `H` to `L` (or in footnote 2, from `H` to `W`) finds a shortcut recipe that will make, or keep, all of their changes.
This is just what `git merge` does. We run `git merge origin/main` while we're on `dev`, using commit `J`, and Git finds commit `L` (because `origin/main` points to `L`) and then works its way backwards to find the best shared commit, one that is on both branches. That's commit `H` here: it's on both `dev` and `origin/main`, and it's the best one because going further back doesn't help any, but going forward means we don't keep both sets of changes correctly. Git calls this commit the merge base.
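You can find the best shared commit and produce the two recipes yourself. In this sketch a local branch named `theirs` stands in for `origin/main` (an assumption made to keep the example short), and the file names and identity are made up. `git merge-base` prints the hash ID of the best shared commit.

```shell
set -eu
repo=$(mktemp -d)                        # throwaway directory (assumption of this sketch)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # hypothetical identity
git config user.email "author@example.com"

echo "h" > h.txt; git add h.txt; git commit -q -m "commit H"
git branch theirs                        # stand-in for origin/main

git checkout -q -b dev                   # our side: commits I and J
echo "i" > ours.txt;  git add ours.txt; git commit -q -m "commit I"
echo "j" >> ours.txt; git add ours.txt; git commit -q -m "commit J"

git checkout -q theirs                   # their side: commits K and L
echo "k" > theirs.txt;  git add theirs.txt; git commit -q -m "commit K"
echo "l" >> theirs.txt; git add theirs.txt; git commit -q -m "commit L"
git checkout -q dev

base=$(git merge-base dev theirs)        # best shared commit: H
hash_of_h=$(git rev-parse main)

# The two shortcut recipes: what we changed, and what they changed, since H.
ours_changed=$(git diff --find-renames --name-only "$base" dev)
theirs_changed=$(git diff --find-renames --name-only "$base" theirs)
echo "best shared commit: $base"
echo "we touched:   $ours_changed"
echo "they touched: $theirs_changed"
```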
So, Git runs the two `git diff` commands, which gives it a list of changes from each of the two "sides" or "branches". Git can then combine the two lists of changes:
- If we touched some file, and they didn't, Git keeps all of our changes.
- If they touched some file, and we didn't, Git keeps all of their changes.
- If we and they touched the same file, Git has to work harder: it has to figure out which lines we might both have touched, if any. If any changes overlap,[4] we will in general see what Git calls a merge conflict. The one exception here is that if we and they both make the exact same changes to the exact same lines, Git will just take one copy of the changes.
In any case, Git then tries to apply the combined change to the file from the merge base (`H`). Quite often, Git can do the entire merge on its own, with no merge conflicts. If that's the case, Git will go on and make a new commit on its own, which we can draw like this:
          I--J
         /    \
...--G--H      M <-- dev (HEAD)
         \    /
          K--L <-- origin/main
I dropped the name `main` from the drawing for space reasons; it's still there, still pointing to commit `H`, for this merge, but it's too hard to draw in as plain-text. The new commit `M`, however, has gone on branch `dev`, the way new commits always do: `HEAD` is attached to `dev`, so `dev` now points to the new commit.
Commit `M` points back to commit `J`, just like every commit points to its parent. What makes commit `M` special, though, is that it also points back to commit `L`: the commit we named when we ran `git merge origin/main`. That tells Git that commit `M` is a merge, and it brings commits `K-L` onto branch `dev`. That is, before the merge, branch `dev` meant commits up through and including `J`,[5] but not `K` or `L`. But after the merge, every commit including `K` and `L` is on `dev`.
In other words, by having two parents, commit `M` introduces more commits to the branch. That's what a merge commit does: it has a single snapshot as usual, and it has metadata as usual, but it has more than one parent so it makes more commits find-able just from the one branch name.
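Here is a sketch of a conflict-free merge, with a local branch named `theirs` again standing in for `origin/main` and made-up file names and identity. After the merge, the new commit's first parent is `J`, its second parent is `L`, and their commits have become reachable from `dev`. I pass `--no-edit` so that Git uses its default merge message instead of opening an editor.

```shell
set -eu
repo=$(mktemp -d)                        # throwaway directory (assumption of this sketch)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # hypothetical identity
git config user.email "author@example.com"

echo "h" > h.txt; git add h.txt; git commit -q -m "commit H"
git branch theirs                        # stand-in for origin/main

git checkout -q -b dev
echo "i" > ours.txt;  git add ours.txt; git commit -q -m "commit I"
echo "j" >> ours.txt; git add ours.txt; git commit -q -m "commit J"
git checkout -q theirs
echo "k" > theirs.txt;  git add theirs.txt; git commit -q -m "commit K"
echo "l" >> theirs.txt; git add theirs.txt; git commit -q -m "commit L"
git checkout -q dev

hash_of_j=$(git rev-parse dev)
hash_of_l=$(git rev-parse theirs)

git merge -q --no-edit theirs            # different files on each side: no conflict

first_parent=$(git rev-parse 'HEAD^1')   # J: where dev was before the merge
second_parent=$(git rev-parse 'HEAD^2')  # L: the commit we named on the command line

# L (and therefore K) is now reachable from dev.
if git merge-base --is-ancestor theirs dev; then reachable=yes; else reachable=no; fi
git --no-pager log --oneline --graph --decorate --all
```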
Sometimes, though, you get merge conflicts. In this case, `git merge` stops in the middle, leaving the merge half-done. Your job, as a programmer, is now to finish the merge.
[3] If you rename files, a step-by-step comparison going from one commit to the next will sometimes work better, given some of the other things that Git does and does not do. It would be nice if there were a way to make `git merge` do this step-by-step thing. There isn't, though.
[4] The test Git actually uses here is "overlap or abut": if we modify lines 10 through 13 inclusive, for instance, and they modify lines 14 through 16, our changes "touch at the edge", i.e., abut, and Git declares a merge conflict. The only reason given for "why" is that experience with tens of thousands of merges shows that this is better than not doing so.
[5] Note that commits up through and including `H` are on all three branches, `main`, `dev`, and `origin/main`. That is, they're on branch `origin/main` if `origin/main` is a branch. Is it? That depends on who, or how, you ask.
Handling merge conflicts
When Git stops in the middle of a merge, it generally leaves a mess behind. You have to fix this mess. There are two components to the mess:
- Git leaves stuff in the index that tells Git don't commit, the merge is unfinished. This stuff is useful for finishing the merge.
- Git leaves, in your working tree files, its best effort at doing the merge. For each conflicted file, there may be conflict markers. I say may be because there are high-level conflicts that we won't cover here. The low-level conflicts do leave conflict markers in the files.
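Both components of the mess can be inspected. This sketch forces a low-level conflict by changing the same line on both sides (throwaway repository, made-up names, and a local branch `theirs` standing in for `origin/main`). `git ls-files -u` lists the extra index entries: one each for the merge base, our version, and their version.

```shell
set -eu
repo=$(mktemp -d)                        # throwaway directory (assumption of this sketch)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/main
git config user.name "Example Author"    # hypothetical identity
git config user.email "author@example.com"

echo "shared line" > file.txt
git add file.txt; git commit -q -m "commit H"
git branch theirs                        # stand-in for origin/main

git checkout -q -b dev
echo "our line" > file.txt;   git add file.txt; git commit -q -m "our change"
git checkout -q theirs
echo "their line" > file.txt; git add file.txt; git commit -q -m "their change"
git checkout -q dev

# Both sides changed the same line, so the merge stops part-way.
if git merge --no-edit theirs >/dev/null 2>&1; then outcome=merged; else outcome=conflict; fi

unmerged=$(git diff --name-only --diff-filter=U)      # files still marked as conflicted
stages=$(git ls-files -u | grep -c 'file.txt')        # index entries: base, ours, theirs
git ls-files -u

cat file.txt                             # Git's best effort, with conflict markers
```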
To fix these, you can:
- open the conflicted files in any editor you like, and resolve the conflicts manually and write the resolved file back to the working tree, or
- use `git mergetool` to run any merge tool you like.
The `git mergetool` command uses the extra information that's in the index to locate the conflicted files, and to find the three input files: merge base, "ours" or "LOCAL", and "theirs" or "REMOTE". It then runs the merge tool you choose (Git has no built-in merge tools of its own, but there are a number of free ones you can install, or your OS may provide some) and the merge tool's job is to write the resolved file back to the working tree.
Either way, then, the resolved file ends up in the working tree, as an ordinary file. You can then run `git add` on it, or, if you use `git mergetool`, Git will automatically run `git add` on it. This `git add` cleans up the index, marking the file as resolved. Git then believes that the conflicts are resolved and that the working tree file contains the right merge result, regardless of what you did with the working tree file. If you didn't update the working tree file, `git mergetool` may ask you whether it should run `git add`: don't do it, because the file isn't merged. If you have not merged the file, don't `git add` the marked-up-with-conflicts file!
If you know that the merge result should be your version of the file, regardless of any changes they made, there is a shortcut way to do that (`git checkout --ours` or `git restore --ours`, naming the file), but be very sure that this is correct before you do it. Look carefully at their changes: run `git diff` by hand if you need to, to see what they did, before just discarding their changes with `--ours` here.
In any case, once all the conflicts are resolved and `git add`-ed, you should run:
git merge --continue
or simply:
git commit
(both do the same thing, committing the merge result). That makes the merge commit `M`, just as Git would have done on its own if there had not been a conflict before.
If you decide you want to give up on merging for now, you can use:
git merge --abort
to stop the merge and go back to the state you had before you started the `git merge` command at all.