I closed the erroneous PR, but do I just go ahead and make a new change to the file and commit and push again?
Generally, yes. But, as you've seen in comments, there are some complications.
Long: everything you need to know about GitHub PRs
There are several things to understand here. These come under two general topics:
- Git doesn't care about branches. Git only cares about commits.
- Git does not have Pull Requests. PRs are an add-on, provided by various web hosting sites, because they're useful in terms of making certain operations easy ("one click" web thingies for instance). As a result, the specific details for updating or replacing a pull request vary somewhat between different hosting sites.
Still, the fact that Git itself only really cares about commits has a certain leak-over effect into these per-hosting-site PR mechanisms. So the two topics intertwine.
How Git uses branch names to find commits
Let's start with the Git command line. We run git log
or git log --oneline
, or maybe git log --all --decorate --oneline --graph
("Git Log with A. D.O.G."; see Pretty Git branch graphs). Git spills out stuff like the following:
* 670b81a890 (HEAD -> master, origin/next, origin/master, origin/HEAD) The second batch
* 98f3f03bcb Merge branch 'fc/doc-build-cleanup'
|\
| * 7ba3016729 doc: avoid using rm directly
| * db10fc6c09 doc: simplify Makefile using .DELETE_ON_ERROR
| * 471e7b2cf6 doc: remove unnecessary rm instances
| * 56da21392b doc: improve asciidoc dependencies
| * 12d078ed2b doc: refactor common asciidoc dependencies
* | 2019256717 Merge branch 'ab/test-lib-updates'
|\ \
| * | f0d4d398e2 test-lib: split up and deprecate test_create_repo()
Each of these stars, along with its connecting lines, represents one commit. Each commit itself is numbered: that big ugly hash ID, here trimmed to a mere 10 characters—each one is 40 characters long—is a unique number that means that commit, and only ever that particular commit. (These particular commits are in a clone of the Git repository for Git.)
Each commit, which Git finds by its unique hash ID, stores two things:
- A commit stores some metadata, including the name and email address of its author, for instance. The metadata tell us who made the commit, when, and (at least if the author is conscientious) why they made that particular commit. But it also stores the hash ID of some earlier commit, or commits.
- Meanwhile, each commit also stores a full snapshot of every file. That's where Git gets the files that it will put into your working tree, if and when you select that commit to be extracted.
By comparing any two snapshots—often, two adjacent ones such as 7ba3016729
and db10fc6c09
for instance—Git can tell you which files changed between those two snapshots, and show you the exact changed lines. But to do this, Git has to find the hash IDs of the two commits.
You can give these to Git directly yourself:
git diff db10fc6c09 7ba3016729
for instance will have Git extract, into a temporary (in-memory) area, these two commits, and compare their snapshots. The result is that we see that Documentation/Makefile
changed. Or, we can give Git—or perhaps GitHub—the full hash ID of the later of this particular pair of commits, and Git will automatically find the parent for us and compare the two commits (try the GitHub link here, and note that it has embedded in it just the later commit hash ID).
Git finds an earlier commit's hash ID using the metadata in the later commit. By giving Git—or GitHub—the full raw hash ID 7ba30167291eb89f2e587b7cabfa4e7555de4ed5
, Git can start at that commit and work backwards. The commit's parent, db10fc6c09f1f74c4d0a9294ecbb68d390f54f15
, is a commit too, so it has metadata, which gives yet another parent hash ID. That is yet another commit, so it has metadata, and so on.
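To see the parent link for yourself, here is a minimal sketch that builds a throwaway two-commit repository and prints the raw commit object. The temporary directory, the file name, and the "demo" identity are placeholders made up for the illustration; only the Git commands themselves are real.

```shell
set -eu
# Throwaway repository; the path and the "demo" identity are placeholders.
repo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q "$repo"
cd "$repo"
echo one > file.txt
git add file.txt
git commit -q -m "first"
echo two > file.txt
git commit -q -am "second"

# Print the raw commit object: a tree line (the snapshot), a parent line,
# then the author, the committer, and the log message.
git cat-file -p HEAD

# The parent hash ID recorded in the newer commit is how Git steps backwards.
recorded=$(git cat-file -p HEAD | sed -n 's/^parent //p')
parent=$(git rev-parse HEAD^)
echo "recorded parent: $recorded"
```

The `parent` line in the object and the result of `git rev-parse HEAD^` name the same commit, which is all that "working backwards" means here.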
Using this metadata, Git can work backwards through the history in a repository. The commits are the history, via the snapshots and metadata. But there's one big hitch: where do we get the hash ID of the last commit in some string of commits? The answer is, typically, from a branch name like master
.
The very first line of the git log
output I quoted above begins with:
* 670b81a890 (HEAD -> master, origin/next, origin/master, origin/HEAD)
The stuff in the parentheses here are what Git calls the decorations, from git log --decorate
. This --decorate
flag is the default now and has been for quite some time, but if you have an ancient version of Git, you may still have to use an explicit --decorate
. What it does is have Git look at all your branch names, all your remote-tracking names, and all your tag names and other such names. Each of these names stores one hash ID. In my particular case, three of my Git repository's names—master
, origin/master
, and origin/next
—all store hash ID 670b81a890388c60b7032a4f5b879f2ece8c4558
.
When I run git log --decorate --oneline --graph
and don't tell Git where to start the log operation, what Git does is this:
Look up the name HEAD
. This particular name contains another branch name:
$ cat .git/HEAD
ref: refs/heads/master
As a result, look up the branch name master
(full name refs/heads/master
). This contains 670b81a890388c60b7032a4f5b879f2ece8c4558
: the hash ID of the first commit to be printed.
So that's where git log
starts. It uses hash ID 670b81a890388c60b7032a4f5b879f2ece8c4558
to find a commit, specifically this one. That commit has one parent, namely 98f3f03bcbf4e0eda498f0a0c01d9bd90de9e106
. That commit is a merge commit, with two parents instead of the usual one; this makes git log
's job harder, but what it does is to go on and display both of those commits (eventually), and their parents, and so on.
In other words, Git used the branch name—master
, in my case—to find the last commit in the branch. Then it used that last commit to find the second-to-last commit. Then it used that second-to-last commit to find yet more commits, from which it found still more commits, and so on. If I didn't stop it, git log --decorate --oneline --graph
would go on to list 63272 commits (at the moment).
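The two lookup steps can be reproduced by hand. This is a small sketch in a scratch repository (the path, file name, and identity are invented for the example); it uses `git symbolic-ref` rather than reading `.git/HEAD` directly, which reports the same information.

```shell
set -eu
repo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q "$repo"
cd "$repo"
# Pin the branch name so the example does not depend on local defaults.
git symbolic-ref HEAD refs/heads/master
echo hello > file.txt
git add file.txt
git commit -q -m "first"

# Step 1: HEAD contains a branch name.
head_target=$(git symbolic-ref HEAD)
# Step 2: the branch name contains a commit hash ID.
branch_tip=$(git rev-parse refs/heads/master)
# Asking for HEAD directly performs both steps at once.
head_tip=$(git rev-parse HEAD)
echo "$head_target -> $branch_tip"
```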
Let's reduce all of the above to a simple drawing
The actual hash IDs of commits are big, ugly, and random-looking. To make a simple drawing, let's replace the hash IDs with single uppercase letters. This wouldn't work in a real repository because we would run out of letters way too fast, but it's nice for a drawing:
...--F--G--H <-- master (HEAD)
\
I--J <-- feature
Here, we're on our master
branch. The name master
holds the hash ID of the latest master
commit. That's commit H
. Commit H
points backwards to earlier commit G
, which points backwards to another still-earlier commit F
.
If we run git checkout feature
or git switch feature
, we get this:
...--F--G--H <-- master
\
I--J <-- feature (HEAD)
The name feature
holds the hash ID of commit J
, which is the latest feature
commit. Commit J
points backwards to earlier commit I
, which points backwards to earlier commit G
.
Note that commit G
, and all earlier commits, are on both branches. This might be clearer if we draw the commits as:
H <-- master
/
...--F--G
\
I--J <-- feature
It's important to realize that these are the same drawings, even if they look a bit different. The commits that are "on" some branch are those we can get to by starting from the most recent commit, found by using the branch name, and working backwards.
When we git checkout
or git switch
to a branch and make a new commit, the new commit automatically extends the branch, like this:
H--K <-- master (HEAD)
/
...--F--G
\
I--J <-- feature
Here, we made a new commit on master
, creating commit K
. New commit K
points back to existing commit H
.
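The drawing can be rebuilt as a real, if tiny, repository. In this sketch the commit messages are the single letters from the drawing; the `mk` helper, the `log.txt` file, and the scratch directory are assumptions made up for the illustration.

```shell
set -eu
repo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master

# Hypothetical helper: make one commit whose message is the given letter.
mk() { echo "$1" >> log.txt; git add log.txt; git commit -q -m "$1"; }

mk F
mk G
git branch feature          # feature starts out pointing at G
mk H                        # master:  F--G--H
git checkout -q feature
mk I
mk J                        # feature: F--G--I--J
git checkout -q master
mk K                        # master:  F--G--H--K

git log --all --decorate --oneline --graph

# G is the last commit that is on both branches.
fork_subject=$(git log -1 --format=%s "$(git merge-base master feature)")
# Count the commits reachable from each name by walking backwards.
on_master=$(git rev-list --count master)
on_feature=$(git rev-list --count feature)
# New commit K points back to H.
k_parent=$(git log -1 --format=%s master^)
```

Each branch reaches four commits, and F and G are counted in both, which is what "on both branches" means.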
Sending commits from one Git repository to another
When we work with Git, we're working with a distributed system. Each clone of some repository has all the commits—or more precisely, all the commits it has. In particular, if we clone some repository:
git clone ssh://git@github.com/path/to/repo.git
and then someone adds new commits to the GitHub repository, we don't have those commits yet. Or, if we add new commits to our repository, the GitHub repository doesn't have our new commits yet.
To fix this, we need to be able to get commits from some other Git repository—such as the one over on GitHub—or send commits to some other Git repository (the GitHub one, again). This is where git fetch
and git push
come in.
Without going into all the gory details (there are many), what these two commands do is:
- have one Git repository call up another one;
- pick the caller as sender and the callee as receiver (git push), or the caller as receiver and the callee as sender (git fetch); and
- figure out which commits the sender should send and the receiver should add to their collection.
Git does this by the commit hash IDs. The hash IDs are unique: no two different commits ever use the same ID, and if two commits have the same ID, they are the same commit. In a sense, the ID is the commit. This is why commits must not change, and Git uses an internal consistency check to make sure that commits never do change. So whoever is the sender just says: "I have commit ________" (fill in the blank with a hash ID). Whoever is the receiver replies with "Oh, I don't have that, send it" or "No thanks, I have that already." If the receiver needs the commit (that is, does not have it), the sender is obligated to offer its parent commit(s) as well, and the receiver looks to see if it has those commits too, and replies as before.
In this way, the sender sends, to the receiver, all the commits that the sender has, that the receiver lacks, that the receiver will need. (The sender can choose which commits to offer at all, as is the case during git push
, or just list out all of its branch and other names and the hash IDs, as the usual case during git fetch
.) The receiver takes the new-to-it commits and stores them in its big database of all commits and other supporting objects.[1]
Having done all this, though, there's now a problem: Git finds commits by using names, such as branch names. If the receiving Git has new-to-it commits, it almost certainly needs to update some name or names.
Here, git fetch
and git push
work differently:
With git fetch
, the receiver normally updates remote-tracking names, such as origin/master
and origin/next
. These are names that your own local Git dedicates, just for commits obtained from the Git you're calling origin
. These are not branch names, at least not in your own Git repository. Each one remembers a commit that your Git saw their Git find through one of their branch names. They are using their master
to find some commit, so your Git sets your origin/master
to find that same commit.
If you want, you can, at this point, update your own master
to find the same commit. (I do this with this particular Git repository for Git myself—this is not the one I generally work in, it's mostly a mirror I update a bit lazily.) That's not part of git fetch
though: that's a second step.
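A short sketch of the two separate steps, using two local repositories in place of a hosting site. The directory names `theirs` and `mine` are invented for the example: `theirs` plays the role of the repository you cloned from, and `mine` is your clone.

```shell
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q "$work/theirs"
git -C "$work/theirs" symbolic-ref HEAD refs/heads/master
echo one > "$work/theirs/file.txt"
git -C "$work/theirs" add file.txt
git -C "$work/theirs" commit -q -m "first"
git clone -q "$work/theirs" "$work/mine"

# Someone adds a commit over there after we cloned.
echo two >> "$work/theirs/file.txt"
git -C "$work/theirs" commit -q -am "second"

# Step 1: fetch brings the commit over and moves only origin/master.
git -C "$work/mine" fetch -q origin
theirs_tip=$(git -C "$work/theirs" rev-parse master)
tracking=$(git -C "$work/mine" rev-parse origin/master)
mine_before=$(git -C "$work/mine" rev-parse master)

# Step 2 (optional, separate): move our own master forward to match.
git -C "$work/mine" merge -q --ff-only origin/master
mine_after=$(git -C "$work/mine" rev-parse master)
```

After the fetch, `origin/master` matches their `master` while our own `master` is still one commit behind; only the explicit second step moves it.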
With git push
, though, the sender usually tells the receiver: "Please, if it's OK, update one of your branch names to remember this particular commit hash ID." That is, you, on your laptop perhaps, add new commits to your own local repository, using some branch name. Then, using that same branch name, you have your Git send your new commits to your GitHub repository, and then ask your GitHub repository to set this same branch name to find the same commit.
At this point, we need to introduce the concept of fast-forwarding and forced updates. A typical git push
ends with a polite request: "Please, if it's OK, update your branch name _______ to hold hash ID _______" (fill in both blanks). But a forced git push
ends with a command: "Update your branch name _______ to hold hash ID _______!"
[1] There's also a vetting process, where the receiver can first inspect the commits and other data before deciding whether to accept them. We'll ignore this complication here.
Fast-forward operations
The polite request is the same as the forced-push command, except that it adds "if it's OK". What makes it OK? Let's go back to our graph drawings.
Suppose that we have, in our repository, these commits:
...--F--G--H <-- master (HEAD)
Suppose that the GitHub copy of this same repository has the same set of commits, and the same name, master
, selecting commit H
.
We make one new commit I
that points back to existing commit H
, so that it adds on to the branch:
...--F--G--H--I <-- master (HEAD)
If we now run git push origin master
, we'll have our Git call up their Git and send them new commit I
(it's new: we just made it, they can't possibly have it yet). Then we'll ask them to set their master
to select commit I
.
If they do that, their master
will end with commit I
, whose parent is H
. They still find all the same commits they found before; they've just added some on to the end.
But suppose that, instead of making a new commit I
that points back to H
, we sneakily, somehow,[2] remove commit H
from our own master
:
H [abandoned]
/
...--F--G <-- master (HEAD)
Then we make our own new commit I
to be used instead of commit H
:
H [abandoned]
/
...--F--G--I <-- master (HEAD)
If we now send commit I
to our GitHub copy, and ask GitHub to set our GitHub repo's master
to point to commit I
, they will say no! The reason they'll say no is that I
doesn't just add on to the existing commits. If they switch their master
to point to commit I
, they'll lose commit H
entirely.
There are two ways we can convince them to change their master
to point to I
anyway. One way is to use the --force
flag. That changes the "Please, if it's OK" polite request into the forceful command. GitHub will probably obey this command, as long as we own the repository.[3]
The other trick we can use is to delete the branch name entirely, or, if there's some administrative way to do this,[4] to rename the branch, so that the name master
is freed up to use to point to commit I
. Then, having gotten the old name master
out of the way, we can create a new master
, which we can point to whatever commit we want.
Since branch names normally move—to add more commits to the end of the branch—Git has a word for this, or maybe two words: fast-forward. When new commits just add on to a branch, that branch-name motion is a fast-forward operation. When the branch name can't just be "slid forward", though—as when we have master
back up from H
to G
before moving forward to I
—that's a non-fast-forward operation.
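One way to test this locally: moving a name from an old commit to a new one is a fast-forward exactly when the old commit is an ancestor of the new one, and `git merge-base --is-ancestor` answers that question. The sketch below is an illustration in a scratch repository; the `mk` helper and the commit message `I2` (for the replacement commit) are invented names.

```shell
set -eu
repo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master
# Hypothetical helper: make one commit whose message is the given label.
mk() { echo "$1" >> log.txt; git add log.txt; git commit -q -m "$1"; }

mk F
mk G
mk H
old_tip=$(git rev-parse master)     # H: what the other repository has

# Case 1: add I on top of H. The old tip is still reachable.
mk I
if git merge-base --is-ancestor "$old_tip" master; then
  ff_add=yes
else
  ff_add=no
fi

# Case 2: back up to G, abandoning H, then build a replacement on G.
git reset -q --hard "$old_tip^"
mk I2
if git merge-base --is-ancestor "$old_tip" master; then
  ff_replace=yes
else
  ff_replace=no
fi
echo "adding on: $ff_add, replacing: $ff_replace"
```

The first case is a fast-forward; the second is not, because accepting it would drop H from the branch.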
[2] We probably did this with git reset --hard HEAD^
or similar, but we might use git commit --amend
to do it all at once.
[3] Using GitHub's notion of a protected branch, we can make GitHub refuse to obey our own commands. If we do that, we will have to go in to GitHub using their web administration interface and de-protect the branch name master
long enough for us to get rid of bad commit H
. We can re-protect the name afterwards, or just give ourselves administrative privileges to override the protection, or whatever we want to do, but the point here is that we have to use GitHub's web interface to override the protections we set with GitHub's web interface.
None of this is a Git operation. Git doesn't have the notion of protected branches at all. This is all stuff that GitHub added on. What this means in practical terms is that if we decide to move everything to Bitbucket or GitLab, we'll have to change how we administer this, assuming that Bitbucket and GitLab even have the same add-on ideas.
[4] Note that to rename a branch in our own repository on our laptop, we use the command-line git branch -m
command. There's no git push
option to rename a branch; there's just the one to delete it. So git push --delete
can delete the GitHub branch, provided it's not protected from deletion. But renaming would require a web interface page.
Summary so far
In general, what you do is:
- clone a repository (perhaps before or after using GitHub's fork, which we'll get to in a moment) to your local machine (e.g., laptop);
- create new branch names in that local repository, and use those to keep track of new commits that you add;
- then run
git push
to send those new commits to a GitHub repository where you have permission to create or update branch names.
The git push
step will succeed provided that:
- you have permission, and
- the operation is a fast-forward (merely adds commits) or creates a new branch name.
If you get ! rejected (non-fast-forward)
, it means that you've chosen a git push
operation that would first remove some commit from the existing branch name over on GitHub (then maybe, or maybe not, add new commits too).
Assuming you're using git push
to send new commits to a GitHub repository that you own, you can use git push --force
if you wish to override a non-fast-forward error. This tells the Git over at GitHub that, yes, you really did intend to drop the old commits in favor of the new ones.
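Here is a sketch of the whole rejected-then-forced sequence, with a local bare repository standing in for the GitHub one. The names `hub.git` and `laptop` are placeholders; a plain bare repository has no branch protection, so it behaves like a GitHub repository you own with nothing protected.

```shell
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q --bare "$work/hub.git"
git -C "$work/hub.git" symbolic-ref HEAD refs/heads/master
# Cloning an empty repository prints a harmless warning.
git clone -q "$work/hub.git" "$work/laptop"
cd "$work/laptop"
git symbolic-ref HEAD refs/heads/master

echo one > file.txt
git add file.txt
git commit -q -m "G"
echo bad >> file.txt
git commit -q -am "H"
git push -q origin master          # creates master over there: fine

# Replace H: back up to G, then commit I instead.
git reset -q --hard HEAD^
echo good >> file.txt
git commit -q -am "I"

# The polite request is refused, since accepting it would lose H.
if git push -q origin master; then
  polite=accepted
else
  polite=rejected
fi

# The forceful command is obeyed.
git push -q --force origin master
hub_tip=$(git -C "$work/hub.git" rev-parse master)
local_tip=$(git rev-parse master)
```

The first push of `I` prints the `! [rejected]` message and fails; the forced one moves the other repository's `master` to `I`.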
You'll generally need this after a git rebase
, because rebase works by copying some old-and-now-bad commits to new-and-improved commits. No existing commit can ever be changed, so if some existing commit has a problem, and you want to fix that problem, you need to toss out the old commit in favor of the new-and-improved one. That means you're telling your Git to discard some old commit. But if you've sent that old commit to GitHub, their Git won't want to discard it. Your Git knows that the rebase is a "replace old and lousy commits with these new improved ones" operation, but their Git just sees "toss out some old commits, here are some new ones" without the "due to a rebase" part.
Clones and forks
In some cases—such as when you are an employee of some company—you may have direct access to a GitHub repository. It may have protected branches (you can't push directly to master
for instance), but you can create your own branch names in that GitHub repository, and use force-push with those branch names if needed. In this case, the picture stays relatively simple. There are just two repositories you'll worry about:
There's the corporate one over on GitHub. You might need to be just a bit more careful with this one, since breaking it gets everyone mad at you. Just be super-careful with force pushes, making sure that you only do this with your branches.
And, there's your private clone, on your computer (e.g., a laptop). If you wreck this somehow, you can just re-clone the corporate GitHub repository, so you don't have to be quite so careful (though of course losing work is no fun).
In this setup, when you go to make a PR, you just git push
to your own branch, then open the PR. If you need to update your PR, you can either git push
or git push --force
to your own branch: GitHub automatically updates the PR. The set of commits in the pull request is simply the set of commits in your GitHub branch that are not in the "base branch" (another branch in the same repository).
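You can preview that set locally with Git's two-dot range syntax: `base..branch` selects commits reachable from the branch but not from the base. The sketch below uses a scratch repository with `master` as the base and `feature` as the PR branch; the `mk` helper and single-letter commit messages are invented for the example.

```shell
set -eu
repo=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
git init -q "$repo"
cd "$repo"
git symbolic-ref HEAD refs/heads/master
# Hypothetical helper: make one commit whose message is the given letter.
mk() { echo "$1" >> log.txt; git add log.txt; git commit -q -m "$1"; }

mk F
mk G
git checkout -q -b feature
mk I
mk J
git checkout -q master
mk H

# Commits on feature that are not on master: what a PR would contain.
git log --oneline master..feature
pr_subjects=$(git log --format=%s master..feature)
pr_count=$(git rev-list --count master..feature)
```

This lists J and I only; F and G are excluded because the base branch already has them, and H is excluded because it is not on `feature` at all.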
But you might not have direct access to the GitHub repository. There's an alternative method, using GitHub's "fork" operation. Some companies set things up this way for safety reasons, and many open-source projects use this same technique. In this kind of setup, things get a bit more complicated, because now there are three repositories involved.
It's time to take a small detour, and introduce the difference between a clone and a fork. Let's look first at a regular clone.
A clone copies all the commits and none of the branches
Let's start with:
git clone ssh://git@github.com/path/to/repo.git
You run this command on your laptop, having set up your ssh key access to GitHub. Your Git creates a new, totally-empty repository: it has no commits and no branches. Your Git then adds the name origin
as a remote, so that the URL is saved, and connects to their Git at the URL. Their Git lists out all their branch and tag and other names and the corresponding commit hash IDs. Your Git says that you want everything, because you have no commits at all. They package up and send over everything, and your Git creates remote-tracking names for each of their branch names.
The result of all of this is that you now have all the commits, and no branches. Instead of branches, you have remote-tracking names to remember the last commit in each of their branches.
Finally, before returning control to the command line so that you can begin working, your Git creates one branch in the new repository, and does a git checkout
of that one name. The branch name that your Git creates comes from your -b
option, to your git clone
command. If you don't give a -b
option, your Git asks their Git, over at origin
, which name they recommend, and uses that name. Your Git creates your branch of that name based on the commit found by their branch of that name, which is now in your origin/whatever
remote-tracking name.
The final end result is that in your new clone, you have all of their commits and one branch. The one branch you have here is yours, although the name points to the same commit as their branch of your choice. You can now begin adding commits to your branch, or create new branch names.
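The following sketch shows the result of a clone, with and without `-b`. A local repository named `theirs` (a made-up name) stands in for the one on GitHub and has two branches, `master` and `next`.

```shell
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q "$work/theirs"
git -C "$work/theirs" symbolic-ref HEAD refs/heads/master
echo one > "$work/theirs/file.txt"
git -C "$work/theirs" add file.txt
git -C "$work/theirs" commit -q -m "first"
git -C "$work/theirs" branch next

# No -b: our Git creates the one branch that their Git recommends.
git clone -q "$work/theirs" "$work/mine"
# With -b: we choose which one branch gets created.
git clone -q -b next "$work/theirs" "$work/mine-next"

git -C "$work/mine" branch       # our branches: just one
git -C "$work/mine" branch -r    # remote-tracking names: one per branch of theirs

local_names=$(git -C "$work/mine" for-each-ref --format='%(refname)' refs/heads)
remote_names=$(git -C "$work/mine" for-each-ref --format='%(refname)' refs/remotes)
local_next=$(git -C "$work/mine-next" for-each-ref --format='%(refname)' refs/heads)
```

Both clones have remote-tracking names for `master` and `next`, but each has exactly one branch of its own.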
A fork copies both commits and branch names
When you use GitHub's fork button, GitHub makes a clone, but not in the standard way that produces remote-tracking names instead of branch names. Since this clone isn't on your computer, they instead copy all the branch names from the original repository. These are now separate names, but your GitHub fork has all the same branch names, pointing to the same commits, as the repository you just forked.
At least, it does now. This is where the problems start. Now that you have your own fork, any changes that update their branch names don't update your fork.
What this means for you is that you should now clone your fork, and, as soon as this git clone
process finishes, you should add a remote to your laptop clone. This remote needs a name and a URL. You already have a remote named origin
; the URL for this remote is your GitHub fork. But you need two remotes.
The usual second-remote-name is upstream
. I'm not a huge fan of this name but don't have my own recommendation, so if you wish to use upstream
, you'll run:
git remote add upstream ssh://git@github.com/path/to/original.git
This path/to/original.git
part is the URL that you need to give to GitHub to access the repository you forked.
Once you've done that, you will need to run:
git fetch upstream
to obtain any new commits they have that you don't—there probably aren't any, unless they've added new commits since you pushed the fork button—and to create, in your laptop Git repository, remote-tracking names for each of their branches.
Let's say that the upstream has branches named main
, feature/short
, and feature/tall
. Your GitHub fork will have branches named main
, feature/short
, and feature/tall
. Your clone on your laptop will have remote-tracking names: origin/main
, origin/feature/short
, and origin/feature/tall
.
You may or may not want to keep those branch names in your GitHub fork. You may or may not want to keep all those remote-tracking names in your laptop clone. But you probably do want to add, to your laptop clone, remote-tracking names upstream/main
, upstream/feature/short
, and upstream/feature/tall
. That's what your git fetch upstream
will do.
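The three-repository layout can be imitated locally. In this sketch, `original.git` and `fork.git` are local bare repositories standing in for the two GitHub repositories, and `git clone --bare` is used as a rough approximation of the fork button because it copies branch names as branch names. All of these names, and the `seed` scratch repository, are assumptions for the illustration.

```shell
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

# The "original" repository, filled with three branches from a scratch repo.
git init -q --bare "$work/original.git"
git -C "$work/original.git" symbolic-ref HEAD refs/heads/main
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/main
echo one > file.txt
git add file.txt
git commit -q -m "first"
git branch feature/short
git branch feature/tall
git push -q "$work/original.git" main feature/short feature/tall

# The "fork": same commits, same branch names.
git clone -q --bare "$work/original.git" "$work/fork.git"

# The laptop clone of the fork, plus the second remote.
git clone -q "$work/fork.git" "$work/laptop"
cd "$work/laptop"
git remote add upstream "$work/original.git"
git fetch -q upstream

git branch -r
names=$(git for-each-ref --format='%(refname)' refs/remotes)
```

The laptop clone ends up with an `origin/*` name and an `upstream/*` name for each of the three branches.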
Now, as new commits are added to upstream
's main
or feature/short
or whatever, you can run git fetch upstream
to get these new commits onto your laptop. You can then run git push origin upstream/main:main
to send those new commits to your GitHub fork and update your GitHub fork's main
, if you want to do that.
Wait, what's this new git push
?
I've just introduced a new git push
syntax here, so let's revisit git push
:
We run git push origin somebranch
to send our new commits from our branch somebranch
to our GitHub fork and create-or-update the branch name somebranch
over on the GitHub fork.
This uses what Git calls a refspec. The name somebranch
at the end here is short for somebranch:somebranch
. The two names, on the left and right side, have two different purposes:
The name on the left, somebranch
, is for our Git. Our Git looks up the commit hash ID using this name. That's how it knows which commit(s) to send.
The name on the right, somebranch
, is for the remote (origin
, the fork). That's the name we're going to ask (regular push) or command (force-push) them to create-or-update.
What we'll do, now that we have upstream/*
, is transfer new commits from the original repository on GitHub to our fork on GitHub. To do that, we bring those new commits into our laptop Git repository, updating upstream/main
, upstream/feature/short
, and so on.
Having gotten those new commits, we want to send them to origin
, so we can git push origin upstream/whatever:whatever
. The name on the left of the colon—upstream/main
for instance—locates the commit in our repository that we just got from upstream
. The name on the right of the colon is the branch name we want GitHub to update in our fork.
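A sketch of that refspec push, again with local bare repositories (`original.git`, `fork.git`) as stand-ins for the two GitHub repositories and a `seed` scratch repository playing the upstream developers. These names are invented for the example.

```shell
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q --bare "$work/original.git"
git -C "$work/original.git" symbolic-ref HEAD refs/heads/main
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/main
echo one > file.txt
git add file.txt
git commit -q -m "first"
git push -q "$work/original.git" main

git clone -q --bare "$work/original.git" "$work/fork.git"
git clone -q "$work/fork.git" "$work/laptop"
git -C "$work/laptop" remote add upstream "$work/original.git"

# The original repository gains a commit after the fork was made.
echo two >> file.txt
git commit -q -am "second"
git push -q "$work/original.git" main

cd "$work/laptop"
git fetch -q upstream
# Left of the colon: a name in OUR repository that finds the commit to send.
# Right of the colon: the branch name we ask the fork to update.
git push -q origin upstream/main:main

original_tip=$(git -C "$work/original.git" rev-parse main)
fork_tip=$(git -C "$work/fork.git" rev-parse main)
laptop_main=$(git rev-parse main)
```

The fork's `main` now matches the original's, even though the laptop's own `main` branch was never touched: the left-hand name was a remote-tracking name, not a local branch.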
Making a PR with a fork
Now that we have this fork, we use one other special feature of a GitHub fork. With a GitHub fork, we can make a pull request to the original repository we forked. To do that, we:
- create or update a branch in our GitHub fork (with
git push
from our laptop);
- use the pull request button on the GitHub page to make the new PR.
Any time we git push
to our GitHub fork, GitHub will automatically update the PR. So, if we need to revise our PR, we don't have to close or delete it first: we just have to push—or maybe force-push—to our GitHub branch. Of course, first we'll need the right set of commits, which often means we need to git fetch upstream
and then maybe rebase using upstream/feature/short
or upstream/main
or whatever as the new base. If this creates a non-fast-forward situation, we will have to use git push --force
or equivalent to update our GitHub fork afterward.
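Putting it together, here is a sketch of revising a PR branch after the upstream has moved. Local bare repositories (`original.git`, `fork.git`) stand in for GitHub, so no actual pull request exists here; the branch name `fix` and the file names are hypothetical.

```shell
set -eu
work=$(mktemp -d)
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q --bare "$work/original.git"
git -C "$work/original.git" symbolic-ref HEAD refs/heads/main
git init -q "$work/seed"
cd "$work/seed"
git symbolic-ref HEAD refs/heads/main
echo one > file.txt
git add file.txt
git commit -q -m "first"
git push -q "$work/original.git" main

git clone -q --bare "$work/original.git" "$work/fork.git"
git clone -q "$work/fork.git" "$work/laptop"
cd "$work/laptop"
git remote add upstream "$work/original.git"

# Our PR branch, pushed to the fork.
git checkout -q -b fix
echo fix > fix.txt
git add fix.txt
git commit -q -m "fix: first attempt"
git push -q origin fix

# Meanwhile the original repository gains a commit.
cd "$work/seed"
echo two >> file.txt
git commit -q -am "second"
git push -q "$work/original.git" main

# Revise: fetch, rebase onto the new base, then update the fork's branch.
cd "$work/laptop"
git fetch -q upstream
git rebase -q upstream/main
if git push -q origin fix; then
  polite=accepted
else
  polite=rejected          # expected: the rebase replaced our old commit
fi
git push -q --force origin fix

fork_fix=$(git -C "$work/fork.git" rev-parse fix)
laptop_fix=$(git rev-parse fix)
```

Git also offers `git push --force-with-lease`, which refuses the update if the remote branch has moved since you last fetched it; that can be a safer choice on branches other people might push to.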