
Ref: The following question from about 9 years ago:
Pull request without forking?

Background:
I am learning about GitHub/Git, and I am running into issues.  I have searched diligently but have found nothing that addresses this specific issue - the closest thing I have found is the question noted above.

Issue:
I "forked" a repository intending to do some work, make a change to my own fork, and then create a pull request back to the original project as a way to contribute to it.

I finally figured it out and was able to successfully create a pull request containing my proposed change.

Note that there are other things I want to do to contribute to this project and after I created the pull request, I continued work and made additional commits to my local copy including importing some technical documentation, etc.

Apparently, for whatever unknown reason, after I make a pull request, the pull request "owns" my fork of the original repo and anything I do thereafter becomes a part of that pull request - it doesn't matter whether it's related or not, whether I pushed it to the project's branch, whether I added it to the PR, or whatever.  It just appears as if by magic, and can only be removed if I remove/revert the changes in my own repository fork.

Does this mean that all work on anything that has to do with that project has to come to a complete stop until that PR is accepted and/or rejected?  If that's the case, how does anyone else, especially a company working on a single codebase, manage to get things done?

Of course, I am sure that this is possible, people do this all the time.

What research I have done has not disclosed anything that seems to address this specific issue, however other answers to different issues seem to hint at the fact that, once you fork a repo and create a pull request, the pull request DOES appear to "own" that instance of your local repo - and the only way to mitigate this is to:

  • Fork the repo.
  • Create an entire branch of the repo and do work.
  • Commit to that branch and create a pull request, then abandon that branch.

To do additional work, regardless of where in the project, you have to:

  • Create an entirely new branch.
  • Do whatever work you wish to do that is supposed to be separate from the original work.
  • Commit to the new branch, create the pull request, and then abandon that branch.

"Rinse and repeat" for any additional work you want to do, eventually having a fork with more branches than a Christmas Tree.

This gives rise to several questions:

  1. Is this true?  Do I understand this correctly?
  2. Why?  This seems to be unnecessarily complex and convoluted, especially with a single contributor.

The last and most important question:

  3. How do I clean up my local copy?  Apparently I should have cloned the repo, then created a branch to work in, then created the pull request.  (i.e. Is there a way to take my updated "main", turn it into a branch and then re-create the original main so I can create additional branches to do additional work?)

I hesitate to just "hack at" the existing repo trying to figure things out as I don't want to pollute the original pull request or screw things up on the upstream project.

Thanks!

Jim JR Harris
  • What do you call an `entire branch` ? – Ôrel Feb 17 '22 at 09:11
  • Entire branch = whatever you get when you create a branch from your local fork of a repo. – Jim JR Harris Feb 17 '22 at 10:18
  • I'm afraid I went pretty overboard on this one. :-) Need to clean all this up and get back to working on my book... – torek Feb 19 '22 at 08:13
  • @torek - no you didn't, you're doing wonderfully and I am SOOOOO appreciative of the time and effort you put into this. What I really want to do is pull all these answers out, with formatting, and put them all in one document. – Jim JR Harris Feb 19 '22 at 19:04

4 Answers

Note: this is quite long, but you really need to know these things. I've run out of space (there's a 30k limit on characters) so I'll break this into separate answers. Part 2 is here; part 3 here.

While "pull requests" are not part of Git (they're specific to GitHub [1]), there are some things we can say about them even without referring specifically to GitHub. Then we can plug in GitHub-specific items later. So let's start with this:

  • Git is all about commits. While Git commits contain files, Git isn't really about the files, but rather about the commits. And, while we use branch names to find commits, Git isn't really about branch names either: it's really just about the commits.

  • This means you need to know all about commits: what one is and what each commit, and a string of commits in a row, can do for you.

So we'll start with a quick overview of a commit, and then look at a string of them in a row.


[1] Bitbucket also has "pull requests", but they're very slightly different, and GitLab has "merge requests", which are again same-but-different. All of these build on the same base support in Git proper.


Commits

Each Git commit is numbered. The numbers are not simple sequential counting numbers, though: we don't have commit #1 followed by #2 and #3 and so on. Instead, each commit gets a unique hash ID (unique across all repositories everywhere, even if they're not related to your repository at all [2]) that seems random, but isn't. [3] A hash ID is big, ugly, and impossible for humans to work with: computers can handle them, but our feeble brains become confused. So, below, I'll use fake hash IDs where I just use a single uppercase letter to stand in for a real hash ID. Note that for these hash IDs to work, every part of a commit has to be entirely read-only. That is, once you make a new commit, that commit is frozen in time forever. That particular hash ID, whatever hash ID it got, is for that commit, and no other commit (past, present, or future) can ever use that hash ID.

In any case, each Git commit stores two things:

  1. A commit stores a full snapshot of every file (that Git knew about at the time you, or whoever, made it, anyway). To keep the repository from becoming hugely fat, these files are (a) compressed and (b) de-duplicated. As such, they're stored in a format that only Git can read, and nothing, not even Git itself, can overwrite. As we'll see, this solves some problems but creates one big one.

  2. A commit also stores some metadata, or information about the commit itself. This includes, for instance, the name and email address of the person who made the commit (from their user.name and user.email settings, which they can change any time they like, so it's not reliable without verification, but it is still useful). It includes a log message: when you supply one for your own commits, you should write up an explanation of why you made the commit. What you did—such as change one instance of 7 to 14—is something Git can show on its own, but why did you change 7 to 14? Was it to go from weeks to fortnights, or was it because the 7 Dwarfs were all cloned?

Inside the metadata for a commit, Git adds, for its own purposes, a list of raw hash IDs for previous commits. This list is usually just one element long: for a merge commit (which we won't cover here) it's two elements long, and at least one commit in any non-empty repository is the very first commit, where there aren't any previous commits, so that this list is empty.
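To see these parts directly, here is a minimal sketch that builds a throwaway two-commit repository and prints the raw commit objects. The temporary directory, the file name file.txt, and the Demo identity are assumptions made purely for illustration; the shell variables A and B stand in for the real hash IDs.

```shell
set -eu
# Throwaway repository in a temporary directory (any empty directory would do)
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main   # make sure the first branch is called main
# Hypothetical identity, so the commits do not depend on your own Git configuration
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo one > file.txt
git add file.txt
git commit -q -m "A: first commit"
A=$(git rev-parse HEAD)

echo two > file.txt
git add file.txt
git commit -q -m "B: second commit"
B=$(git rev-parse HEAD)

# The raw commit object: a tree line (the snapshot), a parent line (the previous
# commit's hash ID), author and committer lines, then the log message
git cat-file -p "$B"

# The very first commit has no parent line at all
git cat-file -p "$A"

parent_of_B=$(git rev-parse "$B^")
echo "parent of B is $parent_of_B"
```

The hash IDs printed will differ on every run, because the timestamps in the metadata are part of what gets hashed.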


[2] This is why the hash IDs have to be so big and ugly. They don't, strictly speaking, have to be unique across two repositories that won't ever meet, but Git does not know whether or when two repositories might meet each other in the future, and if two different commits have the same hash ID at that time, bad things happen. I call such a commit a Doppelgänger, a sort of evil twin that's a harbinger of disaster. The actual disaster is, or at least should be, just that the meeting of those two Git repositories fails. In some very old versions of Git, worse things actually did happen, due to bugs. In any case it's just not supposed to happen at all, and the size of the hash helps avoid that.

[3] Current hashes are SHA-1 checksums of all the data in the commit, which includes data about the commits leading up to the commit, hence it's a checksum of the entire history leading up to that point. SHA-1 is no longer cryptographically secure. Though this does not break Git by itself, Git is moving to SHA-256.


Chains of commits

Given the above, we can draw the three commits in a tiny little three-commit repository like this:

A <-B <-C

Commit C is our third and latest-so-far commit. It has some random-looking hash ID, and a snapshot of all the files. One or two files in C differ, probably, from all the files in earlier commit B, and the rest are the same as in B and are therefore literally shared with earlier commit B. So they don't take any actual space. The modified files do take some space, but they're compressed—sometimes very compressed—and might take hardly any space. There's a little space for the commit metadata (which is also compressed, by the way), but overall, this full-snapshot-of-every-file probably doesn't take much space.

Meanwhile, commit C contains the raw hash ID of earlier commit B. We say that C points to B. This means that if Git can find C—we'll see how it can do that in a moment—Git can use the hash ID in C to find B too. Git can then extract, from both commits, all the files in the two snapshots, and compare them. The result of comparing the files is a diff: instructions for changing the files in B into the files in C (or vice versa, if you have the diff done in the other order).

Git, and sites like GitHub, will generally show a commit as a diff, as that's often more useful than showing the raw snapshot. But you can easily get the snapshot instead, if you like: that's sometimes easier for Git than getting the diff. (Because of the de-duplication trick, git diff can quickly skip over files that are the same, but it still has to look at two commits, not just one. So it's kind of mixed as to which is easier.)

Commit B, being a commit, has both snapshot and metadata, and points backwards to still-earlier commit A. But commit A is the first commit, so its metadata doesn't list any earlier commit. That means that all the files in its snapshot are new, by definition. (They'd be compressed and de-duplicated against any files in any other commit, but back then, it was the first commit, so they're only compressed and de-duplicated against themselves. This last means that if the first commit contains 100 identical copies of a big file, there's really only one copy in commit A.)

Branch names and other names

Git needs a fast way to find the last commit in some chain. Git could force us—the humans using Git—to write down the hash ID of the last commit, in this case C. We could save that on paper, or a whiteboard, or something. But that's silly: we have a computer. Why not have the computer save these hash IDs in a file or something? In fact, why not have Git save the most recent hash ID for us?

That's exactly what a branch name is: a place to save the hash ID of the latest commit. Git only needs the latest one, because the latest points back to the second-latest, which points back to a still-earlier one, and so on. This goes on as long as possible, ending only when there is no earlier commit, and that's how Git works: it starts from a commit we tell it about—usually by branch name—and works backwards.

Let's draw a simple chain of commits ending in hash ID H (for Hash), and have the branch name main point to (contain the hash ID of) H:

...--G--H   <-- main

Now let's add a new name, like feature. This name has to point to some existing commit. We could pick G, or H, or some earlier commit, but it seems kind of natural to pick H as it's our latest:

...--G--H   <-- feature, main

Note that Git has lots of kinds of names—not just branch names—and they all do this sort of thing, i.e., point to a commit. So we can make a tag that points to commit H, for instance:

...--G--H   <-- feature, main, tag: v1.0

Mostly, though, we'll just use branch names, and that's all I'll show here for now.
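As a rough illustration of "a name is just a place to store a hash ID", this sketch creates a branch and a tag at the latest commit of a scratch repository and asks Git what each name resolves to. The empty commits (made with --allow-empty) and the Demo identity are assumptions for the demo only.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main
# Hypothetical identity, used only for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git commit -q --allow-empty -m "G"
git commit -q --allow-empty -m "H"

git branch feature   # a new branch name pointing to the current commit, H
git tag v1.0         # a (lightweight) tag name pointing to the same commit

# All three names resolve to the same hash ID
git rev-parse main feature v1.0

# Every name in the repository together with the hash ID it stores
git for-each-ref --format='%(refname) %(objectname)'
```

Creating the names made no new commits: the repository still holds exactly two.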

Doing work on a branch

Git has its own special features for letting us do work. The contents of a commit snapshot are, as we noted earlier, frozen for all time, and only readable by Git itself. So we can't actually work on / with these files, contained in the commit. We have to get Git to extract the files somewhere. That "somewhere" is our working tree or work-tree.

Git also has a very important thing, which Git gives three names: the index, the staging area, and sometimes the cache. We won't cover that here, except to note that when you run git commit, Git actually makes the new commit from the files in Git's index / the-staging-area, not from the files in your working tree. All the files to be committed must be in the staging area: these are the files that Git knows about. Extracting a commit copies the commit's files to the staging area, as well as to the working tree, so that they are there to start with.

In any case, once the files are in your working tree, they are just ordinary files on your computer. They aren't in Git any more. They came out of Git (out of a commit), and you can put them back into Git in a new commit later, but while you do your work, you work on and with files that are not in Git. Only the committed files are in Git.

You do your work with your working-tree files and run git add as usual. (This copies the working tree version of the files you list back into the index, so that they're ready to be committed. It's during the git add stage that Git does the initial compression and de-duplication. The files as seen in Git's index are pre-de-duplicated, in other words. This means the index's copies mostly take no space, except for any files you've changed-and-added. You can add an unchanged file: this is just a mild waste of time as Git will discover that it's a duplicate and just retain the original. It's a waste of cheap computer time, not valuable human time, so feel free to waste it! But if you know some file is enormous and that this will waste your time too, feel free to skip it.)

In any case, now that your new commit is ready, you run git commit. This:

  • gathers any necessary metadata, such as your name and email address and the current date and time;
  • gets the hash ID of the current commit—the one you checked out to fill your working tree (and Git's index) earlier;
  • freezes the index's snapshot; and
  • writes all this out as a new commit, which gets a new, unique hash ID.

If you had:

...--G--H   <-- feature, main

just a moment ago, then your current commit was H, so your new commit—which we'll call I—points back to H:

          I
         /
...--G--H

Git does, however, need to know which branch name you were using to find H. So one of those two names has the special name HEAD "attached to it". Let's say that this name was and still is feature. Then our drawing now looks like this:

          I   <-- feature (HEAD)
         /
...--G--H   <-- main

That is, Git used HEAD to find the name feature, first to find hash ID H, and now to write new hash ID I into feature.

The effect of this is that the current branch name, whatever it is, now points to the new commit you just made. (Note that the snapshot in I used the index / staging-area, which you updated to match your working tree, so all three match now, just like they did when you started with a "clean" checkout or git switch.) If you make another new commit with the usual modify-files-add-and-commit process, you get:

          I--J   <-- feature (HEAD)
         /
...--G--H   <-- main

If you now git switch main or git checkout main, what Git does is:

  • rip out all the commit-J files and replace them with the commit-H files; and
  • attach the special name HEAD to main.

You now have:

          I--J   <-- feature
         /
...--G--H   <-- main (HEAD)

You are on branch main, as git status will say, and your working tree and staging area are "clean" (match the H commit), with your updated files safely saved forever—or for as long as the commit itself lasts—in commit J, which you can find using the name feature.

If you like, you can now create a new branch, such as feature2, and switch to it (using git branch and git switch, or the combined git switch -c to do it all at once):

          I--J   <-- feature
         /
...--G--H   <-- feature2 (HEAD), main

As you make new commits on this new branch, the branch name automatically updates to point to the latest commit:

          I--J   <-- feature
         /
...--G--H   <-- main
         \
          K--L   <-- feature2 (HEAD)

Note that commits up through and including H are, in Git's terms, on all three branches. Commits I-J are currently only on feature and commits K-L are only on feature2. Commit H is the latest commit on main, though it's not the latest commit ever (that's commit L in your repository, at this point). Moreover, there's no direct relationship between commits J and L: they're just cousins, as it were. They are children of children of a common grandparent, H.
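The drawings above can be replayed in a scratch repository. This is only a sketch: a single commit H stands in for the whole ...--G--H history, the commits are empty (--allow-empty) so that no files are needed, and the Demo identity is made up.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main
# Hypothetical identity, used only for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git commit -q --allow-empty -m "H"

git switch -q -c feature            # new name at H, HEAD attached to it
git commit -q --allow-empty -m "I"
git commit -q --allow-empty -m "J"

git switch -q main                  # back to H
git switch -q -c feature2           # another new name at H
git commit -q --allow-empty -m "K"
git commit -q --allow-empty -m "L"

# Draw the graph with every branch name shown
git log --oneline --graph --all

# Commit H is on all three branches
git branch --contains main

# Commits that are on one feature branch but not on main
only_feature=$(git rev-list --count main..feature)
only_feature2=$(git rev-list --count main..feature2)
echo "feature only: $only_feature, feature2 only: $only_feature2"
```

Running git status at the end reports that you are on feature2, since that is the name HEAD is attached to.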

Merging

To understand what's going to happen, we now need to look at the usual, harder case for merging. Git has a shortcut for an easy case, but for various reasons (some good, some less good), GitHub in particular never uses this shortcut. The easy case is easier to see once you understand the more general case anyway.

In Git, using git merge is about combining work. Let's draw the two feature branches without drawing in the name main (it may still exist, it's just in the way of what I want to draw). Let's switch to branch feature first:

          I--J   <-- feature (HEAD)
         /
...--G--H
         \
          K--L   <-- feature2

Our current commit is now J, and we'll find J's files in our working tree right now. We now run git merge feature2, and git merge:

  • locates commit J (easy: just read HEAD and then feature);
  • locates commit L (also easy: feature2 contains the right hash ID);
  • locates the best common starting point commit.

That last part can be hard, although here it's really easy to see that this is commit H: the grandfather of both J and L. If Git now compares the snapshot in H to the snapshot in J, Git will produce a recipe that contains all the work you did on feature:

git diff --find-renames <hash-of-H> <hash-of-J>   # what "we" did

By running a second diff from H to L, Git will produce a recipe that contains all the work done on feature2:

git diff --find-renames <hash-of-H> <hash-of-L>   # what "they" did

It doesn't really matter who did which work, at this point: the only things that matter are which files "we" changed, which ones "they" changed, and what changes we made to each of these files. The two git diffs figure this out.

If Git can combine these two sets of changes on its own, it can then apply the combined changes to the snapshot from H. However you like to look at it, this either preserves our changes and adds theirs, or adds together both changes, or whatever. The end result, Git assumes, is the correct snapshot to store in a new commit.

If Git can't combine these changes on its own, Git will stop in the middle of the merge with a merge conflict. The programmer must now come up with the correct result. We'll skip right over this part. We'll just assume that Git came up with the right result all on its own. In that case git merge goes on to run git commit for you.

Normally, the resulting commit M would have commit J as its parent. Our new merge commit does in fact have J as a parent—the first parent—but also has commit L, the commit we named on the git merge command line, as its second parent, like this:

          I--J
         /    \
...--G--H      M   <-- feature (HEAD)
         \    /
          K--L   <-- feature2

The name feature, to which HEAD is attached, moves as usual to point to new commit M. But since M points backwards to both J and L, commits K-L are now also "on" branch feature. This means all commits up through M are on feature, while feature2 still ends at L and does not contain commits I-J.
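Here is a small sketch of that merge in a scratch repository. The file names a.txt and b.txt, the one-commit-per-branch layout (so J and L each sit directly on top of H), and the Demo identity are assumptions chosen so that the two sets of changes cannot conflict.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main
# Hypothetical identity, used only for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

echo base > base.txt
git add base.txt
git commit -q -m "H"

git switch -q -c feature
echo ours > a.txt
git add a.txt
git commit -q -m "J: add a.txt"
J=$(git rev-parse HEAD)

git switch -q main
git switch -q -c feature2
echo theirs > b.txt
git add b.txt
git commit -q -m "L: add b.txt"
L=$(git rev-parse HEAD)

git switch -q feature

# The best common starting point: commit H
base=$(git merge-base feature feature2)

git diff --find-renames --stat "$base" "$J"   # what "we" did
git diff --find-renames --stat "$base" "$L"   # what "they" did

# Combine both sets of changes into a new merge commit M
git merge -q --no-edit feature2

# M's first parent is J, its second parent is L
git rev-parse HEAD^1 HEAD^2
```

After the merge, both a.txt and b.txt are present in the working tree, and feature2 still points to L.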

We can, if we want, delete the name feature2 now: it's only useful to find L directly, and if we don't feel the need to find L directly, we can find it by looking at the second parent of M, whenever we care. If we'd like to add more commits to feature2 now, we should hang on to the name and do that:

          I--J
         /    \
...--G--H      M   <-- feature
         \    /
          K--L--N--O   <-- feature2 (HEAD)

We can now merge feature2 into feature again if we like:

          I--J
         /    \
...--G--H      M-----P   <-- feature (HEAD)
         \    /     /
          K--L--N--O   <-- feature2

making a sort of duck's head picture, though we could redraw this without the lump along the top row too:

...--G--H--I--J--M-----P   <-- feature (HEAD)
         \      /     /
          K----L--N--O   <-- feature2

(not sure what this one looks like).

Fast-forwarding

The special short-cut case Git has for git merge applies in cases like this one:

...--D--E   <-- main (HEAD)
         \
          F--G   <-- bugfix

If we run git merge bugfix, Git will locate commits E and G, and then find the merge base of E and G: the best commit that's on both branches. But that's commit E itself, i.e., the current commit.

Git could go ahead and diff E against itself, to find no changes. Then it could diff E against G to find their changes. Then it would apply those changes to E and come up with a new commit H, and give it two parents:

...--D--E------H   <-- main (HEAD)
         \    /
          F--G   <-- bugfix

Commit H would be a merge commit, with two parents, just like the "real merge" case. But obviously diffing E against itself is silly, and adding their changes just gets us a commit H whose snapshot exactly matches the snapshot in their commit G. So Git will, for this case, not bother merging at all unless we tell it to.

Instead, Git will do what it calls a fast-forward merge. What that means is that Git simply checks out commit G directly, while dragging the current branch name forward:

...--D--E
         \
          F--G   <-- bugfix, main (HEAD)

There's now no reason to draw the kink in the graph at all:

...--D--E--F--G   <-- bugfix, main (HEAD)

and deleting the name bugfix is obviously safe enough, though presumably main will advance further later.

To suppress the fast-forward-instead-of-merge thing, we would run git merge --no-ff. GitHub effectively always does this, so you won't see fast-forward merges occur on GitHub; but it's good to know about them.
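The difference between the two behaviours can be seen side by side in a scratch repository. In this sketch the extra branch name no-ff-demo is a made-up name that marks commit E, so that the same merge can be tried both ways; the empty commits and the Demo identity are also assumptions.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main
# Hypothetical identity, used only for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git commit -q --allow-empty -m "E"
E=$(git rev-parse HEAD)

git switch -q -c bugfix
git commit -q --allow-empty -m "F"
git commit -q --allow-empty -m "G"

git switch -q main
git branch no-ff-demo          # second name at E, for the comparison

# Default behaviour: no new commit, main simply slides forward to G
git merge -q bugfix

# Forced real merge: a new commit with two parents, E and G
git switch -q no-ff-demo
git merge -q --no-ff --no-edit bugfix

git log --oneline --graph --all
```

In the graph, main and bugfix share one commit, while no-ff-demo points to a separate merge commit sitting on top of both E and G.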

When to delete a name

When and whether to delete the other branch name is up to the user. Note that deleting the name does not delete the commits: it only makes it harder to find them. But there is another thing to know. Suppose we have:

...--G--H   <-- main
         \
          I--J   <-- bugfix (HEAD)

where commits I and J simply don't actually work. You'll run:

git switch main
git branch -d --force bugfix

to discard your attempt to fix the bug. This leaves you with:

...--G--H   <-- main
         \
          I--J   ???

Commits I-J still exist, but unless you wrote down J's hash ID, you may never be able to find commit J again.

Git will—eventually—detect that commit J is unreachable (that there's no way for you to find it) and will delete it for real. The same goes for commit I once J is gone. You get a grace period, normally at least 30 days, during which Git won't do this, and various Git commands to help find accidentally-lost commits. But if you don't bother finding them and adding a name back, the "reflog entries" by which Git keeps track of "lost" commits like this eventually expire, and then—when Git gets around to doing its maintenance and janitorial work—the "lost" commits will really go away from this repository. So, while commits are read-only, they are only "mostly permanent". They remain in your repository as long as you can find them (and then a little bit longer).
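A short sketch of losing and rescuing commits this way, in a scratch repository. Saving J's hash ID in a shell variable plays the role of "writing it down"; the branch name bugfix-rescued, the empty commits, and the Demo identity are all made up for the demo.

```shell
set -eu
cd "$(mktemp -d)"
git init -q
git symbolic-ref HEAD refs/heads/main
# Hypothetical identity, used only for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git commit -q --allow-empty -m "H"
git switch -q -c bugfix
git commit -q --allow-empty -m "I"
git commit -q --allow-empty -m "J"
J=$(git rev-parse HEAD)          # "write down" the hash ID of J

git switch -q main
git branch -q -d --force bugfix  # the name is gone...

git cat-file -t "$J"             # ...but the commit object is still there

# HEAD's reflog still remembers the commits we had checked out
git reflog

# Attaching a new name makes I-J easy to find again
git branch bugfix-rescued "$J"
git log --oneline bugfix-rescued
```

Without the saved hash ID, the git reflog output is where you would look for J during the grace period.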

Clones, remotes, and multiple repositories

Git is not just a Version Control System (VCS); it's a Distributed VCS (DVCS). The way Git does this distribution is to allow for—or rather, strongly encourage—many copies of a repository to exist. As such, a Git repository is:

  • a collection of commits and other Git objects, some or all of which may be in other repositories too; and
  • a collection of names, such as branch and tag names, that help you (and Git) find the commits and other internal objects.

These are stored as two simple key-value databases. The keys in the names database are branch names like refs/heads/main, tag names like refs/tags/v1.2, and many other kinds of names. Each name lives in a namespace under refs/. Each name stores exactly one hash ID.

The keys in the objects database are hash IDs. Each object in this database has some Git internal object type (commit, tree, blob, or annotated tag). The commit objects, along with supporting tree and blob objects, wind up storing your files; and you will mostly just work with the commits and don't normally have to care much at all about these details.

Since commit hash IDs are globally unique, the object database keys in your clone of some repository are the same as the keys in every other clone of that same repository. When you clone a repository, you get all, or almost all, of their commits and supporting objects. But the names database in your clone is entirely separate from theirs.

What this means is that a clone of a repository starts out with no branch names at all. You run:

git clone <url>

or:

git clone -b <branch> <url>

and your Git software creates a new, totally-empty Git repository to start. Your Git software, using your Git repository (I like to shorten this to "your Git") calls up their Git software and points it to their Git repository ("their Git"). Their Git lists out all their branch and tag and other names and the hash IDs that go with them, and your Git then asks for the objects it would like to copy (normally, all of them). For each commit you're going to get, their Git is obligated to offer all of that commit's parents, and the parents' parents, and so on. So you end up copying every commit into your Git.

Now that you have all the commits (and supporting objects), your Git takes each of their branch names and renames them. This renaming process makes use of the concept of a "remote".

A remote, in Git, is just a short name that stores at least a URL (you can have it store various extra features later). The URL is the one you type into git clone, and the name of the first "remote" is always origin. [4] So origin from now on means the URL I cloned from, unless and until you change something.

Git uses this name (the origin string) to make up new names for their branch names. Their main becomes your origin/main; their debug becomes your origin/debug; if they have a feature/tall, you get an origin/feature/tall; and so on. These names are not actually branch names; I like to call them remote-tracking names. [5] Their function is to remember, for your Git repository, what their branch names are, and what commit each of those names selected, the last time your Git got an update from their Git.

Once this renaming is done, your Git has created remote-tracking names for every branch name they have. You have all of their commits, and can find all of them because your remote-tracking names hold the same hash IDs as their branch names, that they're using to find their commits.

Now, shortly before your git clone finishes and returns control to you so that you can begin working, your Git:

  • Creates one new branch name in your repository, from the -b argument you gave: if you said -b bugfix, your Git finds your origin/bugfix which corresponds to their bugfix and creates your own bugfix, pointing to the same commit.
  • Checks out (switches to) this new branch.

So now your clone has one branch in it, matching one of their branches. If you don't use -b, your Git asks their Git what name they recommend. The usual standard recommendation is their main branch (now normally main; in the past this was master).
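The renaming is easy to observe with two local directories. In this sketch the directory names theirs and mine are made up, a plain filesystem path stands in for the URL, and the Demo identity is hypothetical.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
# Hypothetical identity, used only for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

# "Their" repository, with three branch names
git init -q theirs
git -C theirs symbolic-ref HEAD refs/heads/main
git -C theirs commit -q --allow-empty -m "first"
git -C theirs branch debug
git -C theirs branch feature/tall

# "Our" clone of it
git clone -q theirs mine
cd mine

git remote -v     # the remote named origin and the path we cloned from
git branch        # our only branch: main
git branch -r     # remote-tracking names: origin/main, origin/debug, origin/feature/tall

local_branches=$(git for-each-ref --format='%(refname:short)' refs/heads)
echo "local branches: $local_branches"
```

All of their commits are in the clone, but only one of their branch names became a branch of ours; the rest are remembered as origin/* names.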

Once you have a clone, you can add more remotes, using git remote add. This needs a name for the remote, and a URL; it sets up the remote but does not yet run git fetch. It's time now to talk about fetching and pushing; see the other answer.


[4] You can choose some other name, but there's almost never any point to doing so. Use origin as the name of the "main remote". You can rename a remote at any point, so even if you don't intend to keep the starting URL, it works fine to let git clone default to origin here.

[5] Git calls them remote-tracking branch names, beating the poor overloaded word branch from bloody, misshapen beast to barely-recognizable-splotch. Seriously, just drop the word branch here, it doesn't help any.

torek
  • Are you sure you're not the guy who wrote _The Pro Git Book_ (which is on my list to read, and climbing toward the top)? Suggestion. Edit the top answer, (and maybe the others), with an index to the other parts so that other people, as clueless as I, can find and take advantage of your work. Maybe, you can post this as an article in its own right, here on SO? – Jim JR Harris Feb 19 '22 at 18:43
  • No, that's some other guy named Chacon. :-) I don't have any special control over SO though, not sure how I could index this. – torek Feb 20 '22 at 01:09
  • Sorry, I mis-spoke. What I meant to suggest was links at the top of each part pointing people to the other two parts to make them easy to find. – Jim JR Harris Feb 20 '22 at 15:56
  • Is there a possibility that I could get a copy of the "raw" source for each of these parts? I'd love to put them in a markdown editor and combine the parts into one document - then publish it back here as a markdown document and/or a PDF for people to download and read - maybe someone here will even pin it? – Jim JR Harris Feb 20 '22 at 16:00
  • @JimJRHarris I put in the cross-links, although part 2 doesn't go forward to part 3. Getting the raw markdown isn't too hard (click edit, snarf-and-barf the entire thing, click cancel) but getting that to you is harder... maybe you can get it via the edit function? (You don't have to edit it, just copy, then cancel) – torek Feb 22 '22 at 22:17

Part 2—see part 1

git fetch

To run git fetch, you pick a remote and invoke it as git fetch remote. If you leave out the remote name, Git will pick a remote from somewhere, or try the default name origin, depending on a lot of configuration items. If you only have the one single standard remote named origin, running git fetch with no additional arguments is fine: there's nothing else you could mean anyway.

What fetch does is:

  • call up whatever Git software answers the stored URL;
  • have them list out all their names (branches, tags, and others) and corresponding hash IDs; and
  • obtain, from them, any commits they have that you don't.

Note that this is the same action we had for git clone, except that instead of "get all their commits", it's now "get the commits they have that we don't". Since commits have globally unique IDs, we can easily tell that we have (say) commit a123456 because we have some object with ID a123456, and that we lack—and therefore need—b789abc because we have no such ID. Having obtained their new-to-us commits, our Git now updates our corresponding remote-tracking names.
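A minimal sketch of this, again using two local directories (theirs and mine are made-up names, and the Demo identity is hypothetical): they add a commit after we cloned, and git fetch brings it over and moves our origin/main, while our own main stays where it was.

```shell
set -eu
work=$(mktemp -d)
cd "$work"
# Hypothetical identity, used only for this demo
export GIT_AUTHOR_NAME=Demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=Demo GIT_COMMITTER_EMAIL=demo@example.com

git init -q theirs
git -C theirs symbolic-ref HEAD refs/heads/main
git -C theirs commit -q --allow-empty -m "first"

git clone -q theirs mine

# They move ahead after we cloned
git -C theirs commit -q --allow-empty -m "second"

cd mine
before=$(git rev-parse origin/main)
git fetch -q origin
after=$(git rev-parse origin/main)

# The commit we just obtained: on origin/main, not yet on our main
git log --oneline main..origin/main

echo "origin/main moved from $before to $after"
echo "our main is still at $(git rev-parse main)"
```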

In other words, git fetch does pretty much the same thing as git clone, except that our Git repository already exists, we may get a lot less data, and we don't have a final "create a branch and check it out" step. Since we can have more than one remote, we can run:

git fetch origin

and update all our origin/* names, and then run:

git fetch upstream

and update all our upstream/* names, if we've used git remote add to add a second remote named upstream.

To update all our remotes at once, we can use git fetch --all or git remote update; both do essentially the same thing. Note that --all to git fetch means all remotes, not all branches: we already get all branches. (I mention this because people keep thinking --all means all branches and it never does.)

We can, if we want, limit our git fetch like this:

git fetch origin main

This has our Git call up their Git as usual and list things out, but this time, our Git only bothers asking for any new-to-us commits they have on their main. When everything is done, our Git then updates our origin/main (we know where origin's main is now, so our corresponding remote-tracking name, i.e., origin/main, can be updated). If they have new commits on their dev, we don't get them, and we don't update our origin/dev; our Git was told to bother only with main.
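To see the limited fetch in action, here is a small local experiment (a sketch only: the server and laptop directory names, the demo identity, and the commit helper are assumptions, and a plain directory plays the part of the other Git).

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q server
(cd server && commit G && git branch -M main && git branch dev)
git clone -q server laptop

# New commits appear on both of the server's branches.
(cd server && commit H && git switch -q dev && commit X)

cd laptop
dev_before=$(git rev-parse origin/dev)
git fetch -q origin main               # only ask about their main
main_after=$(git rev-parse origin/main)
dev_after=$(git rev-parse origin/dev)
echo "origin/main: $main_after"
echo "origin/dev:  $dev_after (was $dev_before)"
```

After the limited fetch, origin/main matches the server's main, while origin/dev still holds the old hash even though the server's dev has moved.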

In some (rare) setups, this sort of thing can save a lot of data transfers. Git therefore offers something called a single-branch clone, in which git fetch does this by default. This is where people try to use --all (and it doesn't work): to fetch other branches from a single-branch clone, you must either add them—see the git remote documentation—or use an explicit refspec. We won't cover refspecs properly here, for space reasons, though.

Since you will have two remotes, one for your GitHub fork and one for the GitHub repository that you forked, you'll want to run git fetch twice, or use git remote update or git fetch --all now and then. Other than that—and having upstream/*, if you called the second remote upstream as most do—your repository is still just like any other repository.

git push

The git push command is very much like git fetch, with several key differences:

  • First, of course, git push means send stuff. You use git fetch to get new commits (and other internal objects) from some other Git (some other software working with some other repository). You use git push to send new commits to some other Git; these are often ones you made, but they can be ones you just got from upstream, for instance.

  • Second, once you've sent these commits, you are typically going to ask the other Git to set one of its branch names. There is no such thing, on the push side, like a remote-tracking name.

That last part means that you have to have permission to write to the repository. Git itself has no real access controls at all, but most web hosting sites, including GitHub, add theirs on. GitHub in particular add a lot of fancy controls here. Whether you and/or anyone else make use of them is up to you and them.

To do a git push, you typically run a simple:

git push <remote> <name>

This says that you'd like your Git to look at commits on your branch named name, find which ones are new to the other Git at remote, send them to that Git, and then ask them, politely, if they would, pretty please, set their name name to point to the same commit that your name points to.

In other words, you are asking them to create or update their branch with the same name as your branch. In general, they will accept this if and only if this simply adds on to their branch (and you have permissions of course). That is, when we had:

...--G--H   <-- main (HEAD), origin/main

because our main matched origin's main, and we added a new commit or two:

          I--J   <-- main (HEAD)
         /
...--G--H   <-- origin/main

and we run git push origin main, our Git calls up their Git, sends them commits I-J, and asks them to set their main to point to J.

If their main still points to H—or somehow, points back to G because someone made them drop H—they'll happily accept our request to add on to their main. Since our Git sees their acceptance, we end up with:

...--G--H--I--J   <-- main, origin/main

knowing that origin's main now points to commit J.

But suppose someone else came along and added some commit K to their main:

...--G--H--K   <-- main [over on origin]

Our request will now ask them to ditch their commit K, which would leave them with this:

...--G--H--I--J   <-- main
         \
          K   ???

They will say no, and the error message you will get is not a fast-forward (remember those from merges? this is the same idea).
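The rejection can be reproduced locally. In this sketch, a bare repository plays the role of origin, and two hypothetical working repositories, alice and bob, play "us" and "someone else"; the names, the demo identity, and the commit helper are illustrative assumptions.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q --bare origin.git
git init -q alice
(cd alice && git remote add origin ../origin.git &&
  commit H && git branch -M main && git push -q origin main)

# Someone else adds commit K to origin's main.
git clone -q alice bob
(cd bob && commit K && git push -q ../origin.git main)

# Meanwhile we add I and J on top of H and try an ordinary push.
cd alice
commit I; commit J
if git push -q origin main 2>/dev/null; then result=accepted; else result=rejected; fi
echo "plain push: $result"
```

The push is refused because accepting it would drop K from origin's main; origin's main stays exactly where bob left it.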

You can, using --force or --force-with-lease, try to get them to take the change, losing their new commits, but usually that's the wrong thing to do. For your usage of GitHub, however, sometimes this is the right thing to do on your fork! We'll get back to this later.

There is also a way to delete a name using git push. In fact, there are several, but the clearest is probably git push --delete remote branch: git push --delete origin foobranch would discard foobranch over on origin. This has no effect on your laptop repository.
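A quick local check of the delete case, with a bare repository standing in for origin (directory names, identity, and the commit helper are assumptions made for the demo):

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q --bare origin.git
git init -q laptop && cd laptop
git remote add origin ../origin.git
commit H && git branch -M main
git branch foobranch
git push -q origin main foobranch

# Ask origin to discard its foobranch (same as: git push origin :foobranch).
git push -q --delete origin foobranch

echo "branches on origin:"; git --git-dir=../origin.git branch
echo "branches on laptop:"; git branch
```

The name disappears on origin only; the laptop's own foobranch is untouched.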

GitHub "forks"

We have enough background now to define how GitHub's FORK button works. You pick some existing repository that isn't one of your own and click it, and GitHub will make a new repository on GitHub that is your own. This GitHub "fork" is a kind of a clone, but with several added features and one change.

The change is obvious now that you know that git clone copies no branches. When you use GitHub's fork button, it copies all branches. Your new clone has the same set of commits and branches as the original, vs a regular clone-to-your-laptop, which gets all the commits but none of the branches, and then makes one new branch that accidentally-on-purpose exactly matches one of origin's branches. The fork button makes all the branch names in your fork exactly match all of the other repository's branches.

The added features include the idea of making pull requests, which we'll come back to in a moment. On GitHub's side—not visible to you, but very important to GitHub themselves—the added features include not using any space to hold the commits: your fork simply re-uses the commits from the original. No commit can ever change, so this is fine; the only problem that can happen is if a commit were to be deleted, so GitHub simply arrange for commits never to be deleted.1

Once you make the fork, though, those branch names, on GitHub in your fork, do not update any more, until and unless you do it. You can do some from GitHub's web interface (e.g., you can delete a branch name), or you can use git push from your laptop as usual.

Hence once you do have a fork, you will want to clone that fork to your laptop, and then, on the laptop, add a second URL going to the repository you forked. The standard GitHub way of naming this second URL is to use the remote name upstream. I personally dislike this name as the word upstream already has several meanings in Git,2 but to run with it, if you've forked ssh://github.com/them/repo.git to ssh://github.com/you/repo.git, you'd run:

git clone ssh://github.com/you/repo.git
cd repo
git remote add upstream ssh://github.com/them/repo.git
git fetch upstream

You now have origin/* and upstream/* names. We now come to one of the handy tricks.


1This means that if someone accidentally puts a password on GitHub, it is potentially there forever, even if they quickly force-push to hide it. GitHub support can purge commits "for real", but in general, always consider any secret that was accidentally exposed even for an instant to be forever compromised.

2So, better than the word branch at least.


Handy trick: updating your fork

Having run git fetch upstream or git remote update so that your upstream/* names are all updated, you might want to make your own fork have all of their updates under the same branch names. That means that for each upstream/whatever, you want to run:

git push origin upstream/whatever:whatever

This kind of git push uses a refspec, where we put a "source" name on the left, then a colon, and then have the "destination" name on the right. Git will pick the commits up from the given source (our local upstream/whatever remote-tracking name), but when they get to the destination (origin), ask the destination to set their destination-side name (their whatever).

You can do this with a loop, but there's a shorter way. Note that you may need to protect the * character from your shell, depending on your particular command line interpreter:

git push origin "refs/remotes/upstream/*:refs/heads/*"

I've assumed you need double quotes to get the right protection. If you cannot use double quotes, use whatever quoting mechanism is needed (which may be none at all).

Here, we have spelled out the names in full: remote-tracking names live in the refs/remotes/ name-space, while branch names live in refs/heads/. Git matches up the two stars and does a regular (non-forced) git push of each branch here.
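Here is a local sketch of that wildcard push. A bare clone named you.git stands in for the GitHub fork and a directory named them for the repository that was forked; both names, the identity, and the commit helper are assumptions. One extra caution is built in: if a refs/remotes/upstream/HEAD name exists (some Git versions create one on fetch), the wildcard would match it too, so the script removes it first.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q them
(cd them && commit G && git branch -M main && git branch dev)
git clone -q --bare them you.git          # like a fork: all branches copied

# The original repository moves on; the fork does not follow by itself.
(cd them && commit H && git switch -q dev && commit X)

git init -q laptop && cd laptop
git remote add origin ../you.git
git remote add upstream ../them
git fetch -q --all                         # all remotes, not "all branches"

# Keep the wildcard from matching an upstream/HEAD name, if one exists.
git remote set-head upstream --delete 2>/dev/null || true

# Regular (non-forced) push of every upstream/* name to the fork's branches.
git push -q origin "refs/remotes/upstream/*:refs/heads/*"
git --git-dir=../you.git branch -v
```

Note that the laptop repository never needed a local main or dev for this: the commits travel straight from the upstream/* remote-tracking names to the fork.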

You can make a Git alias that does this git push, to avoid having to type a long command and to avoid having to quote the refspec (a simple Git alias does not pass through a shell):

[alias]
    up2hub = push origin refs/remotes/upstream/*:refs/heads/*

Note that this embeds the names upstream and origin, but now you can run:

git up2hub

after git fetch upstream succeeds, to update your GitHub branches.

Pull requests

Now we finally get to the heart of the problem: how pull requests work. When you use the CREATE PULL REQUEST button on GitHub, you pick two things, although GitHub will default one of them for you:

  • a branch name in your fork; and
  • a "base branch" in the other GitHub repository3 against which you wish to make this PR.

GitHub will now run a "test merge", where they try doing a regular git merge of the set of commits that are on your branch, as imported to their repository by the Pull Request, to the current tip commit of their branch. That is, GitHub take each of your commits that you have in your fork and they don't have in their repository at all, and copy those commits to their repository.4

They will now be able to find your commits under the pull-request's full name, which on GitHub is refs/pull/number/head. The test merge, if it works, will create a new commit and make it reachable via refs/pull/number/merge. If it fails with a merge conflict, the PR still gets made, it just has no refs/pull/number/merge name.


3You can make pull requests into a shared access repository, where you and others are all pushing to a single repository rather than each to their individual forks. In that case you would pick this repository itself as the "other" Git repository. But that's just a special case, in which the "other" repository is "this one".

4This all happens virtually, to save disk space: their repository has a link back to yours so there's no actual copying, just like you didn't really copy their commits when you forked their repo. Again, Git gets to use the fact that every commit has a unique hash ID: your commits, with your hash IDs, are guaranteed to have different hash IDs from all of their commits. So when their Git software tries to find commit fee1cab or whatever the hash ID is, and can't find it, they can just look over in your connected fork, and there it is. Your fork refers back to their repo, and their repo refers back to your fork, in a sort of incestuous loop.


So what does this all mean?

Well, let's look at a classic example. You fork some repository, and clone it to your laptop and make a new branch:

...--G--H   <-- main, my-feature-1, origin/main

You make a couple of new commits on your my-feature-1 branch:

          I--J   <-- my-feature-1 (HEAD)
         /
...--G--H   <-- main, origin/main

You send these commits to your GitHub fork:

          I--J   <-- my-feature-1 [on your fork]
         /
...--G--H   <-- main [on your fork]

You then click the buttons to make a PR, and in their repository they now have:

            I--J   <-- refs/pull/123/head
           /    \
          /      M   <-- refs/pull/123/merge
         /      /
...--G--H---K--L   <-- main

Commit M is GitHub's "test merge", which worked; commits K-L are the new commits they made while you were fussing about with your fork and your laptop.

If you now go on to make:

          I--J   <-- my-feature-1, my-feature-2 (HEAD)
         /
...--G--H   <-- main, origin/main

on your laptop, then make two more commits, you get:

               N--O   <-- my-feature-2 (HEAD)
              /
          I--J   <-- my-feature-1
         /
...--G--H   <-- main, origin/main

You can git push these to your GitHub fork, so that it has:

               N--O   <-- my-feature-2 [on your fork]
              /
          I--J   <-- my-feature-1 [on your fork]
         /
...--G--H   <-- main [on your fork]

If you now use this my-feature-2 name to make a PR on GitHub, that PR contains commits I-J-N-O that aren't in their main yet, since they have not yet decided what to do with PR#123:

                 N--O   <-- refs/pull/124/head
                /
            I--J   <-- refs/pull/123/head
           /    \
          /      M   <-- refs/pull/123/merge
         /      /
...--G--H---K--L   <-- main

(plus, perhaps, a test merge, if they were able to merge O with L).

If that's not what you wanted, what you should have put onto GitHub was this:

          I--J   <-- my-feature-1 [on your fork]
         /
...--G--H   <-- main [on your fork]
         \
          N--O   <-- my-feature-2 [on your fork]

Your my-feature-2 now contains only two commits that are not in their main, so that PR#124 makes their repository look like this:

            I--J   <-- refs/pull/123/head
           /    \
          /      M   <-- refs/pull/123/merge
         /      /
...--G--H---K--L   <-- main
         \      \
          \      P   <-- refs/pull/124/merge
           \    /
            N--O   <-- refs/pull/124/head
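On the laptop, the difference comes down to where the second branch starts. The sketch below (one throwaway repository; the branch name my-feature-3, the commits X and Y, the identity, and the commit helper are made up for the demo) first creates my-feature-2 from main, then shows one way to repair a branch that was accidentally stacked on my-feature-1, using git rebase --onto.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
# Hypothetical helper: one new file per commit, so nothing ever conflicts.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q repo && cd repo
commit G; commit H; git branch -M main
git switch -q -c my-feature-1; commit I; commit J

# Independent work: start from main, not from my-feature-1.
git switch -q -c my-feature-2 main; commit N; commit O
independent=$(git log --reverse --format=%s main..my-feature-2 | tr '\n' ' ')
echo "my-feature-2 would offer: $independent"

# Accidentally stacked work: my-feature-3 starts at J, so it carries I-J along.
git switch -q -c my-feature-3 my-feature-1; commit X; commit Y
stacked=$(git rev-list --count main..my-feature-3)
echo "my-feature-3 would offer $stacked commits"

# Repair: copy only the commits after my-feature-1 so that they sit on main.
git rebase -q --onto main my-feature-1 my-feature-3
repaired=$(git rev-list --count main..my-feature-3)
echo "after rebase --onto it offers $repaired commits"
```

The main..branch listing is a reasonable preview of what a PR from that branch against main would contain. If the stacked branch had already been pushed, the repaired version would need a forced push, as covered later.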
torek
2

Part 3: Your next stumbling block: rebase

(Part 1; part 2)

When they, whoever they are, get around to reviewing your PR, they might:

  • accept it as is;
  • ask you to fix something in it;
  • take it with changes they make; or
  • reject it entirely.

The last one doesn't really need much more discussion here, but the other three do.

If they take your commits "as is", they have the choice of using three different "accept this PR" MERGE buttons, on GitHub:

  • MERGE. This one is straightforward.
  • REBASE AND MERGE. This one is less straightforward.
  • SQUASH AND MERGE. This one requires understanding Git's "squash merge", which is not a merge at all.

If they make any changes themselves, that's a lot like the REBASE AND MERGE case, as we'll see. If they want you to make changes, you will need to use git rebase on your laptop, after which you will need to use git push --force or git push --force-with-lease to update your GitHub repository.5


5Technically, you could delete and re-create your branch, instead of force-pushing. I think this kills off the existing PR, though (I have not tried it). In any case the force-push option is what people use in practice.


Rebase is about copying commits

We use git rebase to copy existing commits, which have something we like and something we don't, to new and (supposedly anyway) improved commits that we use instead of the originals.

Before we look at that, let's look at the command that copies one commit. As we know, no commit can ever be changed, but we can always extract a commit, or turn it into a diff, or whatever. We can use this property to take some existing commit, turn it into changes, and apply those changes again, or elsewhere. Git calls this operation cherry-picking, and uses the command git cherry-pick to do it.

Typically we might use git cherry-pick as a quick way to get a one-off copy of a commit for some reason. For instance, perhaps someone had a good idea, or a bug fix, that we need right now on our branch, and we'll figure out how to deal with the mess this is going to make later. Here we are on our branch:

...--J--K   <-- feature (HEAD)

Meanwhile over on their branch, they fixed some nasty little bug that's biting us:

...--P--C--R--S   <-- theirs

(we'll call the fixing-commit after P, the "parent" commit, C, the "child", here, rather than the letter Q I'd normally use). We run:

git cherry-pick <hash-of-C>

to tell Git: Go figure out what whoever-it-is did in commit C, the child of P, by comparing P and C to see what changed. Then make the same change here, on my commit K, and make a new commit out of that. The resulting graph looks like this:

...--J--K--C'  <-- feature (HEAD)

Git will list them as the author of commit C', and re-use their commit message; we'll be listed as the committer of C'. (Most of the time, the author and committer are the same person, but not necessarily; a copy like this normally keeps them as author, and also hangs on to their author date-and-time stamp as well.)

The way Git actually implements this "copy their change" is that Git has to figure out which files to touch and where those lines have gone. To do that, Git does a diff from P to C to see what work they did, and then a second diff from P to K to see what we did. Recall how git merge works: this is the heart of a merge: combine two diffs and apply the combined diff to the merge base. Git just forces the "base" to be commit P, regardless of all else. Our commit is commit K, and theirs is commit C, and that's all there is to it—except that when Git commits the merge, it makes it as an ordinary single-parent commit. Commit C' refers back to K only, not to P or C.

The end result, if all goes well—if there is no merge conflict—is that we have the changes from commit C, but applied here, at commit K. The new commit C' is thus a copy of C in that it:

  • has some of the same metadata as C: specifically, the log message and authorship; and
  • has the same diff, except for anything that got added or removed if we had to resolve a merge conflict.
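A small experiment makes the "copy" properties visible. This is only a sketch in a scratch repository: the author "Them", the file fix.txt, the demo identity, and the commit helper are all invented for the illustration.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q repo && cd repo
commit P; git branch -M main

# Their branch: C (the fix, authored by someone else), then R.
git switch -q -c theirs
echo fix > fix.txt; git add fix.txt
git commit -q -m "C: fix the bug" --author="Them <them@example.com>"
commit R

# Our branch: K, then a copy of C on top of it.
git switch -q -c feature main
commit K
k=$(git rev-parse HEAD)
git cherry-pick theirs~1 >/dev/null        # theirs~1 is commit C

author=$(git log -1 --format=%an)
committer=$(git log -1 --format=%cn)
parents=$(git log -1 --format=%P)
echo "C' author=$author committer=$committer"
```

The copy keeps their name as author and their log message, lists us as committer, has K as its only parent, and gets a hash ID different from C's.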

With git rebase, we can:

  • take a series of existing commits and simply move them, or
  • using interactive rebase, do all kinds of extra fiddling.

Interactive rebase is a whole big blog-post or article of its own, and we won't go into the complete details here. We'll just look at a simple rebase-to-move job. We start with, e.g.:

          I--J   <-- feature (HEAD), origin/feature
         /
...--G--H--K--L   <-- upstream/main

Let's say, at this point, that we like everything about I-J except that they don't come after K-L. Let's add a further problem: there's a merge conflict, e.g., because one of the lines we touch in J abuts one of the lines we (or they) touch in L. This merge conflict, or thing that Git sees as a conflict, is really easy to fix, but Git won't do it, so we have to do it ourselves.

At this point, we just run:

git rebase upstream/main

Note that we don't have to use a branch name here, and we don't even have to update origin: we can rebase on any commit we have, using any name for that commit. The name upstream/main finds commit L, and that's the commit we want to copy I and J to come-after, so that's the name we give to git rebase here.

Git will, internally, save away the raw hash IDs of commits I and J, which are the commits to be copied. (How Git knows that—and how we can change what gets copied, to where—is answered elsewhere, but note that our current branch name points to J and commits I-J are the only ones that are only reachable from the current branch.) Then, Git will switch to commit L—the one we named—in what Git calls detached HEAD mode. Here there is no current branch at all: HEAD points directly to the current commit. So now we have this:

          I--J   <-- feature, origin/feature
         /
...--G--H--K--L   <-- upstream/main, HEAD

Git now does one cherry-pick to copy I. This works, and Git makes a new commit I':

          I--J   <-- feature, origin/feature
         /
...--G--H--K--L   <-- upstream/main
               \
                I'  <-- HEAD

Git now attempts a second cherry-pick to copy J. This one fails with a merge conflict. We resolve the conflict by opening the conflicted file in an editor, putting the correct merge result into the file, and writing the file back to our working tree, then running git add on that one file. Then we use:

git rebase --continue

to make Git resume committing-and-copying-and-so-on. Git makes the commit for J' (the copy, with our resolution in place, as git add-ed):

          I--J   <-- feature, origin/feature
         /
...--G--H--K--L   <-- upstream/main
               \
                I'-J'  <-- HEAD

and if there were more commits to copy, Git would try cherry-picking the next one. As it is, though, there's nothing left to copy, so git rebase does its final operations:

  • rebase yanks the name feature to point here, wherever HEAD is; and
  • rebase re-attaches HEAD

so that we now have:

          I--J   <-- origin/feature
         /
...--G--H--K--L   <-- upstream/main
               \
                I'-J'  <-- feature (HEAD)

The original I-J commits still exist. If you wrote down their hash IDs (or still have them on your screen) you can still see them. Using the name origin/feature, you can still see them. But if you use the name feature to find commits, you'll find the new copies instead of the originals.
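The whole sequence, conflict included, can be replayed in a scratch repository. In this sketch a local main stands in for upstream/main, there is a single upstream commit L (no K), and the file contents, identity, and commit helper are invented; the conflict is forced by having J and L change the same line.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
# Hypothetical helper: one new file per commit.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q repo && cd repo
printf 'one\ntwo\n' > file.txt; git add file.txt; git commit -q -m H
git branch -M main

git switch -q -c feature
commit I                                           # touches only I.txt
printf 'one\ntwo (feature)\n' > file.txt; git commit -q -am J
old_tip=$(git rev-parse feature)                   # original J

git switch -q main
printf 'one\ntwo (main)\n' > file.txt; git commit -q -am L

git switch -q feature
# Copying I works; copying J stops with a conflict in file.txt.
git rebase main >/dev/null 2>&1 || echo "conflict while copying J"

# Resolve by hand, stage the result, and let rebase finish.
printf 'one\ntwo (main and feature)\n' > file.txt
git add file.txt
GIT_EDITOR=true git rebase --continue >/dev/null   # keep J's message as is

git log --oneline --graph feature "$old_tip"
```

The final graph shows both the copies (reachable from feature) and the originals (reachable only through the saved hash ID), matching the picture above.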

Updating a pull request

To update our existing pull request (PR#125 perhaps, feature into a branch name in the repository from which we forked), we simply tell GitHub to take these two new commits. A plain:

git push origin feature

won't work, because the Git over at GitHub will object: Hey, if I update my feature, I'll lose these two very valuable I-J commits! Not a fast forward! Rejected! We must force it to update, to use the new replacement I'-J'.6 So we run:

git push --force-with-lease origin feature

or the shorter --force variant. (The --force-with-lease one adds some error checking, and is a good idea, but still feels newfangled and is klunky to type for me, at least. That's one of the disadvantages of having used Git for 15+ years at this point.) We tell GitHub we really mean it, and they take the new commits.

Since there's an open PR referring to the name feature, GitHub will, at this point, try the merge again. The last time they tried this merge, there was a merge conflict with commit L. This time, we add on to their commit L, so there's no merge conflict, and the PR might get accepted as-is.
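The push side of this can be tried locally as well. In the sketch below a bare repository named fork.git plays the GitHub fork, and a local main plays the upstream branch; the names, identity, and commit helper are assumptions, and there is no actual PR or test merge involved, only the branch update.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
# Hypothetical helper: one new file per commit, so nothing ever conflicts.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q --bare fork.git
git init -q laptop && cd laptop
git remote add origin ../fork.git
commit H; git branch -M main
git switch -q -c feature; commit I; commit J
git push -q origin main feature            # fork now has I-J on feature

git switch -q main; commit L               # the base branch moves on
git switch -q feature; git rebase -q main  # I-J become I'-J'

# A plain push would make the fork give up I-J, so it is refused.
if git push -q origin feature 2>/dev/null; then plain=accepted; else plain=rejected; fi
echo "plain push: $plain"

# With a lease: replace their feature, provided it is still where
# our origin/feature says it was.
git push -q --force-with-lease origin feature
```

The lease check is what distinguishes this from a bare --force: had someone else updated the fork's feature in the meantime, the forced push would have been refused as well.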

If we need to make additional changes, we can use the fancy interactive rebase, or do extra commits and then squash, or whatever we like.


6It sure would be nice if Git knew that these were evolved replacements.


Squashing

Git offers something it spells git merge --squash. This is not a merge, in the same way that fast-forwarding is not a merge: there's no final merge commit. But it is a merge, in the same way that git merge does a pair of diffs and combines work: there's a combining-work part.

Given:

...--G--H--I--J   <-- br1 (HEAD)
         \
          K--L   <-- br2

if we run git merge --squash br2, Git will:

  • find merge base H as usual;
  • do two diffs as usual;
  • combine the diffs, and stop if there's a merge conflict and make us fix up the mess; or
  • if there is no conflict, stop anyway.

This "stop anyway" is, I believe, just an accident of the original implementation—git merge has a --no-commit flag to make it stop, and that should be separate from --squash, but --squash always turns on --no-commit. In any case, though, we must now finish the operation by running git commit, which commits what's in Git's index and your working tree as usual, and does not make a merge commit. Instead of a merge M with two parents, we get a simple, ordinary commit S—the "squash merge"—with one parent as usual:

...--G--H--I--J--S   <-- br1 (HEAD)
         \
          K--L   <-- br2

The snapshot in new commit S is the same as the snapshot we would have, had we done a regular merge, but S doesn't link back to L, and the only useful thing to do with the name br2 now is to delete it.7 The two commits, K-L, have in effect been combined, or squashed, into the single commit S. Commit S has the same effect that K-L had, so K-L are now useless and should be forgotten.
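Both claims about S (one parent, same snapshot as a real merge) can be checked in a scratch repository. This is a sketch with invented file names, identity, and commit helper; the real merge is made on a detached HEAD purely for comparison.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
# Hypothetical helper: one new file per commit, so nothing ever conflicts.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q repo && cd repo
commit H; git branch -M br1
git switch -q -c br2; commit K; commit L
git switch -q br1; commit I; commit J

# Squash "merge": combines the work, then stops without committing.
git merge -q --squash br2 >/dev/null 2>&1
git commit -q -m "S: squash of br2"
squash_parents=$(git log -1 --format=%P)
squash_tree=$(git rev-parse 'HEAD^{tree}')

# For comparison only: a real merge of br2 into J, made on a detached HEAD.
git checkout -q --detach br1~1
git merge -q --no-edit br2
merge_tree=$(git rev-parse 'HEAD^{tree}')
git switch -q br1

echo "S parents:  $squash_parents"
echo "S tree:     $squash_tree"
echo "merge tree: $merge_tree"
```

S has J as its only parent, yet its tree is identical to the one the true merge commit has.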


7It's possible to do otherwise, but that's beyond the scope of this answer.


How squash-merge relates to GitHub PRs

Should someone upstream take your PR and squash-merge it, what you gave them—perhaps as multiple commits—is now replaced with a single squash commit:

          I--J   <-- refs/pull/125/head
         /
...--G--H--K--L--S   <-- theirbranch

Here, their commit S represents the work you did in I-J. All your actual work, and at least some of your commit log messages, are replaced by their squash (they may or may not keep some of your commit log message(s)). You should obtain commit S (into your upstream/theirbranch) and use that, abandoning your I-J originals.

What rebase-and-merge does to your PR

Should someone upstream take your PR and use the REBASE AND MERGE button, GitHub's Git software will copy each of your commits to new-and-?improved? commits. They will do so even if there's no real need. For instance, you might have carefully rebased your I-J onto their L so that you had:

                I'-J'  <-- refs/pull/125/head
               /
...--G--H--K--L   <-- theirbranch

at the time you made PR#125. But they clicked the "wrong"8 button because they like linear graphs, so now in their repository they have:

                I'-J'  <-- refs/pull/125/head
               /
...--G--H--K--L--I"--J"  <-- theirbranch

where I" is a copy of I', and J" is a copy of J'. These copies have your original log messages preserved, and have you as author, but they have new and different hash IDs.

You'll need to abandon your original commits in favor of these "new and improved" ones. There's a nice thing—another handy trick—in git rebase that makes it easier for you, though.


8I only call this "wrong" because it needlessly duplicates the commits. For the case where your PR hangs off commit H, the commits did really have to be rebased, to make the graph linear. The effect, though, is that you're the author and they're the committer and those commits have new and different hash IDs.


Handy trick: rebase knows copies

When someone has done this kind of "rebase and merge", you can almost always just use git rebase yourself to replace your original commits—even if you yourself rebased them several times—with their new rebased copies. The reason is that when Git goes to list out commits to copy, it checks to see if those commits are already in the place you're copying to. That is, suppose you have this on your laptop:

               K--L   <-- feature2 (requires feature)
              /
          I--J   <-- feature
         /
...--G--H   <-- upstream/theirbranch

You now make a PR out of feature, and they take it as is but use the "wrong" button so that your git fetch upstream results in this:

               K--L   <-- feature2 (requires feature)
              /
          I--J   <-- feature
         /
...--G--H--I'-J'  <-- upstream/theirbranch

You now have to make commits K'-L' atop commit J'.

You can do this explicitly (git switch feature2; git rebase --onto upstream/theirbranch feature), but:

git switch feature2
git rebase upstream/theirbranch

will do the job. The reason is that Git lists out the four commits I-J-K-L to copy, first, but then looks at commits I'-J' and figures out that those are copies of I and J. The rebase code uses this to drop commits I-J entirely, resulting in:

          I--J   <-- feature
         /
...--G--H--I'-J'  <-- upstream/theirbranch
                \
                 K'-L'  <-- feature2 (HEAD)

In fact, if you also run git switch feature and then git rebase upstream/theirbranch, your Git will simply drop commits I-J from the copying process, leaving you with this:

...--G--H--I'-J'  <-- upstream/theirbranch, feature (HEAD)
                \
                 K'-L'  <-- feature2

This doesn't (quite) work if someone had to manually fix up at least one of the commits. In the old days, before GitHub acquired some extra tools, this could never happen directly on GitHub. Now (that they have these tools) it is at least theoretically possible.
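The copy detection can be watched in a scratch repository. In this sketch a local branch named theirbranch stands in for upstream/theirbranch, and the "rebase and merge" button is imitated with git cherry-pick under a different committer name, so the copies get new hash IDs; the names, identities, and commit helper are assumptions for the demo.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com
cd "$(mktemp -d)"
# Hypothetical helper: one new file per commit, so nothing ever conflicts.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

git init -q repo && cd repo
commit G; commit H; git branch -M theirbranch
git switch -q -c feature;  commit I; commit J
git switch -q -c feature2; commit K; commit L

# Imitate "rebase and merge": I-J are copied to I'-J' by another committer.
git switch -q theirbranch
GIT_COMMITTER_NAME=upstream git cherry-pick theirbranch..feature >/dev/null

# Rebase lists I-J-K-L, notices I and J are already there, copies only K-L.
git switch -q feature2
git rebase -q theirbranch 2>/dev/null
copied=$(git log --reverse --format=%s theirbranch..feature2 | tr '\n' ' ')
echo "copied onto theirbranch: $copied"

# Rebasing feature itself leaves nothing to copy at all.
git switch -q feature
git rebase -q theirbranch 2>/dev/null
git log --oneline --graph feature feature2
```

After both rebases, feature names the same commit as theirbranch, and feature2 adds just K'-L' on top.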

torek
0

When you open a pull request, you propose to merge one of your branches into a branch of the original repository. Every time you update your branch, the proposed merge is updated too. This is quite useful when you push a fix or an update after review.

There are several solutions for your case. The simple one: close your pull request, then create one branch per topic you want to submit (each branch based on the trunk of the forked repository).

Second solution: create a branch to keep your extra work, go back to the main branch (or master), force the already submitted branch back to the commit you originally meant to submit, and push it:

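git checkout -b my_second_feature   # keep the extra work on a new branch
git checkout main                   # go back to the branch used for the PR
git reset --hard <commit_sha>       # move it back to the last commit meant for the PR
git push -f                         # force-update that branch on your fork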
Ôrel
  • I am new to all of this, so forgive my stupidity. Let me see if I understand this correctly: When I fork a repo, I am creating a local copy for me to "play with" as it were. A "pull request" means I want to merge *my entire branch* back to the project main, not just the file I'm working on? (Seems a bit extreme to me, but then what do I know, 'eh?) – Jim JR Harris Feb 17 '22 at 14:56
  • . . .So, if I want to make multiple fixes on the same codebase, I have to open multiple branches, one for each separate fix that will become a separate pull request? The idea being to leave the main branch of my fork alone, create branches and then, as stuff gets merged, re-pull the main? – Jim JR Harris Feb 17 '22 at 15:01
  • Yep, that's it. I think you imagine a branch as a very big thing; it is just a label pointing to a commit – Ôrel Feb 18 '22 at 15:10