I unfortunately have little to no experience with Git as for years I have used alternative offline source control products.

I have inherited a project which is managed through GitHub and I am just trying to work out the best practice for managing two separate repos (staging and production).

I have set up two different remotes, let's call them "origin" and "prod". "origin" points to the staging repo and "prod" points to the production repo.

I understand that I can select the remote I wish to "push" to:

git push origin main

But taking a step back from the push, when I add and commit, these aren't tied back to a repo. So is there a way I can group commit changes for specific repos?

For example,

git add README.md
git commit -am "FIRST COMMIT"

git add test1234.html
git add test5678.html
git commit -am "SECOND COMMIT"

Can I push both of these commits to the "origin" repo and just the first commit to the "prod" repo?

Sami.C
  • You are not _visualizing_ what goes on in git correctly. What you push into remotes are _branches_.... and branches are made up of commits one after the other (what you commit locally).... so, what you want to get is _the right sequence of commits_ in (two? I would assume) branches so that you can push each one to a separate repo. – eftshift0 Oct 26 '22 at 07:20

1 Answer

Can I push both of these commits to the "origin" repo and just the first commit to the "prod" repo?

Yes, but not sustainably long-term.

Here's the first set of things you need to know when using Git:

  • A Git repository is, at its heart, two databases:

    • One database contains commits and other internal Git objects. Each commit stores, indirectly, all the files that go with that particular commit; we'll talk about this more in a moment; but this means that the objects database holds the files, in a commit-oriented fashion. The objects in this objects database are read-only: nothing can change any object once it's stored, and the database itself is essentially append-only. Retrieving an object from the database requires knowing its "object number" (a big ugly random-looking hash ID).

      This first database is the one that's copied (without any changes: nothing in here can be changed) by cloning. The object numbers (especially those for commits) are universally unique: every Git repository everywhere must use the same number for the same commit, and must allocate a new, never-before-used number for any new commit.[1]

    • The other database is simpler: it contains names—human readable ones—grouped into things like branch and tag names. Each name stores one and only one object number, which turns out to be all we need. This database is entirely read/write; anyone can change anything in it. (Hosting sites like GitHub always add restrictions so that some random person off the internet can't go into your GitHub account and change or remove all your names. You wouldn't use a hosting site if it didn't do this!)

      This second database is not copied as-is by cloning, but it is readable by whoever makes a clone. A git clone command reads the names it wants from this database and rewrites them as it stores them, so that the clone can remember your branch names.

  • Cloning a Git repository means copying these databases. As noted above, one of them is (necessarily) copied as-is; the other typically gets modified. What were branch names become what I call remote-tracking names.[2] For instance, their main becomes your origin/main.[3]

  • That's all that's required of a repository, and some (e.g., server hosted) repositories may have just that (though all hosting servers add a bunch of non-Git features too). But you can't do any new work in such a repository, so when you clone one, you get a bit more than just the two databases.

This means that every clone initially has every commit,[4] but no branches (or at least no branch names: the word branch in Git is so overused as to be practically meaningless without context). Doing work in a repository with no branch names at all, however, is no fun at all. So your clone normally creates one branch name immediately after doing the database setup, before returning control to you. (You can turn this off with --no-checkout but there's rarely any reason to do that.)

The branch name you get in your clone selects the same commit as the remote-tracking name in your clone, and the remote-tracking name in your clone is built from their branch name, so your main and your origin/main name the same commit as their main and hence your main and their main are in sync. This assumes you're having your Git create your main: you get to choose which of their branch names your Git creates, using the -b option at git clone time, but if you don't use -b, your Git software asks their Git software what name they recommend. (On GitHub, you can set this with the web interface; GitHub call this the "default branch", which is what Git calls it for git clone as well.)
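
As a rough illustration of the cloning behavior described above, the following sketch builds a throwaway repository, clones it, and compares the names on both sides. The directory names (upstream, clone), the temporary directory, and the Demo identity are assumptions made up for this example.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# A tiny stand-in for "their" repository, with one commit on main.
git init -q upstream
git -C upstream symbolic-ref HEAD refs/heads/main
git -C upstream -c user.name=Demo -c user.email=demo@example.invalid \
    commit -q --allow-empty -m "A"

# Clone it: commits are copied unchanged, branch names are rewritten.
git clone -q upstream clone

# The clone's names database: one branch name plus remote-tracking names.
git -C clone for-each-ref --format='%(refname) %(objectname)'

theirs=$(git -C upstream rev-parse refs/heads/main)
tracking=$(git -C clone rev-parse refs/remotes/origin/main)
mine=$(git -C clone rev-parse refs/heads/main)
echo "their main:     $theirs"
echo "my origin/main: $tracking"
echo "my main:        $mine"
```

All three names should report the same hash ID, which is what "in sync" means here.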


[1] Furthermore, Git has to do this without consulting all the other Git repositories in the universe. This is mathematically impossible and the technique that Git uses to approximate it will someday break, but the size of the hash space is big enough that it's not a problem in practice. (We get a huge break as well by not smushing together unrelated repositories, so that we never notice accidental hash collisions in those unrelated repositories: the uniqueness constraint gets relaxed to apply only to related repositories.)

[2] Git calls these remote-tracking branch names. Old Git documentation rather sloppily called them "remote branches" or other silly phrases. They're not actually branch names at all: they're just your Git repository's memory of some other repository's branch names, transformed. The word branch is so badly overused in Git that it's better, in my opinion, to just drop the word branch here entirely.

[3] The pattern here is mostly clear: a branch name doesn't have a remote name like origin in front, and a remote-tracking name like origin/main does. But then you get weirdness like git pull origin main and you might start to wonder why this isn't origin/main here too. There are historical reasons for that, i.e., no good reason now except that Git has to be compatible with Git-the-way-it-was-used-in-2005, before "remotes" were invented.

[4] Technically, you get only the reachable commits, and you can snip off many of those with a "shallow" clone as well, but we won't get into these details.


The next things to know about Git

I've already mentioned that the true identity of a commit is its hash ID. For instance, d420dda0576340909c3faff364cfbd1485f70376 is a particular commit in the Git repository for Git. It has the same hash ID in every clone of that repository, so if you have a clone of the Git repository for Git, you will have this commit in your clone, or else your clone is out of date and you need to run git fetch to bring it up to date from a more-up-to-date copy.

The fetch and push commands are how we transfer commits. These commits, like all Git objects, are strictly read-only: once created, they can never be changed, and once they've been distributed Out There to other Git repositories it's generally difficult or even impossible to recall them (there are specific cases where you know how far they've spread and can stamp them out like some evil COVID strain, but given how frequently Git repositories have Git-sex with other related Git repositories, it's usually too late).

The fetch command generally means call up some other Git software and repository and get everything that they have, that I don't. You can limit the fetch somewhat, but that's not the norm. By contrast, the push command generally means call up some other Git software and repository and give them specific commits that I have that they don't. If we push the sex analogy a little too far, this means it's up to the "male" ("push") operations to be responsible.

Now, the thing about commits is that they don't just have a full snapshot of all the files, like a tar or zip or WinRAR archive or whatever. They also carry some metadata, or information about the commit itself. This includes the name and email address of the person who made the commit, for instance. But it also includes stuff that Git relies on internally: specifically, every commit has a list of raw hash IDs for earlier commits.

This list of earlier (or parent) commit hash IDs, stored in each commit, is how commits form history. Most commits contain exactly one hash ID: one parent commit. This forms a simple, backwards-looking, linear chain, and this explains how Git works and why the answer above is "yes but not sustainably".

Let's draw a picture of a tiny, three-commit repository, using uppercase letters to stand in for the commit's hash IDs (which in reality are big and ugly and impossible for humans to work with). We'll call the first commit A, the second one B, and the third one C, and draw them like this:

A <-B <-C

Commit C stores commit B's raw hash ID. We say that C points to B, and draw that as an arrow sticking out of C, pointing to B. What this means is that if we can somehow memorize the hash ID of commit C, and give that to Git, Git can retrieve commit C from its all-objects database, and that gives Git the hash ID of commit B, so that Git can retrieve commit B too.

Having retrieved both commits—and the two snapshots—Git can go on to compare the two snapshots, to see what changed between them. Git essentially plays Spot the Difference here.

Equally important, now that Git has commit B in hand, it can use the metadata in B to get A's hash ID, from which Git can get commit A. (Commit A is special: its list of parents is empty. Git can now stop going backwards.) So by memorizing one hash ID—that for commit C—we had Git find all the commits.
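
The backwards walk from C to B to A can be tried out with a small sketch like the one below. It assumes a scratch repository in a temporary directory and a made-up Demo identity, and it uses empty commits purely to keep the example short.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main
git config user.name Demo
git config user.email demo@example.invalid

# Three commits: A <- B <- C.
for msg in A B C; do
    git commit -q --allow-empty -m "$msg"
done

c=$(git rev-parse HEAD)      # hash ID of C
b=$(git rev-parse HEAD~1)    # C's parent, i.e. B
a=$(git rev-parse HEAD~2)    # B's parent, i.e. A

# The raw commit object includes a "parent" line holding B's hash ID.
git cat-file -p "$c"

# Starting from C alone, Git finds every commit; A has no parent,
# so the walk stops there.
count=$(git rev-list --count "$c")
root=$(git rev-list --max-parents=0 "$c")
echo "commits reachable from C: $count"
```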

The second database, holding branch names and other names, is how we have Git do the memorizing for us:

A--B--C   <-- main

The name main holds C's hash ID, so that it is easy to find commit C.

If we like, we can now create a second name, such as develop. We must pick any one of the existing commits so that the name develop selects that commit. The most obvious candidate is the newest (presumably latest = greatest, right?) commit, C, so that's the one we'll probably pick:

A--B--C   <-- develop, main

Now we need a way, in our drawing, to know which name we're going to use. This will be our current branch name. To draw which one is current, we'll attach the special all-caps name HEAD to just one branch name, like this:

A--B--C   <-- develop, main (HEAD)

This means we are using main and therefore using commit C.

The current branch and commit, and your working tree

Git shares a problem that all version control systems have: if previous versions are frozen for all time (and they are), how do we get any new work done? Git's answer is the usual one: there's an additional area, which Git calls your working tree or work-tree, that holds copies of all the files from the commit you've selected.

So: we use git switch or the older git checkout command to pick some particular branch name, and thereby some particular commit, and Git extracts all the files from that commit and puts them into our working tree. We now have all the files from commit C:

A--B--C   <-- develop, main (HEAD)

as we're "on" branch main and the name main selects commit C.

If we run git switch develop, we get:

A--B--C   <-- develop (HEAD), main

Git needs to attach the name HEAD to develop, remove all the commit-C files, and plug in, instead, all the commit-C files. Wait, those are the same files. Git doesn't need to bother to do anything to the files. So it doesn't, and this is (later) an important thing to know: if we switch branch names without switching commits, nothing at all happens in the working tree.
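
A minimal sketch of the same situation, assuming a scratch repository with three empty commits standing in for A-B-C (the Demo identity and temporary directory are made up for the example):

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main
git config user.name Demo
git config user.email demo@example.invalid
for msg in A B C; do
    git commit -q --allow-empty -m "$msg"
done

# A second name for the commit that main already selects.
git branch develop
main_id=$(git rev-parse main)
develop_id=$(git rev-parse develop)

# HEAD is attached to exactly one branch name at a time.
before=$(git symbolic-ref --short HEAD)
git switch -q develop
after=$(git symbolic-ref --short HEAD)

echo "main=$main_id develop=$develop_id"
echo "HEAD was attached to $before, now to $after"
```

Both names hold the same hash ID, so the switch only re-attaches HEAD.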

Most version control systems stop here, with the two copies of each file: the frozen one in the current commit, and the usable one in your working tree. Git goes on to add a third copy, in an area that Git gives three names, perhaps because it's so important, or perhaps because the original first name for it was so awful: this is the index or cache or staging area. All three names mean the same thing here; staging area refers to how you use it and is arguably the best name, but I tend to use index because it does extra stuff at various times. We won't get into any of the details here, but I always try to mention it since the index / staging-area is the key to making new commits.
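
To make the role of the index a little more concrete, here is a small sketch under stated assumptions: a scratch repository, a made-up Demo identity, and a hypothetical extra file notes.txt that is deliberately never staged. Only what git add copied into the index ends up in the commit.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main
git config user.name Demo
git config user.email demo@example.invalid

echo "# demo" > README.md
echo "draft" > notes.txt        # hypothetical file, never staged

git add README.md               # copy README.md into the index
staged=$(git diff --cached --name-only)

git commit -q -m "FIRST COMMIT" # the commit is built from the index
committed=$(git ls-tree -r --name-only HEAD)
untracked=$(git ls-files --others --exclude-standard)

echo "staged before commit: $staged"
echo "in the commit:        $committed"
echo "still untracked:      $untracked"
```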

Anyway, let's just assume for now that you know how to make a new commit, and let's go about making a new commit now, which we'll call D, since that's the next letter. In reality it will get a new, unique hash ID—it will depend on, among all the other details, the exact second at which you make the new commit, so I'd have to know that to predict it—so, well, D. New commit D will point backwards to existing commit C because C is the commit we're using when we make D. So let's draw commit D:

A--B--C
       \
        D

Oh dear, I left out the branch names. What happened to those? Well, we're not "on" branch main now, so nothing happens to that name. We are "on" branch develop now, so Git shoves the new commit's hash ID into that name. The result looks like this:

A--B--C   <-- main
       \
        D   <-- develop (HEAD)

Our new commit D links back to existing commit C, and our branch name develop locates commit D. If we make another new commit E we get:

        D--E   <-- develop (HEAD)
       /
A--B--C   <-- main

where I have for some reason drawn develop up top this time.

Now let's run git switch main:

        D--E   <-- develop
       /
A--B--C   <-- main (HEAD)

This time, we are changing commits, so Git removes all the commit-E files and swaps in all the commit-C files.[5] If we look at what we have in our working tree, it's gone back in time to commit C. (That's also the last commit in the repository we cloned, assuming we cloned a repository to get A-B-C originally.)

Let's make another new branch name, feature, now, and switch to that name. We can do this in two steps:

git branch feature
git switch feature

or one:

git switch -c feature

In either case we get:

        D--E   <-- develop
       /
A--B--C   <-- feature (HEAD), main

If we now make another two new commits, we get:

        D--E   <-- develop
       /
A--B--C   <-- main
       \
        F--G   <-- feature (HEAD)

This is really what commits and branches are about in Git. We make new commits, and Git makes the new commits link backwards to the old ones; as we make each commit, Git stuffs its new, unique hash ID into the current branch name, which is how we remember which commit is the latest.
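
The whole graph drawn above can be reproduced with a sketch along these lines. It assumes a scratch repository and a made-up Demo identity, and uses empty commits whose messages are the letters from the drawings; the commit helper is defined only for this example.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main
git config user.name Demo
git config user.email demo@example.invalid

# Helper for this sketch: make one empty commit with the given message.
commit() { git commit -q --allow-empty -m "$1"; }

commit A; commit B; commit C      # A--B--C on main

git switch -q -c develop          # new name, still at C
commit D; commit E                # only develop moves

git switch -q main
git switch -q -c feature          # new name, starting from C again
commit F; commit G                # only feature moves

git log --oneline --graph --all

fork=$(git merge-base develop feature)   # where the two lines diverge
```

The merge base of develop and feature should be commit C, the commit that main still names.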


[5] Git actually "cheats" here. Git de-duplicates files within and across commits, and because of the way it does that internally, Git knows, instantly, which files are identical in the two commits, when switching. So Git doesn't bother to swap out the files that haven't changed. This is a more general form of the "if we aren't changing commits, don't touch anything" case. It's also quite useful, for when we start making changes but forget to switch branch names first.


This in fact is how branch names are defined

The hash ID stored in a branch name is the last commit on that branch:

        D--E   <-- develop
       /
A--B--C   <-- main
       \
        F--G   <-- feature (HEAD)

Here, G is the last commit on feature. If we force Git to store D into develop and F into feature—there are various ways to do that—we get:

          E   ???
         /
        D   <-- develop
       /
A--B--C   <-- main
       \
        F   <-- feature (HEAD)
         \
          G   ???

It's as if commits E and G no longer exist. We can't find them, because we find commits using branch names and the names no longer point to them. (If we've memorized their hash IDs, we can use those to find them, for a while at least.[6]) The branch develop ends at D now, not at E. Note that commits A-B-C are on all three branches.
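
One of the "various ways" to force the names backwards is sketched below, assuming the same scratch setup as in the drawings (empty commits, a made-up Demo identity, a helper defined only for this example). git branch -f moves a branch that is not checked out, and git reset --hard moves the current one.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main
git config user.name Demo
git config user.email demo@example.invalid
commit() { git commit -q --allow-empty -m "$1"; }

commit A; commit B; commit C
git switch -q -c develop; commit D; commit E
git switch -q main
git switch -q -c feature; commit F; commit G

# "Memorize" the hash IDs of E and G before hiding them.
e=$(git rev-parse develop)
g=$(git rev-parse feature)

git branch -f develop develop~1   # develop now names D
git reset -q --hard HEAD~1        # feature (current branch) now names F

# E and G are still in the objects database ...
kind=$(git cat-file -t "$e")
# ... but no branch name leads to them any more.
holders=$(git branch --contains "$e")
echo "E is still a $kind; branches containing E: '${holders}'"
```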

Since we've never sent commits E and G to any other Git repository, we can be sure that they won't be in any other Git repository. Neither are D and F yet, but we can now use git push to send either or both of D and/or F to some other Git:

git push origin develop

sends commit D to the Git repository over at origin, whatever URL that might be. Then it asks that Git software to create or update its branch name develop to point to commit D. Or, if we decide develop is the wrong name, we can run:

git push origin develop:feature-1

to send D to them, but ask them to create or update their branch name feature-1. There's no need to use the same name on both sides—well, except, perhaps, to retain your own sanity. (If you're going to rename this feature-1, it might be better to do that locally first, then use git push origin feature-1.)

(Note that we don't have to be "on" any particular branch to git push it either, and—for reasons we won't get into here—the first time we git push a branch name we probably want to add the -u option.)

If we like, we can actually send both D and F in one git push:

git push origin develop feature

which sends the commits we have that they don't, and that they'll need (i.e., D and F, since they gave us A-B-C in the first place), and asks them to create or update their develop and feature names to remember D and F specifically.
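
The push forms above can be exercised without a hosting site. This sketch assumes a local bare repository (origin.git in a temporary directory) standing in for the server, plus a made-up Demo identity; it pushes the local develop under the different name feature-1 on the other side.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# A bare repository plays the part of the hosted "origin".
git init -q --bare origin.git

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main
git config user.name Demo
git config user.email demo@example.invalid
git remote add origin "$tmp/origin.git"

git commit -q --allow-empty -m "A"
git push -q origin main

git switch -q -c develop
git commit -q --allow-empty -m "D"

# Send develop's commits, but ask them to set THEIR name feature-1.
git push -q origin develop:feature-1

theirs=$(git -C "$tmp/origin.git" rev-parse refs/heads/feature-1)
ours=$(git rev-parse develop)
echo "their feature-1: $theirs"
echo "our develop:     $ours"
```

After this, the other repository has a feature-1 name but no develop name, even though the commits came from the local develop.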


[6] If you can't find a commit in your repository, and you leave things that way long enough, Git may decide that you don't really want it back after all, and can delete it. Any other repository that has it can send it back to you later, after all, and you've indicated your lack of interest now by making it invisible. This is technically a garbage collection operation, done by git gc. When and whether any particular Git software does this GC is up to that particular Git software.


Multiple remotes

You can have more than one remote. All a remote is, is a short name for a URL, by which your Git:

  • remembers that URL for you, and
  • when you use git fetch remote, calls up that other Git software, gets all the commits they have that you don't, and updates your remote/* names.

So you can:

git remote add prod <url>

and then git fetch prod to get from that other Git any commits they have that you don't, and to create or update your prod/* remote-tracking names.

But: if the repository at that URL is related to the repository at your existing origin URL, and in fact got all its commits from the repository at your origin URL, you will already have all those commits. You got them when you got all the commits from origin. So the set of commits you need is empty. Your Git will still create or update all your prod/* names, and since this is the first time you've done this with the remote name prod, that means your Git will create them all.

The thing is, all those commits will literally be the same commits that were on origin.
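
A sketch of that situation, assuming two local bare repositories (staging.git and prod.git, names invented for this example) where the second was populated entirely from the first: fetching from prod brings in no new commits and only creates the prod/* names.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

git init -q --bare staging.git
git -C staging.git symbolic-ref HEAD refs/heads/main

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main
git config user.name Demo
git config user.email demo@example.invalid
git remote add origin "$tmp/staging.git"

git commit -q --allow-empty -m "A"
git commit -q --allow-empty -m "B"
git push -q origin main

# The production repository got all of its commits from staging.
git clone -q --bare "$tmp/staging.git" "$tmp/prod.git"

commits_before=$(git rev-list --all --count)
git remote add prod "$tmp/prod.git"
git fetch -q prod                 # creates prod/main; nothing new to copy
commits_after=$(git rev-list --all --count)

echo "origin/main: $(git rev-parse origin/main)"
echo "prod/main:   $(git rev-parse prod/main)"
```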

This sort of thing means that there's no point in having a separate repository for the production system. Just have one repository, and use different names to select the "most recent" development / test commit, and the "most recent" production commit.
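
One possible shape for the single-repository approach, offered as a sketch rather than a prescription: main holds the latest staging work, and a second branch name, here called production (a name chosen only for this example), is moved forward when a commit is ready to go live. The bare origin.git and the Demo identity are likewise assumptions.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q --bare origin.git

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main
git config user.name Demo
git config user.email demo@example.invalid
git remote add origin "$tmp/origin.git"

git commit -q --allow-empty -m "FIRST COMMIT"
git commit -q --allow-empty -m "SECOND COMMIT"

# production lags one commit behind main.
git branch production main~1
git push -q origin main production
live_before=$(git -C "$tmp/origin.git" rev-parse refs/heads/production)

# Later: promote everything on main to production.
git branch -f production main
git push -q origin production
live_after=$(git -C "$tmp/origin.git" rev-parse refs/heads/production)
```

Each server would then deploy from its own branch name in the same repository.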

It's certainly possible to do what you're describing. There's just no real reason to bother.
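
For completeness, here is a sketch of the two-remote version from the question, assuming local bare repositories staging.git and prod.git in place of the two GitHub repositories and a made-up Demo identity. Both commits go to origin, and only the first goes to prod. The full name refs/heads/main is spelled out on the right-hand side because prod has no main yet and main~1 is not itself a branch name.

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q --bare staging.git
git init -q --bare prod.git

git init -q repo
cd repo
git symbolic-ref HEAD refs/heads/main
git config user.name Demo
git config user.email demo@example.invalid
git remote add origin "$tmp/staging.git"
git remote add prod "$tmp/prod.git"

echo "# demo" > README.md
git add README.md
git commit -q -m "FIRST COMMIT"

echo "<p>1234</p>" > test1234.html
echo "<p>5678</p>" > test5678.html
git add test1234.html test5678.html
git commit -q -m "SECOND COMMIT"

git push -q origin main                     # both commits
git push -q prod main~1:refs/heads/main     # first commit only

prod_files=$(git -C "$tmp/prod.git" ls-tree --name-only refs/heads/main)
echo "files on prod: $prod_files"
```

Because every commit records its parent, prod's main can only advance along this same chain: a later third commit cannot be sent to prod without the second one coming along too.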

torek
  • wow firstly thank you for putting so much effort into this response. I have read it once and got a little bit lost so I will give this another read shortly. Just to clarify your last statement, there is no point having two repositories - the current environment has two separate servers each pulling their code-base from separate GitHub repositories. One server points to the 'production' repo and the other points to the 'staging' repo. I'm not sure changing this to look at different (branches ??) will be easy to manipulate. With this in mind... help?? – Sami.C Oct 26 '22 at 10:49
  • Actually I think "GitHub Environments" might be the path the previous developers went down... – Sami.C Oct 26 '22 at 11:24
  • Let me put it this way: either the commits in the two server-side repositories *are* partly the same and thus have strong family relationships, in which case there's no point in having separate repositories ... or, the commits are *completely unrelated*, in which case you don't want to use a single repository on your (client) side in the first place. (And: I'm not sure what you mean by "GitHub Environments". GitHub provide environment variables holding secrets, so that you don't have to put secrets such as passwords into a repository. Putting secrets into a repository is generally a bad idea.) – torek Oct 26 '22 at 23:31
  • I'm referring to this: https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment When I look in the previous developers repositories, the repo counter says "1" but when I click into it there are 2. When I click into the production one, there is an "Environment" section. If I go into the staging actions, there is a "deployment" section. It *appears* as though I can manually deploy specific commits to the production. Have I misunderstood? – Sami.C Oct 27 '22 at 00:12
  • Aha: those are features of "GitHub Actions". These Actions are not part of Git, they're a GitHub add-on. You simply store a file with a magic name (`.github-whatever`) and GitHub notice that such a file exists in some particular commit in some particular repository. They have their software read this file and do operations based on the file's content. There's nothing "Git-ty" going on here, except that each commit that has one of these files is obviously a candidate for action. I'm not a GitHub actions expert and don't know precisely how GitHub choose which commits' file(s) are used. – torek Oct 27 '22 at 00:15