Some of this depends on just how much trust you want to give to Git, because in the bad old days (git 1.5 or 1.6 or so) I have seen Git send objects to a remote that it should not have sent. So I, at least, would not be this trusting—but this is how it is supposed to work.
We need some definitions, or we will get tripped up by some not-so-great naming in Git. They are below (towards the end) so that you can skip them if you are already familiar with them.
The bottom line is that when you do the push, your Git will send, to the remote, your branch tip commit and all commits reachable from that point (i.e., all of its ancestors) up to whatever commits the remote already has. It will also include all files needed for those commits. It should send only those commits and files, i.e., if some sensitive file or data is not in those commits, and the remote does not have it already, the remote should not have it after the push.
Ultimately, this probably means that you should keep two different sets of branches for the two remotes. Or, better, don't put any sensitive data into Git in the first place. Keep it somewhere outside the repository. If you need a configuration file that may contain such data, include a sensitive-data-free sample configuration in the repository, and use .gitignore to avoid putting the real configuration into the repository.
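As a minimal sketch of that last suggestion: the file names config.ini and config.sample.ini are hypothetical, and the whole thing runs in a throwaway repository under a temporary directory.

```shell
set -eu
# Throwaway identity so the commit works on an unconfigured machine.
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"

printf 'password = CHANGE_ME\n' > config.sample.ini   # safe to publish
printf 'password = hunter2\n'   > config.ini          # real secret, stays local
printf 'config.ini\n'           > .gitignore

git add .            # picks up .gitignore and the sample, skips config.ini
git commit -q -m 'Add sample configuration'

# List what is actually tracked; config.ini should not appear.
tracked=$(git ls-files)
printf '%s\n' "$tracked"
```

Since config.ini never becomes a blob in the repository, no push can send it anywhere.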
Discussion
Remember that a branch name "points to" a particular commit (by containing its ID):
$ git rev-parse master
3ad15fd5e17bbb73fb1161ff4e9c3ed254d5b243
and that commits point to their parent commit(s) (by containing their IDs), as well as to trees that point to blobs that get you the files that go with those commits. To work with commits, Git needs some starting point, such as this SHA-1 hash 3ad15fd.... It will work backwards from there, following all of these pointers, in order to check out any particular commit. Each ID (of a commit, tree, or file) gives Git the "true name" of the underlying object, and Git uses this true name to extract the actual contents.
Hence, if there is sensitive data in some file, the way Git stores it is under some blob ID. The way you see it through Git is by starting Git with the ID of a commit, which Git uses to find the tree and blob IDs. Git then extracts the blob object, using the name specified by the tree(s), so that you now have a file with the right name and with those sensitive data contents.
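To make the chain of pointers concrete, here is a small sketch that builds a one-commit throwaway repository and walks from the commit to its tree to the blob. The file name myfile.txt and its contents are made up for the illustration.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
printf 'hello\n' > myfile.txt
git add myfile.txt
git commit -q -m 'first'

commit=$(git rev-parse HEAD)              # what the branch name points to
tree=$(git rev-parse 'HEAD^{tree}')       # the commit points to a tree
blob=$(git rev-parse 'HEAD:myfile.txt')   # the tree maps myfile.txt to a blob

git cat-file -t "$commit"   # prints: commit
git cat-file -t "$tree"     # prints: tree
git cat-file -t "$blob"     # prints: blob
git ls-tree "$tree"         # mode, type, blob ID, and file name

# Extract the contents using only the blob's "true name".
content=$(git cat-file -p "$blob")
printf '%s\n' "$content"
```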
If Git does not have the blob object at all, then clearly, you cannot get it out of Git. If it has the blob object, but no commits point (through trees) to it, you cannot get it by file name, but you can, with some maintenance commands, have Git show you every blob's raw ID and extract them all by ID and thus find that data. (Usually it's even easier than that: just run git fsck --lost-found.)
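A rough sketch of that situation, in a throwaway repository: a blob is written into the object store without ever being committed, and git fsck --lost-found still finds it. The "secret" contents are invented for the example.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
printf 'public\n' > README
git add README
git commit -q -m 'public work'

# Store a blob that no tree or commit refers to.
blob=$(printf 'secret: hunter2\n' | git hash-object -w --stdin)

# fsck reports it as dangling; --lost-found also saves a copy
# under .git/lost-found/other/.
report=$(git fsck --lost-found 2>/dev/null)
printf '%s\n' "$report"

# The data is still retrievable by ID even though no file name leads to it.
recovered=$(git cat-file -p "$blob")
printf '%s\n' "$recovered"
```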
This means that to ensure the sensitive data are not in some Git repository, you must make sure that the blob itself is not there. This also means that any commits that refer to the blob must not be there, as Git won't allow you to have a commit that has "missing" blobs. (Such a repository is considered broken: you get errors trying to use it, though Git will do its best to let you recover whatever data you can.)
git push syntax and semantics
The syntax for git push, simplified to only what we care about here, is:
git push remote refspec
When you run this, Git starts by connecting to the named remote according to its url setting (actually, its pushurl if set, falling back to url if not). The remote, usually some far-away host such as GitHub, runs a command, typically git receive-pack, which then reads requests from your own git push.
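The url/pushurl fallback can be checked without contacting anything. This is only a sketch: both example.com URLs are placeholders, and git remote get-url needs a reasonably recent Git.

```shell
set -eu
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"

# Placeholder URL; nothing is contacted by these commands.
git remote add origin git://example.com/path/to/repo.git

# With no pushurl set, pushes fall back to url.
default_push=$(git remote get-url --push origin)

# Now set a separate pushurl (also a placeholder).
git remote set-url --push origin ssh://git@example.com/path/to/repo.git
fetch_url=$(git remote get-url origin)
push_url=$(git remote get-url --push origin)

echo "push URL before pushurl was set: $default_push"
echo "fetch URL: $fetch_url"
echo "push URL:  $push_url"
```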
Your git push starts by sending the IDs of commit objects, essentially offering those objects to the receive-pack running on the remote. The remote replies with either "I have that already" or "ok, I'll take that, send it to me." It is your Git's job to offer only those objects that are needed to complete the push. Your Git should start by offering the commit ID found by parsing the src part of the refspec you supplied, and then offer trees and blobs needed to complete that commit, and ancestors of that commit, until the points where the remote says "I have that one already". This is how your Git knows what to send, and is able to send only whatever is not already on the remote. [1]
Your Git then packages those objects and sends the package. This should contain only those objects that were offered and accepted. These may be further compressed against objects that your Git knows their Git has, based on their Git's claims to already have them.
(This is where, in the bad old days, things seemed to not work as desired, because my sending machine would send unnecessary, and in fact unreachable, objects to the remote. This is not supposed to happen. I'm not sure if it was a failure in the offer/accept phase or in the pack-building phase, or if it was caused by other rather unorthodox stuff I was doing at the time.)
Finally, your Git sends their Git a request that amounts to "now please set your reference ref to a particular hash", where ref is the name from the dst part of the refspec, and the hash is the ID your Git found by parsing the src part of the refspec. Their Git can either allow this or reject the request, based on whatever rules they set up. (The default rule is to allow it for branches if and only if it is a fast-forward, or a new branch creation. I'm also glossing over deletions here.)
There is a bit of magic involved to turn a short dst name into a full-name reference: Git checks your src to see if it is a branch or tag, and expands the dst to start with refs/heads/ or refs/tags/ as needed. If you give a full-name dst, your Git skips this step. If you omit the :dst part of the refspec entirely, your Git constructs the full branch or tag name for their Git according to some rather complicated rules. For branches, though, the result is usually just the same full name as for your own branch.
In other words, if you do:
git push remote1 mybranch:theirbranch
then your Git will call up the Git for remote1 over the Internet-phone (assuming a remote URL), package up all the objects (commits, trees, and files/blobs) that they do not already have and would need for your mybranch, and ask them to make their branch theirbranch point to that commit.
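To see the short-name expansion of dst in action, here is a sketch that uses a local bare repository as a stand-in for remote1. The tag v1 and the destination name release-1 are invented for the example.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q --bare "$tmp/remote1.git"     # stand-in for a far-away host
git init -q "$tmp/work"
cd "$tmp/work"
git symbolic-ref HEAD refs/heads/mybranch # name the initial branch
printf 'x\n' > f
git add f
git commit -q -m 'one'
git tag v1

git remote add remote1 "$tmp/remote1.git"
git push -q remote1 mybranch:theirbranch  # src is a branch
git push -q remote1 v1:release-1          # src is a tag

# Ask the "remote" which full reference names it ended up with.
refs=$(git -C "$tmp/remote1.git" for-each-ref --format='%(refname)')
printf '%s\n' "$refs"
```

The branch source should land under refs/heads/ and the tag source under refs/tags/, even though neither dst was spelled out in full.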
If you then do:
git push remote2 differentbranch:theirbranch
your Git will call up the Git on remote2 and send it whatever objects it needs to match up with your differentbranch, and ask them to set their branch theirbranch to point to the ID that differentbranch names.
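The two pushes can be tried out with local bare repositories standing in for remote1 and remote2. In this sketch, only differentbranch contains a (made-up) secret file, and git cat-file -e is used to check whether each remote ended up with its blob. This is a test of how things are supposed to work on one Git version, not a guarantee.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q --bare "$tmp/remote1.git"
git init -q --bare "$tmp/remote2.git"
git init -q "$tmp/work"
cd "$tmp/work"
git symbolic-ref HEAD refs/heads/mybranch
printf 'public\n' > README
git add README
git commit -q -m 'public work'

git checkout -q -b differentbranch
printf 'password = hunter2\n' > secret.txt
git add secret.txt
git commit -q -m 'add secret'
secret_blob=$(git rev-parse 'HEAD:secret.txt')

git remote add remote1 "$tmp/remote1.git"
git remote add remote2 "$tmp/remote2.git"
git push -q remote1 mybranch:theirbranch
git push -q remote2 differentbranch:theirbranch

# cat-file -e exits 0 only if the object exists in that repository.
has_blob() {
    if git -C "$1" cat-file -e "$secret_blob" 2>/dev/null; then
        echo yes
    else
        echo no
    fi
}
in_remote1=$(has_blob "$tmp/remote1.git")
in_remote2=$(has_blob "$tmp/remote2.git")
echo "remote1 has the secret blob: $in_remote1"
echo "remote2 has the secret blob: $in_remote2"
```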
You can also, in this particular case (this hash is from the repository for git itself), do:
git push remote3 3ad15fd:refs/heads/branch
Note that this time, you have specified a raw commit ID, so you must spell out the full name of the branch for the remote. As before, this will call up the remote, converse with it to see if it already has 3ad15fd5e17bbb73fb1161ff4e9c3ed254d5b243, and send it that commit if needed, along with any ancestor commits, trees, and blobs it needs. Finally it will send a request for their Git to set their branch (named branch) to 3ad15fd5e17bbb73fb1161ff4e9c3ed254d5b243. If they accept, they now have that commit and all its ancestors (including each merged-in commit and its ancestors), along with all the trees and blobs for all those commits.
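The same kind of push by raw ID can be sketched against a local bare repository standing in for remote3. The hash here is whatever the throwaway repository produces, not the one from the Git repository itself.

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q --bare "$tmp/remote3.git"
git init -q "$tmp/work"
cd "$tmp/work"
printf 'one\n' > f
git add f
git commit -q -m 'first'
printf 'two\n' > f
git commit -q -a -m 'second'
git remote add remote3 "$tmp/remote3.git"

full=$(git rev-parse HEAD)
short=$(git rev-parse --short HEAD)

# A raw ID carries no branch-or-tag hint, so dst is a full reference name.
git push -q remote3 "$short:refs/heads/branch"

# The remote's branch now names that commit, and its ancestor came along.
remote_tip=$(git -C "$tmp/remote3.git" rev-parse refs/heads/branch)
remote_count=$(git -C "$tmp/remote3.git" rev-list --count refs/heads/branch)
echo "remote tip: $remote_tip"
echo "commits reachable on the remote: $remote_count"
```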
[1] This skips over an important optimization: the remote actually starts by listing the SHA-1s and references it already has, which lets your Git not even bother offering those.
Definition: remote
A remote is simply a name, like origin or upstream, under which you store various entries in your local repository's configuration. The one that you must set (usually, initially, implicitly by cloning) is the url, which generally has the form git://..., http://..., or ssh://... Several other configuration entries live here as well, including one or more fetch = entries. Use git config --edit to view your configuration in your editor (be careful not to modify it), or just run less .git/config or similar to view it, and you will see things like:
[remote "origin"]
    url = git://some.host.somewhere.com/path/to/repo.git
    fetch = +refs/heads/*:refs/remotes/origin/*
Again, the remote is just the name, in this case origin.
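A quick way to see these entries without opening an editor, sketched in a throwaway repository (the URL is the same placeholder as above and is never contacted):

```shell
set -eu
tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"

# Adding a remote just records configuration under the chosen name.
git remote add origin git://some.host.somewhere.com/path/to/repo.git

# Show every configuration entry stored for the remote named origin.
entries=$(git config --get-regexp '^remote\.origin\.')
printf '%s\n' "$entries"
```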
Definition: branch
The word "branch" in Git is ambiguous. The meaning I use here is that of the moveable pointer, and more specifically the branch name, such as master or develop. This name has a "full name" variant, so that if you create a (regular, local) branch whose name is origin/Bruce (this is a bad idea, but it does happen by accident), you can name origin/Bruce (the branch) differently from origin/Bruce (a remote-tracking branch with the same short name, but a different full name). The full name is what you get if you write refs/heads/ in front, i.e., refs/heads/master or, in the case of the poorly named Bruce branch, refs/heads/origin/Bruce.
Git tends to strip off the refs/heads/ part when showing your branch names, since it is normally not needed. (It may matter with unfortunately-named branches like origin/Bruce, but Git still strips off the prefix.)
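The short and full names can be compared directly. This sketch creates the branches mentioned above, including the ill-advised local origin/Bruce, in a throwaway repository:

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
git symbolic-ref HEAD refs/heads/master
printf 'x\n' > f
git add f
git commit -q -m 'one'
git branch develop
git branch origin/Bruce     # a local branch with a misleading name

# Short names, as Git usually shows them.
git branch --list

# Full names of the same branches.
full_master=$(git rev-parse --symbolic-full-name master)
heads=$(git for-each-ref --format='%(refname)' refs/heads/)
echo "$full_master"         # prints: refs/heads/master
printf '%s\n' "$heads"
```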
Definition: remote-tracking branch
A remote-tracking branch is just a name whose full-name form starts with refs/remotes/ and then includes the name of a remote. Hence refs/remotes/origin/master is a remote-tracking branch for branch master as found on remote origin, while refs/remotes/origin/Bruce is a remote-tracking branch for branch Bruce. Git tends to strip off the refs/remotes/ when showing you such names.
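Cloning is the usual way these names come into existence. In this sketch a local repository plays the part of the remote, so the clone's remote is named origin by default:

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/upstream"
cd "$tmp/upstream"
git symbolic-ref HEAD refs/heads/master
printf 'x\n' > f
git add f
git commit -q -m 'one'
git branch Bruce

git clone -q "$tmp/upstream" "$tmp/clone"
cd "$tmp/clone"

# Full names of the remote-tracking branches the clone created.
tracking=$(git for-each-ref --format='%(refname)' refs/remotes/)
printf '%s\n' "$tracking"

# The short form Git normally shows, with refs/remotes/ stripped.
git branch -r
```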
Definition: hash (SHA-1) and object (commit, tree, blob, annotated-tag)
A Git hash is one of those big ugly SHA-1 strings like 3ad15fd5e17bbb73fb1161ff4e9c3ed254d5b243. These are the true names of the entities that Git keeps internally. Every commit has a unique hash; in fact, every unique object has a unique hash.
An object, in Git, is an internal storage item whose name is one of Git's hashes. There are four kinds of objects; the two most interesting ones here are "commits" (which store commits) and "blobs" (which store file content). The other two are "tree" objects (which provide name-to-ID mappings, so that Git can tell that, e.g., a blob named 9a31f... should be made accessible under the name myfile.txt) and annotated-tag objects (which store annotated tags, which normally point to commits).
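Because the hash is computed from the content, identical content always gets the same name, whatever file it came from. A small sketch using git hash-object, which computes a blob ID without storing anything unless -w is given:

```shell
set -eu
# Same bytes, hashed twice: the IDs must match.
id1=$(printf 'hello\n' | git hash-object --stdin)
id2=$(printf 'hello\n' | git hash-object --stdin)

# Different bytes: a different ID.
id3=$(printf 'goodbye\n' | git hash-object --stdin)

echo "hello   -> $id1"
echo "hello   -> $id2"
echo "goodbye -> $id3"
```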
Definition: reference
A reference, in Git, is just the general, full-name form used for branches and tags, and several other things as well. What we care about here is mostly just branches. You can usually use short names, and Git will figure out the full name for you; the exceptions occur when using plumbing commands like git update-ref, or when you have unfortunate names like the two origin/Bruces.
Definition: refspec
A Git refspec is just a pair of references separated by a colon (:), with an optional leading plus sign. The reference on the left is the src (source) reference, and the one on the right is the dst (destination). In some, but not all, cases, you can omit either src or :dst. When leaving out the :dst part, if there is no leading +, it looks just like a reference name, and you just have to know that it is actually a refspec instead.
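The shape of a refspec can be illustrated with a few lines of plain shell. The parse_refspec function is a hypothetical helper written for this answer, not anything Git provides, and it only splits the string; it does none of Git's name expansion.

```shell
# parse_refspec: hypothetical helper that splits [+]src[:dst] into parts.
parse_refspec() {
    spec=$1
    force=no
    case $spec in
        +*) force=yes; spec=${spec#+} ;;    # optional leading plus sign
    esac
    case $spec in
        *:*) src=${spec%%:*}; dst=${spec#*:} ;;
        *)   src=$spec; dst= ;;             # no :dst part given
    esac
    printf 'force=%s src=%s dst=%s\n' "$force" "$src" "$dst"
}

parse_refspec '+refs/heads/*:refs/remotes/origin/*'
# prints: force=yes src=refs/heads/* dst=refs/remotes/origin/*
parse_refspec 'mybranch:theirbranch'
# prints: force=no src=mybranch dst=theirbranch
parse_refspec 'master'
# prints: force=no src=master dst=
```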
Definition: ancestor
One commit is an ancestor of another if, by following all the various parent ID pointers, we can get from the second commit back to the first commit. For Git's purposes, a commit is also an ancestor of itself. Hence, in this sense, a commit is both its own ancestor and its own descendant, while its parent is only an ancestor. A grandparent is an ancestor; a child or grandchild is not.
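Git exposes this test as git merge-base --is-ancestor, which exits 0 when the first commit is an ancestor of the second. A sketch with a two-commit throwaway repository (is_ancestor is just a local wrapper for printing the result):

```shell
set -eu
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

tmp=$(mktemp -d)
git init -q "$tmp/repo"
cd "$tmp/repo"
printf 'one\n' > f
git add f
git commit -q -m 'first'
printf 'two\n' > f
git commit -q -a -m 'second'

parent=$(git rev-parse HEAD~1)
child=$(git rev-parse HEAD)

# Local wrapper: turn the exit status into yes/no.
is_ancestor() {
    if git merge-base --is-ancestor "$1" "$2"; then echo yes; else echo no; fi
}

is_ancestor "$parent" "$child"   # prints: yes
is_ancestor "$child"  "$child"   # prints: yes (a commit is its own ancestor)
is_ancestor "$child"  "$parent"  # prints: no
```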