Note: if this is TL;DR, skip down to the last section, How to fix it (but it will make more sense if you read the preceding ones).
What you need to understand is that git filter-branch copies commits. That is, it takes each existing commit, applies some filter or set of filters to it, and makes a new commit from the result. That's how you ended up with two sets of commits. This is necessary because it is in no one's power, not even Git's, to change anything about any existing commit.
The filtered commits are a new history, largely independent of the original history. (Some details depend on the precise filters and commit inputs.) It's worth keeping in mind that a Git repository does not contain files, precisely; it contains commits, and the commits are the history. Each commit contains a snapshot—so in that sense, the repository does contain files, but they're one step below the overview, which is on a commit-by-commit basis.
Every commit has a unique hash ID. These are the big long ugly names you see in git log output: commit b7bd9486b055c3f967a870311e704e3bb0654e4f and so on. Git uses this unique ID to find the commit object, and hence the files; but the hash ID itself is simply a cryptographic checksum of the full contents of the commit. Each commit lists the hash ID of its parent commit (or commits) as well, and the parent hash (and the snapshot hash) is part of the commit's contents. This is why Git can't change anything about a commit: if you take the contents, and change anything, even a single bit, and make a new commit out of that, you get a new, different hash ID, which is a new, different commit.
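To make the "checksum of the full contents" idea concrete, here is a small sketch of how a SHA-1 based repository derives an object ID: it hashes a short header (object type and length), a NUL byte, and then the object's bytes. The commit body below is made up for illustration (the author, timestamps, and message are placeholders), and repositories using a different hash algorithm would differ in the hash function.

```python
import hashlib

def git_object_id(kind: str, body: bytes) -> str:
    """Hash an object the way SHA-1 Git does: "<type> <size>", NUL, then the body."""
    header = f"{kind} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()

# A made-up commit body. The tree ID is the empty tree; the author,
# timestamps, and message are placeholders for illustration only.
commit_body = (
    b"tree 4b825dc642cb6eb9a060e54bf8d69288fbee4904\n"
    b"parent b7bd9486b055c3f967a870311e704e3bb0654e4f\n"
    b"author A U Thor <author@example.com> 1500000000 +0000\n"
    b"committer A U Thor <author@example.com> 1500000000 +0000\n"
    b"\n"
    b"Remove big file\n"
)

original_id = git_object_id("commit", commit_body)

# Change a single character of the parent line: the result is a different commit.
altered_body = commit_body.replace(b"parent b7", b"parent b8")
altered_id = git_object_id("commit", altered_body)

print(original_id)
print(altered_id)
print("same commit?", original_id == altered_id)
```

As a sanity check on the scheme, hashing an empty tree this way, `git_object_id("tree", b"")`, gives the well-known empty-tree ID used as the tree line above.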
Since each commit contains the ID of its parent(s), this means that if we somehow tell Git—by hash ID—which commit is the newest, it can pull that commit out and use it to find the second-newest commit:
... <--second-newest <--newest
The second-newest points back to the third-newest, and so on. If the chain is totally linear (if there are no branches and merges), we end up with a very simple picture:
A--B--C--D--E--F--G--H <-- master
Here, the name master remembers the actual hash ID of the latest commit, which we'll call H instead of coming up with its actual hash ID. Commit H remembers the hash ID of the previous commit G, which remembers the ID of F, and so on. Commit A is the very first commit, so it has no parent at all, which is where the backwards walk stops.
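The backwards walk described above can be sketched with a toy model. The single-letter "IDs" and the two dictionaries are stand-ins for illustration; real Git stores full hash IDs and commit objects.

```python
# Toy model: each commit "ID" maps to its parent ID (None for the root commit).
parent_of = {"A": None, "B": "A", "C": "B", "D": "C",
             "E": "D", "F": "E", "G": "F", "H": "G"}

# A branch name stores only one thing: the ID of the tip commit.
branches = {"master": "H"}

def history(branch):
    """Start at the branch tip and follow parent links until there is no parent."""
    commit = branches[branch]
    seen = []
    while commit is not None:
        seen.append(commit)
        commit = parent_of[commit]
    return seen

print(history("master"))  # newest commit first
```

Note that nothing in the model points forwards: the only way to find H is through the name master.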
Branching is just a matter of picking out some commit in the chain and creating a child that's not at the tip of master. For instance, suppose we leave master where it is, pointing to H, and make a new commit I on a new branch we call dev:
...--H   <-- master
      \
       I   <-- dev (HEAD)
If we then git checkout master and make a new commit J we get:
...--H--J   <-- master (HEAD)
      \
       I   <-- dev
Note that the act of putting new commits into the repository requires that we have Git change one of the names. We put new commit I in, and made Git change the name dev, which used to point to H along with master, so that dev points to (contains the hash ID of) I. Then we put new commit J in, making Git update master to point to J instead of H.
(The special name HEAD is simply attached to whichever branch-name is the one we want Git to update when we run git commit.)
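Here is a minimal sketch of that name-moving behavior, replaying the I and J example. `ToyRepo` and its methods are hypothetical names for this illustration, not Git's real implementation.

```python
class ToyRepo:
    """Toy model: commits are letters, branch names hold a tip, HEAD holds a branch name."""

    def __init__(self):
        self.parent = {"H": "G"}          # only the end of the chain matters here
        self.branches = {"master": "H"}
        self.head = "master"              # HEAD is attached to a branch name

    def checkout(self, name, create=False):
        if create:
            # A new branch starts out pointing at the current commit.
            self.branches[name] = self.branches[self.head]
        self.head = name

    def commit(self, new_id):
        # The new commit's parent is the current tip...
        self.parent[new_id] = self.branches[self.head]
        # ...and the branch that HEAD is attached to moves to the new commit.
        self.branches[self.head] = new_id

repo = ToyRepo()
repo.checkout("dev", create=True)
repo.commit("I")          # dev moves from H to I; master stays at H
repo.checkout("master")
repo.commit("J")          # master moves from H to J

print(repo.branches)
print(repo.parent["I"], repo.parent["J"])
```

Both I and J record H as their parent; only the names moved.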
Filter-branch
The filter-branch command iterates over some set of commits and copies them. That set is often all commits, depending on how you use it; you ran it over HEAD, which means the current branch, but perhaps you have only one branch name, master. It starts by listing, in the appropriate order, every commit hash ID that is to have the copying process applied. If all you have is a linear chain (like A-B-...-H), this is those IDs in that order. Let's assume this for simplicity.
Then, for each such commit, filter-branch:
- extracts the commit into a temporary area (or pretends to, for speed);
- applies your filter(s);
- uses git commit or equivalent (depending on filters, again) to make a new commit that preserves every unchanged bit, but keeps whatever changes are made.
If the new commit is 100% identical, bit-for-bit, to the original, the new hash ID is the original hash ID. Let's say that happens for A itself: there are no changes to make, so Git re-uses the ID. The repo contents now look like this:
A--B--C--D--E--F--G--H   <-- [original master]
 .
  ...<-- [new master, being built]
Then Git moves on to the next commit hash ID in the list, which is B. Let's say that the filter makes some change this time (removing a big file), so that the new commit has a new, different hash ID, which we'll call B':
A--B--C--D--E--F--G--H   <-- [original master]
 \
  B'   <-- [new master, being built]
Filter-branch moves on to C. Even if it has no change to make to C's snapshot, filter-branch is forced to make one change now: it must make a new C' whose parent is B', because something happened to B. So now we get C':
A--B--C--D--E--F--G--H   <-- [original master]
 \
  B'-C'   <-- [new master, being built]
This repeats for all the remaining commits. All of them get new hash IDs, maybe in part because something in the snapshot changed, but certainly because their parent hash also changed. At the end, git filter-branch rewrites the name master itself to point to the final copied commit, H':
A--B--C--D--E--F--G--H   <-- [original master, now in refs/original/]
 \
  B'-C'-D'-E'-F'-G'-H'   <-- master
All of this happens purely in your local repository—no other Git, no clone of the original repository, knows that any of this has occurred.
(Note that if you do multiple filter-branch operations, each one copies the chain of commits. Some of the intermediate results may be of no real value. Git will eventually garbage collect the unused and unreachable commits, typically after about a month. Since filter-branch copies things, you will see space usage increase a bit, rather than decrease, until the eventual garbage collection and subsequent rebuilding of pack files.)
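The copy-and-re-parent loop can be simulated in a few lines. This is a toy sketch, not how filter-branch is actually implemented: `commit_id`, `build`, and `filter_branch` are hypothetical helpers, the chain is shortened to A through D, and the file names are invented. The point is only that a commit's ID covers its parent ID, so one changed commit forces new IDs on everything after it.

```python
import hashlib

def commit_id(parent, files, message):
    """Toy hash over everything a commit records: parent ID, snapshot, message."""
    text = repr((parent, sorted(files.items()), message))
    return hashlib.sha1(text.encode()).hexdigest()

def build(chain):
    """Turn [(message, files), ...] into a linear list of (id, parent, files, message)."""
    commits, parent = [], None
    for message, files in chain:
        cid = commit_id(parent, files, message)
        commits.append((cid, parent, files, message))
        parent = cid
    return commits

def filter_branch(commits, tree_filter):
    """Copy each commit in order, attaching every copy to the previous copy."""
    copies, new_id_of = [], {}
    for old_id, old_parent, files, message in commits:
        new_parent = new_id_of.get(old_parent)    # None for the root commit
        new_files = tree_filter(dict(files))      # filter a copy of the snapshot
        new_id = commit_id(new_parent, new_files, message)
        new_id_of[old_id] = new_id
        copies.append((new_id, new_parent, new_files, message))
    return copies

# Invented history: big.bin is added in B and deleted again in D.
original = build([
    ("A", {"README": "hello"}),
    ("B", {"README": "hello", "big.bin": "x" * 20}),
    ("C", {"README": "hello", "big.bin": "x" * 20, "main.c": "int main;"}),
    ("D", {"README": "hello", "main.c": "int main;"}),
])

def drop_big_file(files):
    return {name: data for name, data in files.items() if name != "big.bin"}

rewritten = filter_branch(original, drop_big_file)

for old, new in zip(original, rewritten):
    status = "re-used" if old[0] == new[0] else "new ID"
    print(old[3], old[0][:7], "->", new[0][:7], status)
```

In this run A keeps its ID because nothing about it changed. D's snapshot is untouched by the filter, yet D still gets a new ID, purely because its parent is now C' rather than C.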
Where things went wrong
Where things went wrong is definitely not where you think; I think the problem most likely occurred here:
After that I clicked the Sync button in the GitHub Desktop client
I have never used the GitHub Desktop software, so I can't be certain of what it does when. But this is most likely when:
[something] created a new commit named Merge remote-tracking branch 'origin/master'
because git filter-branch does not do that (well, not unless you write a very complicated filter). What does do that is git merge: you connect to another Git, which still has the original A-B-...-H sequence, your Git sets your origin/master to remember their H, and your Git runs a merge that connects their H to your H':
A--B--C--D--E--F--G--H   <-- origin/master
 \                    \
  B'-C'-D'-E'-F'-G'-H'-I   <-- master
where I is a merge commit that has two parents.
How to fix it
What you'll need to do, now that the only copies of the repository you have are the "dual commits" version, is:
Assuming you have only one master and that you have that checked out now, git reset is the way to go. (You can only use git branch -f on branches that do not have HEAD attached. You can only use git reset on branches that do have HEAD attached.) Find the commit you want to retain, i.e., the filtered one, which will be the first parent of the merge commit, and tell Git to make the name master point to that commit, abandoning the merge. Note that this will lose any uncommitted work; and this also assumes you have not made any commits atop the merge:
$ git reset --hard HEAD~1 # or HEAD^
Now the picture looks more like this:
A--B--C--D--E--F--G--H   <-- origin/master
 \
  B'-C'-D'-E'-F'-G'-H'   <-- master
which is basically the same as what you had after the series of git filter-branch commands: the only real difference is that we're showing the name origin/master as the way your Git finds commit H. (The Git over on origin is using its name master to find commit H in its repository. Your Git is remembering their master as your origin/master.)
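To see why stepping back to the first parent is enough, here is a toy sketch of the graph before and after the reset. The chain is abbreviated to three commits per side, and the dictionaries are stand-ins for illustration. It assumes, as above, that the merge's first parent is the filtered H'.

```python
# Toy graph: each commit maps to its list of parents, first parent first.
parents = {
    "A": [], "B": ["A"], "H": ["B"],     # original chain (abbreviated)
    "B'": ["A"], "H'": ["B'"],           # filtered copies (abbreviated)
    "I": ["H'", "H"],                    # the unwanted merge commit
}
refs = {"master": "I", "origin/master": "H"}

def reachable(start):
    """Every commit you can get to by following parent links from start."""
    todo, seen = [start], set()
    while todo:
        commit = todo.pop()
        if commit not in seen:
            seen.add(commit)
            todo.extend(parents[commit])
    return seen

before = reachable(refs["master"])

# HEAD~1 (or HEAD^) means "the first parent of the current commit".
refs["master"] = parents[refs["master"]][0]

after = reachable(refs["master"])

print(sorted(before))
print(sorted(after))
```

Before the reset, master reaches both chains through the merge, which is the "dual commits" picture. Afterwards it reaches only the filtered chain, while origin/master still remembers the original one.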
If everything now looks good, your remaining job is to convince their Git (the one over at origin) to take your new chain of commits and to move their name master so that it points to commit H', the final corrected copy you made of your original H. To do that, you will use git push. However...
If you just run git push origin master to send them your copies and request that they change their master to point to commit H' instead of commit H, they will say no. Making that change would cause their Git to "forget" or "abandon" commit H, which would lose commit G, which would lose commit F, and so on, all the way back to whichever commit(s), if any, you retained. But you can change your polite request, "Please, if it's OK, set your master", into a forceful command: "Set your master!" You do this with git push --force.
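The reason for the refusal can be sketched as a simple ancestry check: a polite push is accepted only when the name's old commit is still reachable from the new one, so that nothing gets abandoned. `receive_push` is a hypothetical stand-in for the receiving side's decision, and the graph is abbreviated; a real server may apply further rules of its own.

```python
# Toy graph as seen by the receiving Git after your commits arrive.
parents = {
    "A": [], "B": ["A"], "H": ["B"],     # their existing chain (abbreviated)
    "B'": ["A"], "H'": ["B'"],           # your filtered copies (abbreviated)
    "J": ["H"],                          # an ordinary new commit on top of H
}

def is_ancestor(old, new):
    """True if old can be reached from new by following parent links."""
    todo, seen = [new], set()
    while todo:
        commit = todo.pop()
        if commit == old:
            return True
        if commit not in seen:
            seen.add(commit)
            todo.extend(parents[commit])
    return False

def receive_push(server_refs, name, new_tip, force=False):
    """Accept the update if it abandons nothing, or if the sender insists."""
    old_tip = server_refs[name]
    if force or is_ancestor(old_tip, new_tip):
        server_refs[name] = new_tip
        return "accepted"
    return "rejected (non-fast-forward)"

server = {"master": "H"}
print(receive_push(server, "master", "H'"))              # polite request
print(receive_push(server, "master", "H'", force=True))  # forceful command
print(server)
```

An ordinary push of J would pass the check, because H is J's parent. H' fails it because H is not in H' 's history at all, which is exactly what filtering was meant to achieve.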
It's still up to them (GitHub) to decide whether to obey, but if you control the repository over on GitHub, you can obviously set things up so that this is OK. Be aware, however, that anyone else who has a clone of the original repository still has the original A-B-...-H chain of commits. They can merge that chain and politely request that GitHub, or you, take the commits they have that you don't (their merge, plus everything leading up to commit H itself) and merge it back into your master. So even though you deliberately threw away those commits, they can very easily come back to haunt you.
(It's very hard to get rid of something forever, in Git. This is generally considered a feature.)