
I was trying to remove some large binaries from a repo to reduce its cloning size. After researching the topic I stumbled upon the following script:

#!/bin/bash

# this script lists blobs of 1 MiB or more that are not part of the HEAD tree, sorted from smallest to largest
# you may need `brew install coreutils --with-default-names`

git rev-list --objects --all \
| git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' \
| sed -n 's/^blob //p' \
| grep -vF "$(git ls-tree -r HEAD | awk '{print $3}')" \
| awk '$2 >= 2^20' \
| sort --numeric-sort --key=2 \
| gcut -c 1-12,41- \
| gnumfmt --field=2 --to=iec-i --suffix=B --padding=7 --round=nearest

Taken from https://stackoverflow.com/a/42544963/5470921 with some tweaks.

The output is something like:

0d99bb931299   44MiB other/assets.sketch
2ba44098e28f   44MiB other/assets.sketch
bd1741ddce0d   45MiB other/assets.sketch

The next step would be to remove the unwanted files. For that I used the following script:

# to remove a file (displayed path/to/file in the output)
git filter-branch --index-filter 'git rm --cached --ignore-unmatch path/to/file' --tag-name-filter cat HEAD

Taken from https://stackoverflow.com/a/46615578/5470921.

So far so good. Next I ran the following command foolishly on the master branch without making any backups:

git filter-branch --index-filter 'git rm --cached --ignore-unmatch other/assets.sketch' --tag-name-filter cat HEAD

This created a new commit named Merge remote-tracking branch 'origin/master'. After that I clicked the Sync button in the GitHub Desktop client, pushing the changes to the repo.

When I ran the first script again, I saw that the files were still there; they had not been removed. After further investigation, I noticed that the repo now contains duplicated commits.


I spent a day trying to restore the repo to its old state without any luck. While doing so I also deleted the local repo from my device, which means I no longer have the git reflog history, nor do I have access to something like refs/original/refs/heads/master.

How can I restore the repo to its original state? Is that still possible?

user5470921
  • Do you see the original history anywhere on the remote side? Or did you overwrite it with a force-push? – merlin2011 Aug 01 '18 at 17:06
  • It's still there, and technically I can still develop without any issues, but the commits are now doubled (every commit from before I ran those scripts now appears twice). I would like to remove the duplicates and keep a single path; right now there are two. – user5470921 Aug 01 '18 at 17:43
  • Sorry, I meant the original history as in a chain of commits without the duplicates. – merlin2011 Aug 01 '18 at 19:14
  • Nope, only duplicates – user5470921 Aug 01 '18 at 20:40

2 Answers

Note: if this is TL;DR, skip down to the last section, How to fix it (but it will make more sense if you read the preceding ones).


What you need to understand is that git filter-branch copies commits. That is, it takes each existing commit, applies some filter or set of filters to it, and makes a new commit from the result. That's how you ended up with two sets of commits. This is necessary because it is in no one's power, especially not Git's, to change anything about any existing commit.

The filtered commits are a new history, largely independent of the original history. (Some details depend on the precise filters and commit inputs.) It's worth keeping in mind that a Git repository does not contain files, precisely; it contains commits, and the commits are the history. Each commit contains a snapshot—so in that sense, the repository does contain files, but they're one step below the overview, which is on a commit-by-commit basis.

Every commit has a unique hash ID. These are the big long ugly names you see in git log output: commit b7bd9486b055c3f967a870311e704e3bb0654e4f and so on. This unique ID serves for Git to find the commit object, and hence the files; but the hash ID itself is simply a cryptographic checksum of the full contents of the commit. Each commit lists the hash ID of its parent commit (or commits) as well, and the parent hash (and the snapshot hash) is part of the commit's contents. This is why Git can't change anything about a commit: if you take the contents, and change anything, even a single bit, and make a new commit out of that, you get a new, different hash ID, which is a new, different commit.
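
A minimal sketch of this point, assuming a throwaway repository in a temporary directory (the identity, file name and commit messages are made up for illustration): re-hashing the raw bytes of a commit reproduces its ID, and changing nothing but the message gives a different ID.

```shell
#!/bin/sh
# Sketch: a commit's hash ID is a checksum of the commit's own contents.
# Everything happens in a throwaway repository; all names are illustrative.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Example"
git config user.email "example@example.com"

echo hello > file.txt
git add file.txt
git commit -q -m "first"

# The raw commit object: snapshot (tree) hash, author, committer, message.
git cat-file -p HEAD

# Hashing those same bytes as a commit object gives back the same ID.
actual=$(git rev-parse HEAD)
rehashed=$(git cat-file commit HEAD | git hash-object -t commit --stdin)
echo "actual:   $actual"
echo "rehashed: $rehashed"

# Changing anything (here only the message) produces a different ID.
git commit -q --amend -m "first, reworded"
amended=$(git rev-parse HEAD)
echo "amended:  $amended"
```

The amended commit is not an edited version of the first one; it is a separate object, and the first one still exists in the repository until it is garbage collected.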

Since each commit contains the ID of its parent(s), this means that if we somehow tell Git—by hash ID—which commit is the newest, it can pull that commit out and use it to find the second-newest commit:

...  <--second-newest  <--newest

The second-newest points back to the third-newest, and so on. If the chain is totally linear (if there are no branches and merges), we end up with a very simple picture:

A--B--C--D--E--F--G--H   <-- master

Here, the name master remembers the actual hash ID of the latest commit, which we'll call H instead of coming up with its actual hash ID. Commit H remembers the hash ID of the previous commit G, which remembers the ID of F, and so on. Commit A is the very first commit, so it just has no parent at all, which lets the action stop.

Branching is just a matter of picking out some commit in the chain and creating a child that's not at the tip of master. For instance, suppose we leave master where it is, pointing to H, and make a new commit I on a new branch we call dev:

...--H   <-- master
      \
       I   <-- dev (HEAD)

If we then git checkout master and make a new commit J we get:

...--H--J   <-- master (HEAD)
      \
       I   <-- dev

Note that the act of putting new commits into the repository requires that we have Git change one of the names. We put new commit I in, and made Git change the name dev—which used to point to H along with master—so that dev points to (contains the hash ID of) I. Then we put new commit J in, making Git update master to point to J instead of H.

(The special name HEAD is simply attached to whichever branch-name is the one we want Git to update when we run git commit.)
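
The H, I, J picture can be rebuilt in a throwaway repository. This is only a sketch: the commits are empty and their messages are the letters used above, which is an assumption made purely for illustration.

```shell
#!/bin/sh
# Sketch: branch names are just pointers that move when commits are added.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
# Use the name master regardless of the local init.defaultBranch setting.
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

git commit -q --allow-empty -m H
h=$(git rev-parse master)

# New branch dev starts at H; committing moves dev, not master.
git checkout -q -b dev
git commit -q --allow-empty -m I
i=$(git rev-parse dev)

# Back on master, committing moves master, not dev.
git checkout -q master
git commit -q --allow-empty -m J
j=$(git rev-parse master)

# Both I and J record H as their parent.
echo "H = $h"
echo "parent of dev:    $(git rev-parse dev^)"
echo "parent of master: $(git rev-parse master^)"
git log --graph --oneline --all
```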

Filter-branch

The filter-branch command iterates over some set of commits—often all commits, depending on how you use it; you ran it over HEAD which means the current branch, but perhaps you have only one branch name, master—and copies them. It starts by listing, in the appropriate order, every commit hash ID that is to have the copying process applied. If all you have is a linear chain (like A-B-...-H), this is those IDs in that order. Let's assume this for simplicity.

Then, for each such commit, filter-branch:

  • extracts the commit into a temporary area (or pretends to, for speed);
  • applies your filter(s);
  • uses git commit or equivalent (depending on filters, again) to make a new commit that preserves every unchanged bit, but keeps whatever changes are made.

If the new commit is 100% identical, bit-for-bit, to the original, the new hash ID is the original hash ID. Let's say that happens for A itself: there are no changes to make, so Git re-uses the ID. The repo contents now look like this:

A--B--C--D--E--F--G--H   <-- [original master]
 .
  ...<-- [new master, being built]

Then Git moves on to the next commit hash ID in the list, which is B. Let's say that the filter makes some change this time (removing a big file), so that the new commit has a new, different hash ID, which we'll call B':

A--B--C--D--E--F--G--H   <-- [original master]
 \
  B'  <-- [new master, being built]

Filter-branch moves on to C. Even if it has no change to make to C's snapshot, filter-branch is forced to make one change now: it must make a new C' whose parent is B', because something happened to B. So now we get C':

A--B--C--D--E--F--G--H   <-- [original master]
 \
  B'-C'  <-- [new master, being built]

This repeats for all the remaining commits. All of them get new hash IDs, maybe in part because something in the snapshot changed, but certainly because their parent hash also changed. At the end, git filter-branch rewrites the name master itself to point to the final copied commit, H':

A--B--C--D--E--F--G--H   <-- [original master, now in refs/original/]
 \
  B'-C'-D'-E'-F'-G'-H'  <-- master

All of this happens purely in your local repository—no other Git, no clone of the original repository, knows that any of this has occurred.

(Note that if you do multiple filter-branch operations, each one copies the chain of commits. Some of the intermediate results may be of no real value. Git will eventually garbage collect the unused and unreachable commits, typically after about a month. Since filter-branch copies things, you will see space usage increase a bit, rather than decrease, until the eventual garbage collection and subsequent rebuilding of pack files.)
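
Here is a small sketch of the copying, assuming a throwaway repository with three commits A, B and C, where B adds a file called big.bin (all of these names are made up). The FILTER_BRANCH_SQUELCH_WARNING variable only silences the warning that newer Git versions print before running filter-branch.

```shell
#!/bin/sh
# Sketch: filter-branch copies commits; unchanged ones keep their hash ID.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

echo a > a.txt
git add a.txt
git commit -q -m A
echo big > big.bin
git add big.bin
git commit -q -m B
echo c > c.txt
git add c.txt
git commit -q -m C

a_old=$(git rev-parse master~2)
b_old=$(git rev-parse master~1)
c_old=$(git rev-parse master)

# Same filter as in the question, applied to the made-up file big.bin.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
    --index-filter 'git rm -q --cached --ignore-unmatch big.bin' HEAD >/dev/null

a_new=$(git rev-parse master~2)
b_new=$(git rev-parse master~1)
c_new=$(git rev-parse master)
saved=$(git rev-parse refs/original/refs/heads/master)

echo "A: $a_old -> $a_new"   # identical: nothing to change in A
echo "B: $b_old -> $b_new"   # different: big.bin was removed
echo "C: $c_old -> $c_new"   # different: new snapshot and new parent
echo "original tip saved under refs/original/: $saved"
```

The original B and C are still in the repository after this runs; they are simply no longer what master points to.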

Where things went wrong

Where things went wrong is definitely not where you think; I think the problem most likely occurred here:

After that I clicked the Sync button in the GitHub Desktop client

I have never used the GitHub Desktop software, so I can't be certain of what it does when. But this is most likely when:

[something] created a new commit named Merge remote-tracking branch 'origin/master'

because git filter-branch does not do that—well, not unless you write a very complicated filter. What does do that is git merge: you connect to another Git, which still has the original A-B-...-H sequence, your Git sets your origin/master to remember their H, and your Git runs a merge that connects their H to your H':

A--B--C--D--E--F--G--H   <-- origin/master
 \                    \
  B'-C'-D'-E'-F'-G'-H'-I  <-- master

where I is a merge commit that has two parents.
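
To make this concrete, here is a sketch that reproduces a doubled history in a throwaway repository and then inspects it. Two things are stand-ins and not what actually ran in the question: a local branch named origin-master plays the role of origin/master, and git commit --amend plays the role of the filter. The file names are made up.

```shell
#!/bin/sh
# Sketch: reproduce a "doubled" history and look at the merge that joins it.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

echo a > a.txt
git add a.txt
git commit -q -m A
echo b > b.txt
echo big > big.bin
git add b.txt big.bin
git commit -q -m B

# Remember the original tip, as the other Git would.
git branch origin-master
original=$(git rev-parse origin-master)

# Stand-in for the filter: rewrite B without big.bin, which yields a copy B'.
git rm -q big.bin
git commit -q --amend --no-edit
copied=$(git rev-parse master)

# What the sync most likely did: merge the original chain back in.
git merge -q -m "Merge remote-tracking branch 'origin/master'" origin-master

# A merge commit lists two parents: first the copy, then the original.
git rev-list --parents -n 1 HEAD
git log --graph --oneline

# The merge also brings the removed file back into the snapshot.
git ls-tree -r --name-only HEAD

# Subjects that appear more than once hint at duplicated commits.
duplicates=$(git log --format=%s | sort | uniq -d)
echo "duplicated subjects: $duplicates"
```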

How to fix it

What you'll need to do, now that the only copies of the repository you have are the "dual commits" version, is:

  • Start with that dual version.

  • Use git branch -f or git reset --hard to move your branch name(s) to point to some commit before the merge that joins up the two separate histories.

Assuming you have only one master and that you have that checked out now, git reset is the way to go. (You can only use git branch -f on branches that do not have HEAD attached. You can only use git reset on branches that do have HEAD attached.) Find the commit you want to retain, i.e., the filtered one, which will be the first parent of the merge commit, and tell Git to make the name master point to that commit, abandoning the merge. Note that this will lose any unsaved work; and this also assumes you have not made any commits atop the merge:

$ git reset --hard HEAD~1   # or HEAD^

Now the picture looks more like this:

A--B--C--D--E--F--G--H   <-- origin/master
 \
  B'-C'-D'-E'-F'-G'-H'  <-- master

which is basically the same as what you had after the series of git filter-branch commands: the only real difference is that we're showing the name origin/master as the way your Git finds commit H. (The Git over on origin is using its name master to find commit H in its repository. Your Git is remembering their master as your origin/master.)
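
A sketch of the reset on a throwaway repository, with one extra assumption beyond the simple case: a commit K was made on top of the merge, so it has to be carried over with git cherry-pick. The tag name backup-before-reset, the stand-in branch origin-master and the file names are all made up, and git commit --amend stands in for the filter.

```shell
#!/bin/sh
# Sketch: drop the merge, keep the filtered chain, re-apply later commits.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"

# Build the doubled history: A, B (original), B' (copy), merge, then K.
echo a > a.txt; git add a.txt; git commit -q -m A
echo b > b.txt; echo big > big.bin; git add b.txt big.bin; git commit -q -m B
git branch origin-master
git rm -q big.bin
git commit -q --amend --no-edit
copied=$(git rev-parse master)
git merge -q -m "Merge remote-tracking branch 'origin/master'" origin-master
echo k > k.txt; git add k.txt; git commit -q -m K

# Keep a name on the current tip so the later commits stay easy to find.
git tag backup-before-reset

# Locate the merge and move master to its first parent (the filtered copy).
merge=$(git rev-list --merges -n 1 master)
git reset -q --hard "$merge^1"

# Re-apply everything that was committed on top of the merge.
git cherry-pick "$merge..backup-before-reset" >/dev/null

git log --oneline master
git ls-tree -r --name-only master
```

If nothing was committed after the merge, the cherry-pick step is unnecessary and the reset alone is enough.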

If everything now looks good, your remaining job is to convince their Git—the one over at origin—to take your new chain of commits and to move their name master so that it points to commit H', the final corrected copy you made of your original H. To do that, you will use git push. However...

If you just run git push origin master to send them your copies and request that they change their master to point to commit H' instead of commit H, they will say no. Making that change would cause their Git to "forget" or "abandon" commit H, which would lose commit G, which would lose commit F, and so on, all the way back to whichever commit(s), if any, you retained. But you can change your polite request, "Please, if it's OK, set your master", into a forceful command: "Set your master!" You do this with git push --force.

It's still up to them (GitHub) to decide whether to obey, but if you control the repository over on GitHub, you can obviously set things up so that this is OK. Be aware, however, that anyone else who has a clone of the original repository still has the original A-B-...-H chain of commits. They can merge that chain and politely request that GitHub, or you, take the commits they have that you don't—their merge, plus everything leading up to commit H itself—and merge it back into your master. So even though you deliberately threw away those commits, they can very easily come back to haunt you.
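
The difference between the polite request and the forceful command can be tried out locally. This sketch assumes a bare repository in a temporary directory standing in for GitHub (the directory names origin.git and local are made up), and uses git commit --amend as a stand-in for the rewrite.

```shell
#!/bin/sh
# Sketch: a plain push of rewritten history is refused, a forced one is not.
set -e
work=$(mktemp -d)
git init -q --bare "$work/origin.git"   # stand-in for the repository on GitHub
git init -q "$work/local"
cd "$work/local"
git symbolic-ref HEAD refs/heads/master
git config user.name "Example"
git config user.email "example@example.com"
git remote add origin "$work/origin.git"

echo one > file.txt
git add file.txt
git commit -q -m H
git push -q origin master

# Rewrite H into a different commit H' (stand-in for the filtered copy).
echo two > file.txt
git commit -q -a --amend -m "H'"

# Polite request: refused, because it would make origin abandon H.
if git push -q origin master 2>/dev/null; then
    plain=accepted
else
    plain=rejected
fi
echo "plain push: $plain"

# Forceful command: origin moves its master to H'.
git push -q --force origin master
echo "origin master: $(git --git-dir="$work/origin.git" rev-parse master)"
echo "local master:  $(git rev-parse master)"
```

Git also has git push --force-with-lease, which refuses to overwrite the remote branch if it has moved since your last fetch; that can be a safer choice when other people push to the same repository.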

(It's very hard to get rid of something forever, in Git. This is generally considered a feature.)

torek
  • Thank you so much for putting the time and effort into explaining things, I really appreciate this. I will go through what you wrote again, attempt a fix and get back to you with the results. – user5470921 Aug 01 '18 at 20:42
  • Thank you again, indeed `git reset` does the trick, however as you pointed out it discarded any commits that came up after branching (where `git filter-branch` effect begins), is it possible to keep the latter commits? cherry picking maybe? – user5470921 Aug 02 '18 at 15:08
  • Also, just to double check: should I reset to the commit exactly before the `Merge remote-tracking branch 'origin/master'`, or to where the branching starts, which is much earlier in the tree? – user5470921 Aug 02 '18 at 15:20
  • You'll want to go back to just-before-the-merge, so as to retain the filtering (those are the copied commits), and then yes, you'll need to cherry-pick the remaining commits to copy them as well. Set a branch or tag name to point to the commit to which `master` points now, so that you and Git can continue to find the subsequent commits after you use `git reset` to move `master` to the just-before-merge commit. – torek Aug 02 '18 at 15:34
  • A soft reset brings back the tree to a straight path and keeps the latter changes, which is the desired result, however it mushes them into a single commit, it would be great if we can in some way have the latter commits appended to original branch. Another question, if I do a hard reset and `git push -f`, will that reduce the number of commits? – user5470921 Aug 02 '18 at 15:34
  • Note that a `--soft` reset retains the index and work-tree content. This means you can make a *new* commit from the index; that's why you see the single squashed commit. When you run `git push -f origin master` you send any new commits you have that they (origin's Git) don't (maybe some, maybe none, depending on what you and they have), and then you ask them to set *their* `master` to point to the same commit as your `master`. If that causes them to drop some commit(s), that's why you need the `-f`. – torek Aug 02 '18 at 15:36
  • Hey @torek, I managed to get all things back on track, I had some conflicts here and there but fixed them along the way. One last detail, if we can get this it would be perfect. Is there any way of keeping the original dates of the commits while cherry picking? Thanks for all your help. – user5470921 Aug 02 '18 at 22:05
  • Note that I cherry picked with `-x` flag. – user5470921 Aug 02 '18 at 22:17
  • I think cherry-pick should preserve the author-date by default, but if not, you can provide environment variables (`GIT_AUTHOR_DATE` and `GIT_COMMITTER_DATE`) to force the author and/or committer dates. This is a little bit tricky since you have to extract the values from the existing commits. The filter-branch script does this in order to copy commits, so that's where to look. – torek Aug 02 '18 at 22:32
  • It does preserve it, I am interested in having the committer date be the same as the author, I'm looking into applying GIT_COMMITTER_DATE onto a range of cherry picked commits now – user5470921 Aug 02 '18 at 22:52
  • I've added an answer below detailing the exact commands needed to fix this, thanks again for your help. – user5470921 Aug 03 '18 at 16:44

Based on @torek's answer, here are the steps I will be taking to fix this issue. I will execute them later today and update this answer with the results (or edits, if any) for reference.

# make sure the current branch is the one with the duplicates, in this case it's `master`
git checkout master

# double check you are on `master`
git status

# create a new branch from `master`
git checkout -b fix-duplicates

# double check you are on `fix-duplicates`
git status

# .. -A-B- .. -C-D-E- .. -F
#      \        /
#       B- .. -C

# A = aaaaaaaa, branching starts
# B = bbbbbbbb, branching takes effect (one commit after where it started in A)
# C = cccccccc, branching ends (exclude D, the merge commit that caused the duplicates)
# E = eeeeeeee, one commit after the merge commit
# F = ffffffff, most recent commit

# move back to the point where the branching started
git reset --hard A

# 1) to cherry pick with new commit dates
# cherry pick all commits from where the branching takes effect (B) up to where it ends (C)
# exclude the merge commit at the top (the one that caused the duplication)
# note: a range X..Y excludes X itself, so B^..C is used to include B
git cherry-pick B^..C

# cherry pick all commits after the merge up to the most recent commit (E^..F includes E)
git cherry-pick E^..F

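# 2) if you want to keep the original dates, run the following loops instead
# rev-list prints newest first, so --reverse is needed to apply the commits oldest first
# a range X..Y excludes X itself, so B^..C and E^..F are used to include B and E
for commit in $(git rev-list --reverse B^..C)
do
    # reuse the author date (%at, a Unix timestamp) as the committer date
    export GIT_COMMITTER_DATE="$(git log -1 --format='%at' "$commit")"
    git cherry-pick "$commit"
done

for commit in $(git rev-list --reverse E^..F)
do
    export GIT_COMMITTER_DATE="$(git log -1 --format='%at' "$commit")"
    git cherry-pick "$commit"
done

# stop forcing the committer date for later commands
unset GIT_COMMITTER_DATE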

# make sure the fix is good by comparing the two branches, they should be identical
git diff master..fix-duplicates

# make the fixed branch the new `master`
git checkout master
git reset --hard fix-duplicates

# review what you did (optional)
git reflog

# forcefully push the changes (make sure everything is right before this step!)
git push -f origin master
user5470921