
I have read many of the linked questions, but I still have the following problem.

This repo had large files in media/1 Juno-Trumpet in earlier commits, so I followed the answer here exactly to delete them:

git clone https://github.com/alexmacrae/SamplerBox.git
git count-objects -vH

Total filesize: 54MB

git filter-branch --tree-filter 'rm -rf "media/1 Juno-Trumpet"' --prune-empty HEAD
git for-each-ref --format="%(refname)" refs/original/ | xargs -n 1 git update-ref -d
echo "media/1 Juno-Trumpet/" >> .gitignore
git add .gitignore
git commit -m 'Removing a folder from git history'
git gc
git count-objects -vH

Total filesize: 54MB

Question: Why hasn't the repo's size changed? How can I make the repo smaller after such a cleanup?

Community
Basj
  • Did you verify if the folder actually disappeared? – Lasse V. Karlsen May 19 '17 at 11:18
  • It wasn't there in the last commits (this folder is old in history), so it's not shown anymore in the files @LasseV.Karlsen. Or is there a precise way to verify this, including in past commits? – Basj May 19 '17 at 11:22
  • Other than checking out a commit that had the folder previously, none that I know of, but to be honest I think the answer by lucanLepus is what you want. – Lasse V. Karlsen May 19 '17 at 11:23

3 Answers


Running git filter-branch actually copies every commit that is filtered. The resulting repository is never any smaller—well, not yet—and is usually larger. If you are lucky or clever, most of the copies re-use most of the original objects, so that the resulting repository is only a little bit bigger than the original.

You might reasonably ask: "Then why should we ever filter a repository?" And in fact, mostly you shouldn't: it's a big headache (but usually just a one-time one, at least) for everyone using the repository, as they all have to switch over to the new filtered repository. But the real answer is that after filtering, you can remove the references to the original (pre-copying) objects, or clone the repository to a new fresh clone.

The original objects' references are saved in refs/original/ and in reflogs (in particular the HEAD reflog will usually have them). See the instructions at the end of the git filter-branch documentation for how to remove those, if you choose (for some crazy reason) not to just re-clone the filtered repository.
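As a minimal sketch of the re-clone route, assuming the filtered repository lives in a local directory: the function name `reclone_filtered` and both paths are hypothetical, not something from the question. Using a `file://` URL makes Git go through its normal transport, so only objects reachable from the refs are copied, rather than the whole object store.

```shell
# Hypothetical helper: clone a local, already-filtered repository through
# the file:// transport so that unreachable objects are left behind.
reclone_filtered() {
    # $1: existing (filtered) repository, $2: destination directory
    src=$(cd "$1" && pwd) || return 1
    git clone --quiet "file://$src" "$2"
}

# Example call (paths are placeholders):
#   reclone_filtered ./SamplerBox ./SamplerBox-clean
```

Note that the new clone only gets the source repository's local branches as branches, so check them out in the source first if you need more than the current one.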

torek
  • Thanks for your answer. What `git` command(s) do you recommend to finish the process? – Basj May 19 '17 at 11:12
  • Best is to `git clone` the thing, using `file://` (but pay attention to branches, since a clone of a clone picks up only the local branches, not any remote-tracking branches). Or use the `git reflog expire` method shown on the man page and in lucanLepus' answer. Note that `--aggressive` was implemented poorly originally, fixed once to be better, and only relatively recently fixed again to be sensible. – torek May 19 '17 at 11:16
  • I still can't manage to make it work: [see here](http://stackoverflow.com/a/44068990/1422096). Any idea @torek? – Basj May 19 '17 at 11:35
  • Best guess at this point is that you have other branch(es) or tags or other references that reach, and therefore retain, the old commits. Note that even a stray `git stash` can do this. Re-cloning discards stashes, which is usually a good idea, especially if you use `--prune-empty` combined with `--all` since that tends to wreck stashes. – torek May 19 '17 at 11:39
  • Thanks. Wow, this seems quite complex for me now @torek. Could you edit the answer or pastebin what you have in mind, in a similar way [to this](http://stackoverflow.com/a/44068990/1422096)? – Basj May 19 '17 at 11:42
  • I updated the `git filter-branch` line. I'd also suggest using the `--index-filter` for this case, as it is significantly faster ... just replace tree filter with index filter and use `git rm --cached --ignore-unmatch` instead of `rm -f`. – torek May 19 '17 at 11:48
  • Thanks @torek. I tried your version (in the community wiki answer), but it still doesn't work: the size is still the same. Could you try it on [this repo](https://github.com/alexmacrae/SamplerBox.git)? Thanks in advance! – Basj May 19 '17 at 13:09
  • Aha, there are several issues, including the fact that these files are named both `media/1 Juno-Trumpet/*` *and* `media/Juno-Trumpet/*`. – torek May 19 '17 at 14:34
  • ... and one more name. The cleanup worked after I removed all of them: see edit. – torek May 19 '17 at 14:49

The old commits that still contain the subdirectory remain part of the repository, even though they are no longer reachable from any branch.

To clean them up, you can run:

git reflog expire --expire=now --all && git gc --prune=now --aggressive

This will, however, empty your reflog. That is necessary because commits referenced by the reflog are not garbage collected.
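To check whether anything still holds on to the old commits, one rough approach is to ask `git rev-list` for commits that touch the path, walking from every ref and every reflog entry. This is only a sketch: the function name `path_in_history` is made up for illustration, and it does not look at stashes that have already been dropped or at objects that are merely unreachable.

```shell
# Hypothetical helper: succeed if any commit reachable from a ref or a
# reflog entry still touches the given path.
path_in_history() {
    # $1: path to look for, e.g. "media/1 Juno-Trumpet"
    [ -n "$(git rev-list --all --reflog --max-count=1 -- "$1")" ]
}

# Example call:
#   path_in_history "media/1 Juno-Trumpet" && echo "still referenced"
```

If this still succeeds after filtering, some ref (for example under refs/original/, a tag, or another branch) or a reflog entry is keeping the old commits alive.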

lucanLepus
  • Since the OP removed the `refs/original/` references, this is the right answer here (hence upvoted), but I would not use `--aggressive` unless your Git is pretty recent (see http://stackoverflow.com/a/28720432/1256452). – torek May 19 '17 at 11:14
  • I accepted, because I think it will work, but it's still not working, see http://stackoverflow.com/a/44068990/1422096. Any idea? – Basj May 19 '17 at 11:34

Just a ready-to-use full version, based on @lucanLepus's accepted answer.

Let's say I am userA, and I want to completely remove the folder media/1 Juno-Trumpet/ from the history of the repo on GitHub (the folder is no longer present in the latest commits, only in much older ones).

NB: this particular repository has original branches master, sfz, and wifi, and tag v1.0. To avoid needing to know this, I use a mirror clone here (which makes a bare repository, which is fine since I will use an index filter). Then, since this is GitHub, I toss all the refs/pull/ refs first.

As it turns out, the files are also named media/Juno-Trumpet/ and media/Juno/, so we need to remove all three path names.

git clone --mirror https://github.com/alexmacrae/SamplerBox.git
cd SamplerBox.git
git for-each-ref --format="git update-ref -d %(refname)" refs/pull | sh
git for-each-ref         # to check that we have only wanted refs left
git count-objects -vH    # size-pack: 54.40 MiB
git filter-branch --index-filter 'git rm -r --cached --ignore-unmatch "media/1 Juno-Trumpet" media/Juno-Trumpet media/Juno' --prune-empty --tag-name-filter cat -- --all

The filter-branch step takes a short while and ends with:

Ref 'refs/heads/master' was rewritten
Ref 'refs/heads/sfz' was rewritten
Ref 'refs/heads/wifi' was rewritten
WARNING: Ref 'refs/tags/v1.0' is unchanged
v1.0 -> v1.0 (7ec3254d08b65fd3ca8a048cef60b5b2c75f7e11 -> 7ec3254d08b65fd3ca8a048cef60b5b2c75f7e11)

(This last line indicates that the one tag in the repository comes before any of the rewritten commits, i.e., we did not need --tag-name-filter cat after all.)

Now we must remove the refs/original/ names. Since this is a fresh clone, there are no reflogs to expire, but we'll do that anyway, and then repack with git gc:

git for-each-ref --format="git update-ref -d %(refname)" refs/original | sh
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git count-objects -vH     # size-pack: 1.41 MiB

I have not done this last step:

git push origin '+refs/*:refs/*'

(and if you're really sure you want all the media files totally gone, you might want to clean out all the pull requests as well, since they will retain them for a while otherwise).
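The local cleanup steps above (dropping refs/original/, expiring reflogs, repacking) can be wrapped in a small function that also reports the pack size before and after. This is a sketch under assumptions: the names `pack_size_kib` and `shrink_after_filter` are hypothetical, and `--aggressive` is deliberately left out, so the final size may differ somewhat from the numbers shown above.

```shell
# Hypothetical helper: print the size of all packs in KiB, as reported by
# "git count-objects -v" (without -H the unit is KiB).
pack_size_kib() {
    git count-objects -v | awk '$1 == "size-pack:" { print $2 }'
}

# Hypothetical helper: remove filter-branch backup refs, expire all
# reflogs, repack, and report the pack size before and after.
shrink_after_filter() {
    before=$(pack_size_kib)
    git for-each-ref --format='%(refname)' refs/original/ |
        while read -r ref; do
            git update-ref -d "$ref"
        done
    git reflog expire --expire=now --all
    git gc --quiet --prune=now
    after=$(pack_size_kib)
    echo "size-pack: ${before} KiB -> ${after} KiB"
}
```

Run `shrink_after_filter` from inside the filtered repository. If the two numbers are the same, something other than refs/original/ and the reflogs still references the old commits.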


Incidentally, I found the files under the three names using:

git cat-file --batch-all-objects --batch-check | sort -k3 -rn | head

to find relatively large files, followed by:

git rev-list --all | while read -r ref; do
   git ls-tree -r "$ref" | grep 477145c7d0190f4e0aeea0f7bfb9accbf2c1ba48
done | sort -u

(477145c7d0190f4e0aeea0f7bfb9accbf2c1ba48 is one of the big .wav files. I did not check to see whether all the files removed are .wav files and whether any other .wav files remain.)
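The two searches can also be combined into one pass, since `git rev-list --objects` prints each object together with a path it was found under and `git cat-file --batch-check` can echo that path back via `%(rest)`. This is a sketch rather than what I actually ran; the function name `largest_blobs` is hypothetical, and unlike `--batch-all-objects` it only sees objects reachable from some ref.

```shell
# Hypothetical helper: list the largest blobs reachable from any ref,
# one per line as "blob <hash> <size in bytes> <path>".
largest_blobs() {
    # $1: how many entries to print (default 10)
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectname) %(objectsize) %(rest)' |
        awk '$1 == "blob"' |
        sort -k3,3nr |
        head -n "${1:-10}"
}
```

Each blob is listed once, under the first path the walk encountered, so a file stored under several names (as happened here) still needs the `git ls-tree` check to find all of them.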

torek
Basj