How to remove big file from repo without losing history

Question

I had a large mp3 file in my project folder when I tried to push to GitHub, and the push was rejected. I have since deleted the file from the folder. I tried to push the code up again after deleting the file, and I am getting this message:

client/build/audio/Celtic Music - Ancient Forest _ 3 hours of celtic fantasy music (192  kbps) (TubeMp3Convert.com).mp3 is 252.71 MB; this exceeds GitHub's file size limit of 100.00 MB

I then tried to run this command:

$ git reset HEAD 'client/build/audio/Celtic Music - Ancient Forest _ 3 hours of celtic fantasy music (192  kbps) (TubeMp3Convert.com).mp3'

I tried to push again afterwards and am still getting the same error as before. I am not sure how else to unstage this file. I don't want to do a revert because I don't want to lose all of the work I have done since then. Please help, I don't know what to do from here. I can't push any work up to GitHub until this issue is resolved.

In the linked question you can look for `Interactive rebase` in the second answer, it's simpler and seems enough for your case. — Frax, May 07 '18 at 08:04
The deleted file still takes room if somewhere in history. You must get rid of all commits containing the file. Fortunately git is putty so you can. — Thorbjørn Ravn Andersen, May 07 '18 at 08:53

Mark Adelsberger · Answer 1 · 2018-05-07T14:30:57.727

Ritwick Dey's answer might work, or might not, depending on what git operations you've performed since first adding the large file. Let's take a step back and look at why one command or another might work.

In git there are three types of storage. When you think about how they're stored physically the boundaries can seem a little squishy at first, but conceptually they are three distinct storage areas:

First, there are working trees. A repo may have zero, one, or more of these. A working tree consists of regular files on your hard drive (or whatever storage medium) and represents your "work in progress". It's where your editors and other tools interact with the files, so it looks just like it would without git, except there may be some extra files that git uses.

When you "deleted the file from the folder", you removed it from the work tree.

Second, there's the index. This is where changes are staged before being committed. It is made up of git objects, plus a file (.git/index) that ties them together. Its exact usage depends on what you're doing, but generally tentative versions of the project - i.e. a version you might commit into history - go here.

When you ran git reset HEAD ..., you were updating the index. I assume you meant to remove the file from the index, and under certain circumstances - specifically, when the file is not present in the currently-checked-out commit - it would have that effect. In your case, I think the file was in the currently-checked-out commit, so this command probably did nothing.
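A minimal sketch of the situation so far, assuming a throwaway repository and a small stand-in file named big.mp3 (both hypothetical, not the asker's actual setup). Once the file has been committed, deleting it from the folder and running git reset HEAD leave the commit untouched:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email "you@example.com"
git config user.name "Example"

echo "pretend this is 250 MB" > big.mp3   # stand-in for the real file
git add big.mp3
git commit -q -m "add song"

rm big.mp3                      # working tree: the file is gone
git reset -q HEAD -- big.mp3    # index: reset to match HEAD, so the entry stays

in_worktree=$(ls)                               # what is on disk
in_index=$(git ls-files)                        # what is staged
in_history=$(git ls-tree -r --name-only HEAD)   # what the commit contains
echo "working tree: [$in_worktree] index: [$in_index] HEAD: [$in_history]"
```

The three variables correspond to the working tree, the index, and the history reachable from HEAD; only the first one changes in this sequence.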

You could more reliably remove a file from the index using

git rm path/to/file

If you want to remove the file from the index while keeping the copy in the working tree, you would say

git rm --cached path/to/file

That is fine; in fact, if you don't want the file creeping back into history, you need it out of the index, and you also need to either put it in .gitignore or remove it from the working tree(s).
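As a rough illustration, assuming a throwaway repository and a stand-in file named big.mp3 (hypothetical names), the combination of git rm --cached and a .gitignore entry could look like this:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email "you@example.com"
git config user.name "Example"

echo "pretend this is 250 MB" > big.mp3   # stand-in for the real file
git add big.mp3                           # staged by accident

git rm -q --cached big.mp3       # out of the index, still on disk
echo "big.mp3" >> .gitignore     # keeps a later `git add .` from staging it again
git add .

staged=$(git ls-files)
echo "staged: $staged"
```

After this, the working copy of big.mp3 is still on disk, but only .gitignore is staged.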

But neither the working tree(s) nor the index are shared with other repos. That is, push only deals with the third storage area, which we haven't talked about yet. Since GitHub has already been rejecting pushes, we know that your file is in this final storage area.

The third storage area contains your git history - the commits and other objects that describe your project throughout time. I've often seen it called (and called it) the "database", but I've come to view that as potentially imprecise (because intuitively "the database" should mean the storage area for git objects, which makes up the bulk of both the history and the index).

So it seems what really defines this third area is the refs. A branch is a ref - that's probably the most important kind. (There are also tags, notes, replacements, remote tracking refs, "backup" refs, etc.) A ref points to an object in the database, and through that object you can reach other objects. The collection of objects "reachable" in this way is a candidate for sharing with other repositories. In the case of a branch, it makes up a history of your project.

By design it's hard to modify this data. It's easy to add new data to it, but not so easy to remove data from it, because git's job is to preserve history.

And the problem is, your huge file is in this third storage area. How hard it is to remove your file depends on how many places the history "knows about" it. To spell out some possible scenarios, I'm going to draw some diagrams, with commits represented by single letters. I will use capital letters for commits that contain the large file in each scenario.

All of the techniques that follow constitute "history rewrites" - they remove commits from history, substituting in new commits. (The new commits have different content than the originals, in that they exclude the huge file.) Because you haven't successfully pushed the affected branch(es) yet, this is fine, but be aware that rewriting any part of the history that's already been shared can cause additional problems.
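Before picking a technique, it can help to check how many places the history "knows about" the file. The following is a sketch under assumptions: the repository is a throwaway one, and big.mp3 and the commit messages are made up for illustration.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email "you@example.com"
git config user.name "Example"

echo "notes" > notes.txt
git add notes.txt
git commit -q -m "a: initial work"
echo "pretend this is 250 MB" > big.mp3   # stand-in for the real file
git add big.mp3
git commit -q -m "C: add song"
git rm -q big.mp3
git commit -q -m "d: delete song"

# Every commit on any ref that added, changed, or deleted the path.
touching=$(git log --all --format='%h %s' -- big.mp3)
echo "$touching"

# The commit that introduced the file, and the branches that contain it.
added=$(git log --all --diff-filter=A --format='%H' -- big.mp3)
git branch --contains "$added"

# Largest blobs reachable from any ref (size in bytes, then path).
git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $2, $3 }' | sort -n -r | head -n 5
```

Note that the later commit deleting the file still shows up in the first listing, and the commit that added it is still part of the branch, which is why deleting the file in a new commit does not make the push succeed.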

If only the most recent commit on a branch "knows about" the file, then you can use

git commit --amend

after removing the file from your index. The simplest case is

a -- b -- C <--(master)

Removing the file from the index and then running git commit --amend would give you

a -- b -- C
      \
       c <--(master)

where c replaces C in the history, and is "the same, but without the huge file". The original commit C is still in your local repository for now, reachable using git reflog. It will eventually be cleaned up, and there are steps you could take to clean it up sooner if you needed to. But at this point it won't interfere with a push now that it's not in the history of your branch.
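A minimal sketch of this simplest case, assuming a throwaway repository in which the most recent commit contains both real work and a stand-in file named big.mp3 (all names are hypothetical):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email "you@example.com"
git config user.name "Example"

echo "notes" > notes.txt
git add notes.txt
git commit -q -m "b: earlier work"
echo "pretend this is 250 MB" > big.mp3   # stand-in for the real file
echo "real work" > work.txt
git add big.mp3 work.txt
git commit -q -m "C: work plus the song"
old=$(git rev-parse HEAD)

git rm -q --cached big.mp3         # out of the index, still on disk
git commit -q --amend --no-edit    # c replaces C as the tip of master
files=$(git ls-tree -r --name-only HEAD | tr '\n' ' ')
echo "files in rewritten commit: $files"

# The original C is no longer in master's history, only in the reflog.
git --no-pager reflog -2
```

The rest of the work in the commit (work.txt here) survives the amend; only the large file is dropped from the rewritten commit.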

Some other variations with multiple branches can be cleaned up this way, but it can get trickier.

a -- b -- C <--(branch_a)
      \
       D <--(branch_b)

In this case, you could fix each branch the same way as fixing master in the previous scenario. (Check out branch_a; remove huge file from index; git commit --amend; check out branch_b; remove huge file from index; git commit --amend.) Then you'd have

a -- b -- c <--(branch_a)
     |\
     | C
     |\
     | d <--(branch_b)
      \
       D

Where it gets trickier is when two or more branches both include the same commit that has the huge file. Since we're supposing only the most recent commit contains the file, this still isn't too bad.

a -- b -- C <--(branch_c)(branch_d)

Now you don't want to do separate commit --amend commands for each branch, because then the branches would no longer point to one shared commit.

git checkout branch_c
git rm --cached path/to/file
git commit --amend
git branch -f branch_d

will make sure both branches end up on the same rewritten commit.

a -- b -- c <--(branch_c)(branch_d)
      \
       C

In any event, commit --amend only works on the most recent commit. If you need to edit "older" history, you can consider git rebase. This still only works for relatively simple scenarios; more on that later. But for example

a -- b -- C -- D -- E <--(master)

In this case you could say

git rebase -i master~3 master

where master~3 is an expression that, in this example, refers to the last commit that doesn't contain the file. If you don't want to figure out an expression that works in your specific case, and if the history isn't too large, you could say

git rebase -i --root master

This will give you extra entries in the TODO list (which you can just ignore, but they are "noise" to work through), and it may make the rebase take longer.

Anyway, whichever command you use, you get a TODO list. Find the entry on the list for commit C and change its first word from pick to edit. When you exit the editor, the rebase will begin, and it will stop at commit C so that you can edit it. Remove the large file from the index (for example with git rm --cached path/to/file), run git commit --amend to record the change, then continue the rebase with git rebase --continue as instructed by the prompts.

a -- b -- C -- D -- E
      \
       c -- d -- e <--(master)

As with commit --amend, this only moves one branch at a time, and you have to take special care if any commits are "shared" by multiple branches.

a -- b -- C -- D <--(branch_a)
           \
            E <--(branch_b)

Because C is "shared", you could do something like

git rebase -i branch_a~2 branch_a
# ... rebase steps as outlined above ...
git rebase --onto branch_a^ branch_b^ branch_b

Again, expressions like branch_a^ will vary in your specific case. In the event that the file was added at C and left untouched in D and E, the second rebase doesn't need to be interactive.

The "multiple branch" case can get complicated very quickly, though. And worse, if the rebase would have to traverse a merge commit, e.g.

a -- b -- C -- D -- M <--(master)
           \       /
            E --- F

doing a rebase without having anything go wrong becomes a lot harder. So in this case, you could fall back to git filter-branch

git filter-branch --index-filter 'git rm --cached --ignore-unmatch path/to/file' -- --all

This will work for just about any git repo, but if the repo is large (lots of commits) it may be slow. It also requires some special clean-up, because it creates "backup refs" that preserve the state before the rewrite. The "safest" way to clean up would be, for each branch my_branch that was rewritten,

git update-ref -d refs/original/refs/heads/my_branch

As a shortcut it can work to just rm -r .git/refs/original, but this is "less safe", and it assumes that you do it before anything might cause the refs to become packed.
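Putting the filter-branch step and the clean-up together, a rehearsal on a throwaway repository with a merge might look like the sketch below. The file name big.mp3, the branch name side, and the commit_file helper are hypothetical; FILTER_BRANCH_SQUELCH_WARNING only silences the warning that newer git versions print, and the final reflog/gc step is an optional extra for dropping the old objects right away.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email "you@example.com"
git config user.name "Example"

# Hypothetical helper: commit one small file named after the commit.
commit_file() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit_file a
echo "pretend this is 250 MB" > big.mp3   # stand-in for the real file
git add big.mp3
git commit -q -m "C: add song"
git branch side
commit_file D
git checkout -q side
commit_file E
git checkout -q master
git merge -q --no-ff -m "M" side

# Rewrite every ref, dropping the file from each commit's index.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
  --index-filter 'git rm -q --cached --ignore-unmatch big.mp3' -- --all

# Delete the backup refs that filter-branch left under refs/original/.
git for-each-ref --format='%(refname)' refs/original/ |
  while read -r ref; do git update-ref -d "$ref"; done

# Optional: expire reflogs and prune so the old objects go away now.
git reflog expire --expire=now --all
git gc -q --prune=now

git --no-pager log --oneline --graph --all
```

The loop over git for-each-ref is just a way of running git update-ref -d once per backup ref without listing the branches by hand.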

If the clean-up seems like too much hassle, or if the repo is too big for this to work in an acceptable time frame, the last option is to use a third-party tool like the BFG Repo Cleaner.
