Ritwick Dey's answer might work, or might not, depending on what git operations you've performed since first adding the large file. Let's take a step back and look at why one command or another might work.
In git there are three types of storage. When you think about how they're stored physically the boundaries can seem a little squishy at first, but conceptually they are three distinct storage areas:
First there are working trees. A repo may have zero, one, or several of these. A working tree consists of regular files on your hard drive (or whatever storage medium) and represents your "work in progress". It's where your editors and other tools interact with the files, so it looks just like it would without git, except there may be some extra files that git uses.
When you "deleted the file from the folder", you removed it from the work tree.
Second, there's the index. This is where changes are staged before being committed. It is made up of git objects, plus a file (.git/index) that ties them together. Its exact usage depends on what you're doing, but generally tentative versions of the project - i.e. a version you might commit into history - go here.
When you ran git reset HEAD ..., you were updating the index. I assume you meant to remove the file from the index, and under certain circumstances - specifically, when the file is not present in the currently-checked-out commit - it would have that effect. In your case, I think the file was in the currently-checked-out commit, so this command probably did nothing.
You could more reliably remove a file from the index using
git rm path/to/file
If you want to remove the file from the index while keeping the copy in the working tree, you would say
git rm --cached path/to/file
Which is fine; in fact, if you don't want the file creeping back into history, you need it out of the index, and you either need to put it in .gitignore or remove it from the working tree(s) as well.
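As a rough illustration of the index versus the working tree, here is a minimal sketch in a throwaway repository. The temp directory, the identity settings, and the file names small.txt and big.bin are all made up for the demo; big.bin stands in for the huge file.

```shell
#!/bin/sh
# Hypothetical demo repo; big.bin is a placeholder for the huge file.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

echo "source code" > small.txt
echo "pretend this is huge" > big.bin
git add small.txt big.bin        # both files are now in the index

# Drop big.bin from the index but leave the working-tree copy alone.
git rm -q --cached big.bin
echo "big.bin" > .gitignore      # keep it from being staged again

git add .                        # stages .gitignore, skips the ignored big.bin
staged=$(git ls-files)           # paths currently held in the index
echo "staged paths:"
echo "$staged"
ls big.bin                       # the file is still on disk
```

Nothing here touches history yet; it only shows that the index and the working tree can disagree about the file.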
But neither the working tree(s) nor the index are shared with other repos. That is, push only deals with the third storage area, which we haven't talked about yet. Since GitHub has already been rejecting pushes, we know that your file is in this final storage area.
The third storage area contains your git history - the commits and other objects that describe your project throughout time. I've often seen it called (and called it) the "database", but I've come to view that as potentially imprecise (because intuitively "the database" should mean the storage area for git objects, which makes up the bulk of both the history and the index).
So it seems what really defines this third area is the refs. A branch is a ref - that's probably the most important kind. (There are also tags, notes, replacements, remote tracking refs, "backup" refs, etc.) A ref points to an object in the database, and through that object you can reach other objects. The collection of objects "reachable" in this way is a candidate for sharing with other repositories. In the case of a branch, it makes up a history of your project.
By design it's hard to modify this data. It's easy to add new data to it, but not so easy to remove data from it, because git's job is to preserve history.
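To see why deleting the file in a later commit doesn't help, here is a small sketch in a scratch repository (the directory, commit messages, and the name big.bin are invented for the example). After the delete commit the file is gone from the working tree and the index, yet the history reachable from the refs still carries it.

```shell
#!/bin/sh
# Hypothetical demo repo; big.bin is a placeholder for the huge file.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Demo"
git config user.email "demo@example.com"

echo "a" > a.txt
git add a.txt
git commit -q -m "a"

echo "pretend this is huge" > big.bin
git add big.bin
git commit -q -m "B: add big.bin"

git rm -q big.bin                # removes it from index and working tree
git commit -q -m "c: delete big.bin"

staged=$(git ls-files -- big.bin)                  # empty: not in the index
history=$(git log --all --format=%s -- big.bin)    # commits touching the path
echo "commits that know about big.bin:"
echo "$history"

# The blob itself is still reachable from the refs, so a push would send it.
git rev-list --objects --all | grep "big.bin"
```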
And the problem is, your huge file is in this third storage area. How hard it is to remove your file depends on how many places the history "knows about" it. To spell out some possible scenarios, I'm going to draw some diagrams, with commits represented by single letters. I will use capital letters for commits that contain the large file in each scenario.
All of the techniques that follow constitute "history rewrites" - they remove commits from history, substituting in new commits. (The new commits have different content than the originals, in that they exclude the huge file.) Because you haven't successfully pushed the affected branch(es) yet, this is fine, but be aware that rewriting any part of the history that's already been shared can cause additional problems.
If only the most recent commit on a branch "knows about" the file, then you can use
git commit --amend
after removing the file from your index. The simplest case is
a -- b -- C <--(master)
Removing the file from the index and then running git commit --amend would give you
a -- b -- C
      \
       c <--(master)
where c replaces C in the history, and is "the same, but without the huge file". The original commit C is still in your local repository for now, reachable using git reflog. It will eventually be cleaned up, and there are steps you could take to clean it up sooner if you needed to. But at this point it won't interfere with a push now that it's not in the history of your branch.
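The simplest case can be sketched end to end like this. It is only an illustration under assumed names: a scratch repo, a branch called master, and big.bin as the stand-in for the huge file.

```shell
#!/bin/sh
# Hypothetical demo repo reproducing a -- b -- C, where C adds big.bin.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master   # name the unborn branch "master"
git config user.name "Demo"
git config user.email "demo@example.com"

echo "a" > a.txt
git add a.txt
git commit -q -m "a"
echo "b" > b.txt
git add b.txt
git commit -q -m "b"
echo "c" > c.txt
echo "pretend this is huge" > big.bin
git add c.txt big.bin
git commit -q -m "C"
old=$(git rev-parse HEAD)        # the original commit C

git rm -q --cached big.bin       # take the file out of the index
git commit -q --amend --no-edit  # replace C with c
new=$(git rev-parse HEAD)

echo "files in the rewritten commit:"
git ls-tree -r --name-only HEAD
echo "the old commit is still around locally:"
git cat-file -t "$old"
git reflog
```

The amended commit keeps the same parent as the original; only the tree (and therefore the commit ID) changes.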
Some other variations with multiple branches can be cleaned up this way, but it can get trickier.
a -- b -- C <--(branch_a)
      \
       D <--(branch_b)
In this case, you could fix each branch the same way as fixing master in the previous scenario. (Check out branch_a; remove huge file from index; git commit --amend; check out branch_b; remove huge file from index; git commit --amend.) Then you'd have
a -- b -- c <--(branch_a)
     |\
     | C
     |\
     | d <--(branch_b)
      \
       D
Where it gets trickier is when two or more branches both include the same commit that has the huge file. Since we're supposing only the most recent commit contains the file, this still isn't too bad.
a -- b -- C <--(branch_c)(branch_d)
Now you don't want to do separate commit --amend commands for each branch, because then the branches would no longer point to one shared commit.
git checkout branch_c
git rm --cached path/to/file
git commit --amend
git branch -f branch_d
will make sure both branches end up on the same rewritten commit.
a -- b -- c <--(branch_c)(branch_d)
      \
       C
In any event, commit --amend only works on the most recent commit. If you need to edit "older" history, you can consider git rebase. This still only works for relatively simple scenarios; more on that later. But for example
a -- b -- C -- D -- E <--(master)
In this case you could say
git rebase -i master~3 master
where master~3 is an expression that, in this example, refers to the last commit that doesn't contain the file. If you don't want to figure out an expression that works in your specific case, and if the history isn't too large, you could say
git rebase -i --root master
This will give you extra entries in the TODO list (which you can just ignore, but they are "noise" to work through), and it may make the rebase take longer.
Anyway, whichever command you use, you get a TODO list. Find the entry on the list for commit C and change its first word from pick to edit. When you exit the editor, the rebase will begin, and it will stop once it reaches that commit (commit C) and prompt you to amend it. Remove the large file from the index and the working tree (git rm path/to/file), run git commit --amend, then continue the rebase as instructed by the prompts (git rebase --continue).
a -- b -- C -- D -- E
      \
       c -- d -- e <--(master)
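The interactive steps can be approximated in a script, which may help when trying this out on a scratch copy first. This sketch assumes GNU sed and a git whose TODO lines start with the word pick; the sed call is only a stand-in for editing the TODO list by hand, and the repo, file names, and big.bin are invented for the demo.

```shell
#!/bin/sh
# Hypothetical demo repo reproducing a -- b -- C -- D -- E, where C adds big.bin.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master   # name the unborn branch "master"
git config user.name "Demo"
git config user.email "demo@example.com"

# Helper for the demo: create <name>.txt and commit everything staged.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit a
commit b
echo "pretend this is huge" > big.bin
git add big.bin
commit C                         # C adds big.bin alongside C.txt
commit D
commit E

# Stand-in for hand-editing: change the first TODO entry (commit C) to "edit".
GIT_SEQUENCE_EDITOR="sed -i -e '1s/^pick/edit/'" git rebase -i master~3 master

# The rebase is now paused at C: drop the file, amend, and carry on.
git rm -q big.bin
git commit -q --amend --no-edit
GIT_EDITOR=true git rebase --continue

git log --format=%s master
remaining=$(git rev-list --objects master | grep -c "big.bin")
echo "objects named big.bin still reachable from master: $remaining"
```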
As with commit --amend, this only moves one branch at a time, and you have to take special care if any commits are "shared" by multiple branches.
a -- b -- C -- D <--(branch_a)
           \
            E <--(branch_b)
Because C is "shared", you could do something like
git rebase -i branch_a~2 branch_a
# ... rebase steps as outlined above ...
git rebase --onto branch_a^ branch_b^ branch_b
Again expressions like branch_a^ will vary in your specific case. In the event that the file was added at C and left untouched in D and E, the second rebase doesn't need to be interactive.
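A scripted sketch of the two-branch rebase follows, under the same assumptions as any throwaway demo: a temp repo, GNU sed standing in for hand-editing the TODO list, and big.bin as the placeholder for the huge file, added at C and untouched afterwards.

```shell
#!/bin/sh
# Hypothetical demo repo: C (with big.bin) is shared by branch_a and branch_b.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/branch_a   # name the unborn branch "branch_a"
git config user.name "Demo"
git config user.email "demo@example.com"

# Helper for the demo: create <name>.txt and commit everything staged.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit a
commit b
echo "pretend this is huge" > big.bin
git add big.bin
commit C
git branch branch_b              # branch_b starts at C
commit D                         # branch_a: a -- b -- C -- D
git checkout -q branch_b
commit E                         # branch_b: a -- b -- C -- E

# First rebase: rewrite C (and D on top of it) on branch_a.
GIT_SEQUENCE_EDITOR="sed -i -e '1s/^pick/edit/'" git rebase -i branch_a~2 branch_a
git rm -q big.bin
git commit -q --amend --no-edit
GIT_EDITOR=true git rebase --continue

# Second rebase: replay E onto the rewritten c (branch_a^), leaving old C behind.
git rebase --onto branch_a^ branch_b^ branch_b

shared_a=$(git rev-parse branch_a^)
shared_b=$(git rev-parse branch_b^)
leftover=$(git rev-list --objects branch_a branch_b | grep -c "big.bin")
echo "parent of branch_a tip: $shared_a"
echo "parent of branch_b tip: $shared_b"
```

After both rebases the two branches should again share a single rewritten commit as the parent of their tips.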
The "multiple branch" case can get complicated very quickly, though. And worse, if the rebase would have to traverse a merge commit, e.g.
a -- b -- C -- D -- M <--(master)
           \       /
            E --- F
doing a rebase without having anything go wrong becomes a lot harder. So in this case, you could fall back to git filter-branch
git filter-branch --index-filter 'git rm --cached --ignore-unmatch path/to/file' -- --all
This will work for just about any git repo, but if the repo is large (lots of commits) it may be slow. It also requires some special clean-up, because it creates "backup refs" that preserve the state before the rewrite. The "safest" way to clean up would be, for each branch my_branch that was rewritten,
git update-ref -d refs/original/refs/heads/my_branch
As a shortcut it can work to just rm -r .git/refs/original, but this is "less safe", and it assumes that you do it before anything might cause the refs to become packed.
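Putting the filter-branch route together on a scratch repository with a merge might look like the sketch below. The repo layout, the branch name side, and big.bin are invented for the demo; the for-each-ref loop is just one way to apply the update-ref clean-up to every backup ref, and FILTER_BRANCH_SQUELCH_WARNING only silences the notice that newer git versions print.

```shell
#!/bin/sh
# Hypothetical demo repo: C adds big.bin, and history contains a merge commit M.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git symbolic-ref HEAD refs/heads/master   # name the unborn branch "master"
git config user.name "Demo"
git config user.email "demo@example.com"

# Helper for the demo: create <name>.txt and commit everything staged.
commit() { echo "$1" > "$1.txt"; git add "$1.txt"; git commit -q -m "$1"; }

commit a
commit b
echo "pretend this is huge" > big.bin
git add big.bin
commit C
git branch side                  # side branch starts at C
commit D
git checkout -q side
commit E
commit F
git checkout -q master
git merge -q -m "M" side         # M merges side into master

# Rewrite every ref, dropping big.bin from each commit's tree.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch \
  --index-filter 'git rm -q --cached --ignore-unmatch big.bin' -- --all

# Delete the backup refs that filter-branch left under refs/original.
git for-each-ref --format='%(refname)' refs/original |
while read -r ref; do
  git update-ref -d "$ref"
done

leftover=$(git rev-list --objects --all | grep -c "big.bin")
echo "objects named big.bin still reachable from any ref: $leftover"
```

The old commits can still linger in the reflog until they expire, but they are no longer part of what a push would send.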
If the clean-up seems like too much hassle, or if the repo is too big for this to work in an acceptable time frame, the last option is to use a third-party tool like the BFG Repo Cleaner.