
Context: I'm trying to remove some files from Git because I was saving my machine learning model's checkpoints in the repository. Since I do hyperparameter optimization with Optuna and save the checkpoints for every trial in a separate subdirectory of the checkpoints directory (something I may change in the code, because there are too many files and I only need the best trial), the push exceeded the size limit. The first occurrence of the checkpoints folder is 14 commits ago, and I had already pushed successfully before, but now pushing exceeds the size limit.

Problem: I can't remove the files from the repository. I tried the following:

  1. git reset --hard @~14
    git rm -r --cached path/to/checkpoints
    git commit --amend
    git reset --hard last_commit
    
    
  2. git reset --hard @~14
    git rm -rf --cached --ignore-unmatch path/to/checkpoints
    git commit --amend
    git reset --hard last_commit
    
    
  3. git filter-branch --index-filter "git rm -rf --cached --ignore-unmatch path/to/checkpoints" HEAD
    
    

Result: When I do git reset --hard @~14 the checkpoints folder is still there, and when I do git push --force origin master it doesn't work. I think it's still the size limit, since I couldn't remove the files, but now the connection fails (I already tried increasing Git's http.postBuffer to see whether that solves the connection problem).
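(For reference, the buffer setting mentioned above is Git's http.postBuffer; raising it looks like the sketch below, where the 500 MB value is an arbitrary example.)

```shell
# Raise Git's HTTP post buffer to ~500 MB (the value is an arbitrary example).
# Note: this only changes how the push is buffered over HTTP; it does not
# lift any server-side repository size limit.
git config http.postBuffer 524288000
```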

1 Answer


If you need to remove large files from existing commits, the only one of the three sequences of commands you show above that can work is the third one (using git filter-branch). The reason for this is that:

git reset --hard <last-commit-hash-ID>

restores the old commits that you tried to undo, so methods #1 and #2 do some work and then throw out the work done and put you back into the bad state you had before.

What you haven't mentioned is the actual problem. You said:

it exceeded git limit when pushing

I'm not sure what the pronoun it here refers to. Git's own internal limits are in the gigabytes (old versions of Git) and more-than-terabytes (newer versions), though, so this cannot be a Git limit. Perhaps you are referring to a GitHub limit: Repository size limits for GitHub.com. Or perhaps you mean some other limit.

It's worth noting that GitHub will, by default at least, never discard any commits, even those not reachable from any reference name. (This is because GitHub will share storage between forks. They don't keep track of which forks might be sharing which internal Git objects; instead, they assume that if some Git object $obj exists in your repository, it might be in use by some fork, and therefore $obj can never be discarded even if your fork no longer uses it. In theory, GitHub could run a mass GC over all forks that share underlying repositories to correct this, but that might cost more than it saves.)

In any case, there are many solutions to clearing up large files, including the filter-branch method you mentioned, the newfangled git filter-repo, and of course the old standby called The BFG.
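As a minimal, self-contained sketch of the filter-branch route (again a hypothetical throwaway repo; in a real repository you would run only the filter-branch command, then force-push):

```shell
#!/bin/sh
set -e
# Build a throwaway demo repo whose first commit adds the checkpoints.
tmp=$(mktemp -d); cd "$tmp"
git init -q demo; cd demo
git config user.email demo@example.com
git config user.name  demo
mkdir -p path/to/checkpoints
echo weights > path/to/checkpoints/model.ckpt
git add -A; git commit -qm "trial 1 checkpoint"
echo code > train.py
git add -A; git commit -qm "training script"

# Rewrite every commit, dropping the checkpoints path from each commit's
# index; --prune-empty discards commits that become empty as a result.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f --prune-empty \
  --index-filter "git rm -r --cached --ignore-unmatch path/to/checkpoints" \
  HEAD

# The rewritten history no longer tracks the checkpoints:
git ls-tree -r --name-only HEAD
```

After such a rewrite you still need git push --force, and (as noted above for GitHub) the server side may hang on to the old objects anyway.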

torek
  • thank you! About when I said "it exceeded git limit": you're right, I meant the repository size limit – Luiz Felipe de Barros Jordao C Apr 08 '22 at 14:25
  • it worked with `git filter-branch -f --prune-empty --index-filter "git rm -rf --cached --ignore-unmatch path/to/checkpoints" HEAD`, although there is a warning saying that there are still large files; that warning refers to another file, though. Again, thank you for your help! – Luiz Felipe de Barros Jordao C Apr 08 '22 at 14:46