How do I totally remove a specific version of a file from a git repository?

Question

I'm working on a couple of game development projects that involve lots of changes to code and large binary files at the same time. For the sake of simplicity, let's say I have a git repository with 2 files (a text file and a large binary blob) in it that are both updated across multiple commits:

commit dddd: "Release day is finally here!" <tag: v1.0>
   changed hello.md
   changed image.png (lfs) <==== keeper!

commit cccc: "Ok, that's a bit better."
   changed hello.md
   changed image.png (lfs)

commit bbbb: "Updated my project."
   changed hello.md
   changed image.png (lfs)

commit aaaa: "Initial commit!" 
   added hello.md
   added image.png (lfs) <==== keeper!

In each commit I've made some kind of change to both of my files.

But, in retrospect, I've decided that I want to get rid of some lfs files to reduce the overall size of my repository, and only half of the versions of image.png are different enough to be worth keeping. (Keep in mind, it's not always as simple as not committing the intermediate versions, since we don't always know what the 'key' versions are without hindsight.)

So, can I completely remove the versions of image.png included in bbbb and cccc from my repository to reduce its overall storage footprint? How? I've been looking into git gc and git filter-repo but I've been having trouble achieving what I want to do. Am I on the right track? Are there any other strategies that I can use to optimize the size of my repository or otherwise mitigate this situation?

Can you live with a new history that replaces all existing commits? It cannot be used together with the old history and you can no longer merge/rebase/work with others (unless everybody updates their repository to the new repository) — knittl, Mar 06 '21 at 10:11
Yeah. I think so. This would be the kind of housecleaning task that might happen maybe a couple of times a year, with the understanding that the history will be changed and --force pushed. — Donuts, Mar 06 '21 at 18:26
I have to think a bit more about this, but it might be doable with `filter-branch` and some clever decision logic for which versions to keep (`rebase -i` could also work for a history without branching, but it requires more manual action). For your commits `bbbb` and `cccc`: which version of the file should they contain after your cleanup – the one from `aaaa` or the one from `dddd`? — knittl, Mar 07 '21 at 12:31
(By the way... It could be that this is kind of a flawed approach anyway, but the main problem I'm interested in addressing here is managing the size of our repository, especially with regard to LFS assets. Please let me know if I'm barking up the wrong tree and there's a simpler/smarter way of doing this.) — Donuts, Mar 08 '21 at 18:25

knittl · Answer 1 · 2021-03-21T12:01:38.623

Thanks for the interesting question! I finally got around to poking at Git a bit. Here is my (untested) idea for how to handle this. I used the git.git repository to experiment. It does not contain LFS data, but hopefully it will get you started in the right direction :)

git rev-list --oneline --objects --in-commit-order HEAD -- path/to/file

Outputs a list of commits, trees, and blobs; e.g.:

cf1b7869f0 Commit message here
b299d53c5f9a2a8be72f819e26f49421ed6c45bc 
52c10caf3523b877ef7fa77f7af3c64de3055b4f path/to/file

Combined with grep, this lets you extract all blob ids (hashes) of your file in question:

git rev-list --oneline --objects --in-commit-order HEAD -- path/to/file \
  | grep 'path/to/file$'
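
To decide which versions to drop, it can help to see each blob id next to the commit that introduced it. Here is a minimal sketch, assuming a POSIX shell and awk. The function name list_blob_versions is made up for illustration, and it relies on --in-commit-order printing each blob right after the first commit that references it:

```shell
# list_blob_versions REV PATH
# Prints "<blob id> <commit id>" for every version of PATH reachable from REV.
# (Hypothetical helper name; paths containing backslashes are not handled.)
list_blob_versions() {
    git rev-list --objects --in-commit-order "$1" -- "$2" |
    awk -v p="$2" '
        /^[0-9a-f]+$/ { commit = $0; next }   # bare hash: a commit line
        {
            id = $1
            sub(/^[^ ]+ ?/, "")               # drop the id, keep the path
            if ($0 == p) print id, commit
        }'
}
```

For example, `list_blob_versions HEAD image.png` should print one line per distinct version of the file, newest first. A blob id identifies content, so a version that reappears later is listed only once.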

Now you have to identify which blobs you want to keep and which you want to delete. Some clever sed magic may help, or you can pass a narrower commit range to rev-list: instead of HEAD, which lists every reachable commit, try something like v1..v3 (--since and --until might be helpful too). In the worst case, you have to do that manually.
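
One way to automate the keep/delete decision is to collect every version of the file and subtract the versions recorded at the commits you want to keep. This is a sketch under that assumption, not something I have run against an LFS repository. blobs_to_strip is a hypothetical name, and the keeper revisions are whatever tags or hashes you choose:

```shell
# blobs_to_strip PATH KEEPER_REV...
# Prints the ids of all versions of PATH reachable from HEAD, minus the
# versions recorded at the given keeper revisions. (Hypothetical helper.)
blobs_to_strip() {
    target=$1; shift
    keep=$(mktemp) || return 1
    for rev in "$@"; do
        # "<rev>:<path>" resolves to the blob stored at that path in <rev>
        git rev-parse --verify --quiet "$rev:$target"
    done | sort -u > "$keep"
    git rev-list --objects HEAD -- "$target" |
        awk -v p="$target" '{ id = $1; sub(/^[^ ]+ ?/, ""); if ($0 == p) print id }' |
        sort -u |
        comm -23 - "$keep"      # keep only ids absent from the keeper list
    rm -f "$keep"
}
```

With the commits from the question, `blobs_to_strip image.png aaaa dddd > blobs-to-strip.txt` would list the versions from bbbb and cccc. Because the comparison is by blob id, an intermediate version whose content is identical to a keeper is left out of the list automatically.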

Now, make sure to have a backup of your repository!! (Can't stress this enough). Best to create a new clone in a separate directory.

git-filter-repo appears to come with a content-based filter that provides the --strip-blobs-with-ids option.

Store all blob ids (i.e. the hashes identifying specific versions of your file) that you want to remove in a text file, one per line. Then feed this file to the content filter of filter-repo. If it does what the manual states, you should be left with only the blobs you wanted to keep.
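
Put together, the rewrite step might look like the sketch below. It assumes git-filter-repo is installed and works on a fresh clone so the original repository stays untouched as the backup. strip_listed_blobs and its arguments are hypothetical names:

```shell
# strip_listed_blobs SOURCE_REPO IDS_FILE NEW_CLONE
# Rewrites history in a fresh clone; SOURCE_REPO itself is not modified.
# (Hypothetical helper; requires git-filter-repo to be installed.)
strip_listed_blobs() {
    src=$1
    new=$3
    # Use an absolute path, because filter-repo runs inside the new clone
    ids_dir=$(cd "$(dirname "$2")" && pwd) || return 1
    ids=$ids_dir/$(basename "$2")
    # --no-local makes a real copy instead of hardlinking objects, so the
    # result counts as a fresh clone for filter-repo's safety check
    git clone --quiet --no-local "$src" "$new" || return 1
    git -C "$new" filter-repo --strip-blobs-with-ids "$ids"
}
```

A call such as `strip_listed_blobs my-game blobs-to-strip.txt my-game-cleaned` (directory names are placeholders) leaves the rewritten history in the new clone. Before force-pushing, check what the rewritten commits contain at that path, for instance with `git log --stat -- image.png`, and confirm with `git cat-file -e <id>` that a stripped id is really gone.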

As a next step, you probably want to remove the file from LFS itself, not only the reference from your commits: How to delete a file tracked by git-lfs and release the storage quota?
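
For LFS-tracked files, the blobs stored in Git are small pointer files, so before rewriting history it may be worth recording which LFS objects the doomed blobs point to. That list is what you would later look for in LFS storage. A sketch, assuming the standard pointer format with an `oid sha256:` line (lfs_oids_for_blobs is a hypothetical name):

```shell
# lfs_oids_for_blobs IDS_FILE
# For each Git blob id listed in IDS_FILE, prints the LFS object id that the
# pointer file refers to. Blobs that are not LFS pointers print nothing.
lfs_oids_for_blobs() {
    while read -r blob; do
        # cat-file -p shows the raw pointer text, not the smudged content
        git cat-file -p "$blob" |
            sed -n 's/^oid sha256:\([0-9a-f]\{64\}\)$/\1/p'
    done < "$1"
}
```

Run it in the original repository, e.g. `lfs_oids_for_blobs blobs-to-strip.txt > lfs-oids.txt`, before the stripped blobs disappear.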

I hope that helps. Let me know how—and if—it worked out.
