Note: there's a similar question, How to keep commit hashs not change when use git filter-repo rewrite the history, but the answers there focus on how Git cannot do that. In this question, I'd like to explore whether, in theory, it's possible to write a custom script that keeps the commit hashes.
Git filter-branch and BFG Repo-Cleaner are two popular tools for removing large files and other unwanted content from a repository's history. Both give the rewritten commits different SHAs / hashes, which is simply how Git works: a commit's hash "fingerprints" its contents, its parents, and so on.
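For context on what I mean by "fingerprint": as far as I understand the object model, a commit's ID is literally the hash of its serialized content, which is easy to verify with plumbing commands (in a SHA-1 repo):

```sh
# Re-hash the raw content of the commit object at HEAD; this prints the same
# value as `git rev-parse HEAD`, because the commit ID *is* the content hash.
git cat-file commit HEAD | git hash-object -t commit --stdin
git rev-parse HEAD
```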
However, we're in a situation where the unfortunate large-file commit happened a while ago, and we have all sorts of references to newer commits, e.g. in GitHub issues ("see commit f0ec467") and other external systems. If we used filter-branch or BFG, lots of things would break.
So I came here to ask whether there's some dirty, low-level trick to keep the commit IDs / SHA-1s even for rewritten commits. I imagine that for a bad commit we want to rewrite, a custom script would create the new Git object but "hardcode" the same / old SHA-1 for it, skipping the hash calculation entirely. The newer commits (its children / descendants) should then keep working, I think (?!).
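A very rough sketch of the idea, purely for illustration (it assumes both commits exist as loose objects under .git/objects rather than in a packfile, and uses HEAD~1 merely as a stand-in for the old commit whose hash we'd want to keep):

```sh
# Stand-ins for illustration only: pretend HEAD~1 is the bad commit whose hash
# we want to keep, and HEAD is its rewritten replacement.
OLD_SHA=$(git rev-parse HEAD~1)
NEW_SHA=$(git rev-parse HEAD)

# Loose objects live at .git/objects/<first 2 hex chars>/<remaining chars>,
# so "hardcoding" the old ID would effectively mean storing the new content
# under the old object's path:
cp ".git/objects/${NEW_SHA:0:2}/${NEW_SHA:2}" \
   ".git/objects/${OLD_SHA:0:2}/${OLD_SHA:2}"

# The question is when (if ever) Git notices that the content stored under
# that path no longer hashes to $OLD_SHA.
git fsck
```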
If this couldn't work, I'd like to understand why. Does Git regularly check that hashes agree with the actual contents? Or does it do so only during certain operations, like gc, push, or pull?
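For the record, these are the places I've found so far where Git explicitly re-verifies object hashes; I may well be missing some:

```sh
# Explicit integrity check: re-hashes objects and reports name/content mismatches.
git fsck

# Verification during transfers appears to be controlled by configuration:
git config transfer.fsckObjects true   # check objects in both directions
git config fetch.fsckObjects true      # check objects when fetching / cloning
git config receive.fsckObjects true    # check objects when receiving a push (server side)

# (gc / repack also reads and rewrites objects, but I'm not sure whether it
# re-verifies their hashes along the way.)
```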
(I know this is very thin ice; I'm just technically exploring our options before we accept that we'll have a large binary in our repo forever, with all the implications: much larger backups, slower full clones, etc.)
UPDATE: There's now an accepted answer, but at the same time no answer mentions git replace, which might be the solution to this? I've done some basic experiments (sketched below) but am not sure yet.
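The basic experiment, roughly (the two hashes are placeholders: BAD would be the original large-file commit and CLEANED a rewritten copy of it without the big blob):

```sh
# Placeholders for illustration; substitute real commit IDs.
BAD="<sha of the original large-file commit>"
CLEANED="<sha of the cleaned-up rewrite of that commit>"

# Tell Git to transparently substitute CLEANED wherever BAD is referenced:
git replace "$BAD" "$CLEANED"

# Replacements are just refs under refs/replace/, so they are not shared by
# default and have to be pushed / fetched explicitly:
git push origin 'refs/replace/*:refs/replace/*'
git fetch origin 'refs/replace/*:refs/replace/*'

# Since no descendant commits are rewritten, all the newer IDs referenced from
# issues etc. stay valid. But as far as I can tell the original objects are
# still in the repository, so this alone doesn't actually shrink anything.
```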