
Note: there's a similar question, "How to keep commit hashs not change when use git filter-repo rewrite the history", but the answers focus on how Git cannot do that. In this question, I'd like to explore whether, in theory, it's possible to write a custom script to keep the commit hashes.

git filter-branch and BFG Repo-Cleaner are two popular tools for removing large files and other things from a repo's history. Both produce different commit SHAs / hashes, which is inherent in how Git works: it "fingerprints" the contents of the commit, its parents, etc.

However, we're in a situation where the unfortunate large-file commit happened a while ago and we have all sorts of references to newer commits, e.g. in GitHub issues ("see commit f0ec467") and other external systems. If we used filter-branch or BFG, lots of things would break.

So I came here to ask whether there's some dirty, low-level trick to keep commit IDs / SHA-1s even for rewritten commits. I imagine that for a bad commit that we want to rewrite, a custom script would create a new Git object but "hardcode" the same / old SHA-1, skipping its calculation. The newer commits (its children / descendants) should continue working, I think (?!).

If this couldn't work, I'd like to understand why. Does Git check that hashes agree with the actual contents regularly? Does it do so only during some operations, like gc or push or pull?

(I know this is very thin ice; I'm just technically exploring our options before we accept that we'll have a large binary in our repo forever, with all the implications like having much larger backups forever, full clones taking longer, etc.)


UPDATE: There's now an accepted answer but at the same time, no answer mentions git replace which might be the solution to this? I've done some basic experiments but am not sure yet.

Borek Bernard
  • The only way is to break the SHA-1 algorithm. See [a related question](https://stackoverflow.com/q/42433126/1256452). – torek Oct 05 '20 at 08:16
  • It might be possible to _hack_ Git to achieve what you want, but I don't recommend it. Git enforces creating a new SHA-1 for a rewritten commit for auditing purposes, so it is clear that some work has been done there. – Tim Biegeleisen Oct 05 '20 at 08:16
  • @TimBiegeleisen Do you know any specifics about it? I'd also guess it would cause _some_ issues in _some_ cases but I'd like to know when exactly Git verifies that SHA-1s are actually correct. – Borek Bernard Oct 05 '20 at 08:21
  • "*If we used filter-branch or BFG, lots of things would break.*" Would they? Presumably no code would break, just references in comments and issues. – Schwern Oct 05 '20 at 08:45
  • Git verifies the contents when you transfer commits between repositories using `git push` or `git pull`. Also, `git fsck` can be run manually, and it will complain loudly if there is a mismatch. – j6t Oct 05 '20 at 08:52
  • What you are trying to achieve is exactly what a hash function like SHA-1 is designed to prevent. So you are trying to solve a problem much more difficult than the one you would have if you rewrote the history. Find the best way to build a mapping table between old and new commit hashes and your temporary problem is "solved". – Philippe Oct 05 '20 at 09:43
  • @Schwern Correct, the code itself wouldn't be affected (of course) but everything referencing specific commits would, e.g., CI builds, version numbers of our build artifacts, references from the wiki, issue tracker, Slack. On an active project, even a one-week-old "bad commit" is hard to get rid of. – Borek Bernard Oct 05 '20 at 09:44

2 Answers


I included a link as a comment, but in fact, breaking SHA-1 doesn't help very much.

The problem is that Gits exchange objects by comparing object hash IDs. These are currently SHA-1 (see the other question and its answer for some future possibilities). If you manage to break SHA-1, and produce a new input object that generates the same hash ID, you could:

  • rip the old object out of your Git's object database, then
  • insert the new object into your Git's database

and from then on, your Git would see only the new object, instead of the old one. But when you connect your Git to some other Git, and your Git says to that other Git: "I have object a123456..., would you like it?" the other Git might just answer: "No thanks, I already have that one." They have the old one, of course. So you've made your Git incompatible with their Git, but gained nothing from this.

If the other Git doesn't have the object in question, well, then you're OK! They will ask for your copy and you can hand that over.
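The exchange described above can be illustrated with a toy model. This is only a sketch: real Git negotiates over commits and packs rather than a flat dictionary, and the `push` function and the object IDs below are made up for illustration.

```python
def push(sender: dict, receiver: dict) -> list:
    """Toy transfer: offer every object by ID, copy only the IDs the receiver lacks.

    Both repositories are modeled as {object_id: content} dictionaries
    (an assumption for illustration, not Git's actual protocol).
    """
    sent = []
    for oid, content in sender.items():
        if oid in receiver:
            # "No thanks, I already have that one."
            continue
        receiver[oid] = content
        sent.append(oid)
    return sent


# My repository holds a colliding replacement under the old ID.
mine = {"a123456": b"small replacement", "b222222": b"child commit"}
# Their repository still holds the original object under that ID.
theirs = {"a123456": b"huge original"}

sent = push(mine, theirs)
```

After the call, `sent` contains only `"b222222"`, and `theirs["a123456"]` is still the original content: the replacement never propagates because the ID alone decides what gets transferred.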

Commit and tag objects have room in them for somewhat-arbitrary (not completely arbitrary) user data. This is where you would put your perturbable data for breaking SHA-1. Tree objects are less friendly, but as long as you can do what you need to with commit and tag objects, you can probably bypass this.

As for where to get the compute power, well, the price of a large group of Raspberry Pi computers is coming down....

Edit: I forgot to address this question:

Does Git check that hashes agree with the actual contents regularly?

Yes. In fact, it does this check every time it extracts an object by its hash ID. Remember that the bulk of most repositories is the object database, which is a simple key-value store. The key is the hash ID and the data stored under that key represent the object. Git uses the key to do the lookup, then verifies that the stored data hash to that key, to make sure the stored data were not corrupted by a disk or memory error.
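To make the key-value picture concrete, here is a minimal sketch of a content-addressed store that verifies on read. The ID computation (SHA-1 over a `<type> <size>\0` header followed by the body) is how Git names objects; the `ObjectStore` class and its `force_write` method are hypothetical names for illustration, not Git's implementation.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Compute a Git object ID: SHA-1 of '<type> <size>\\0' followed by the body."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


class ObjectStore:
    """Hypothetical in-memory stand-in for an object database."""

    def __init__(self):
        self._objects = {}

    def put(self, obj_type: str, body: bytes) -> str:
        # The key is always derived from the content; callers cannot choose it.
        oid = git_object_id(obj_type, body)
        self._objects[oid] = (obj_type, body)
        return oid

    def force_write(self, oid: str, obj_type: str, body: bytes) -> None:
        # The "dirty trick" from the question: store data under a hard-coded key.
        self._objects[oid] = (obj_type, body)

    def get(self, oid: str):
        obj_type, body = self._objects[oid]
        # Re-hash on every read, so a mismatch is detected immediately.
        if git_object_id(obj_type, body) != oid:
            raise ValueError(f"object {oid} is corrupt: data does not hash to its key")
        return obj_type, body
```

In this sketch, writing different content under an existing ID with `force_write` makes the next `get` for that ID raise, which is the same reason a hand-crafted object file with a hard-coded name would be rejected as corrupt.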

torek
  • That's a very good point about the exchange. I guess what I'm looking for would need to be somehow directly supported by Git, e.g., "moving large files to LFS while keeping the commit hashes _somehow_" would need to be a supported scenario. I think it's theoretically possible but not right now, that's for sure. – Borek Bernard Oct 05 '20 at 08:28
  • About the other point – why spend all the compute on finding a collision? Isn't it possible to just create a Git object file with the right hash? (I know it's not possible via a Git client, but objects use a documented file format so I don't see a reason why such an object couldn't be created manually.) – Borek Bernard Oct 05 '20 at 08:29
  • The hash of the object is an SHA-1 checksum of the bytes that make up the object. Compute the SHA-1 of a few strings of bytes. See if you can come up with a way to compute a target SHA-1: what bytes do you need to feed it? If you can come up with fast algorithm for this, you have broken SHA-1. – torek Oct 05 '20 at 09:02
  • I've just noticed your edit, which is probably _the_ answer why it cannot be done. Thanks! – Borek Bernard Oct 05 '20 at 09:40
  • I wonder if `git replace` is actually a solution to this, was amazed by my early experiments yesterday... Do you have more experience with it, @torek? – Borek Bernard Oct 13 '20 at 07:13
  • Using `git replace` is fine, just be aware of its limitations: the way it works is that when Git is about to look up the object whose hash is X (for any X), it first checks to see if refs/replace/X exists. If it does, Git looks up the hash ID to which refs/replace/X maps instead. (Use `git --no-replace-objects` to avoid having this happen.) There are two main drawbacks: (1) a very large refs/replace/* namespace will eventually build up and make Git slow. (2) clone does not copy these names, so a new clone doesn't have the replacements. – torek Oct 13 '20 at 07:39
  • These drawbacks tend to doom the use of replacement long-term for the sort of thing you're contemplating. A way around both of those would perhaps be useful. You'd need to experiment and eventually convince the Git maintainers to adopt whatever results you find useful. – torek Oct 13 '20 at 07:41
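
The `git replace` lookup described in the comments above can be modeled roughly as follows. This is a sketch under simplifying assumptions: objects and `refs/replace/*` are plain dictionaries, and `lookup` and `clone` are hypothetical helpers, not Git functions.

```python
def lookup(objects: dict, replace_refs: dict, oid: str, use_replacements: bool = True):
    """Return the object for oid, honoring a refs/replace/<oid> entry if one exists.

    use_replacements=False models running `git --no-replace-objects`.
    """
    if use_replacements and oid in replace_refs:
        oid = replace_refs[oid]  # follow refs/replace/<oid> to the substitute ID
    return objects[oid]


def clone(objects: dict, replace_refs: dict):
    """Model the second drawback: a plain clone copies objects but not refs/replace/*."""
    return dict(objects), {}


objects = {"old111": "commit with large file", "new222": "commit without large file"}
replace_refs = {"old111": "new222"}

seen_here = lookup(objects, replace_refs, "old111")
cloned_objects, cloned_refs = clone(objects, replace_refs)
seen_in_clone = lookup(cloned_objects, cloned_refs, "old111")
```

Here `seen_here` is the substitute commit while `seen_in_clone` is still the original, so the old ID keeps resolving everywhere but the replacement is only visible where the replace refs exist.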

A commit ID incorporates the commit IDs of its parents. This means if two commits have the same ID Git knows not just that the two commits are equal, but also that their entire histories are equal. This is fundamental to how Git works, particularly push and pull. Mess with it at your peril.
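
A small sketch of why this chaining matters: the parent's ID is part of the bytes that get hashed, so changing any ancestor changes every descendant's ID. The commit layout below follows Git's commit object format, but the author line, timestamps, messages and the non-empty tree ID are made-up placeholders.

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """SHA-1 of '<type> <size>\\0' followed by the body, as Git computes object IDs."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_id(tree_id: str, parent_ids: list, message: str) -> str:
    """Build a commit object body and return its ID (fixed placeholder identity and date)."""
    lines = [f"tree {tree_id}"]
    lines += [f"parent {p}" for p in parent_ids]  # parent IDs are hashed too
    lines += [
        "author A U Thor <author@example.com> 0 +0000",
        "committer A U Thor <author@example.com> 0 +0000",
        "",
        message,
    ]
    return git_object_id("commit", ("\n".join(lines) + "\n").encode())


empty_tree = git_object_id("tree", b"")
# Placeholder ID standing in for a tree that contains the large file.
big_tree = hashlib.sha1(b"hypothetical tree with large file").hexdigest()

# Original history: bad commit, then a later commit on top of it.
bad_old = commit_id(big_tree, [], "add data")
child_old = commit_id(empty_tree, [bad_old], "later work")

# Rewritten history: same messages, but the bad commit no longer has the file.
bad_new = commit_id(empty_tree, [], "add data")
child_new = commit_id(empty_tree, [bad_new], "later work")
```

`child_old` and `child_new` have identical trees and messages, yet their IDs differ, because the `parent` line inside each commit differs.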

It's possible you could do something clever with git-replace, but I have no experience with that.

If this couldn't work, I'd like to understand why. Does Git check that hashes agree with the actual contents regularly? Does it do so only during some operations, like gc or push or pull?

git gc may have issues, but git fsck would lose its mind. You'd never be able to repair a broken repository. And as torek says, pushes and pulls between old and new repositories will get very confused.


I would recommend instead keeping a copy of the original repository around to reference. When you find a reference to an old ID you can still look it up. And if you judiciously rewrite them to reference the equivalent commit in the new repository, eventually you won't need the old repo anymore.

You could speed up this process by searching for hexadecimal strings, checking if they match a commit ID, and replacing them with the new commit ID. A mapping of old to new can be obtained by running `git log --pretty='format:%H'` on both repositories and comparing them one-to-one (this assumes the rewrite neither dropped nor reordered commits, so that both logs line up).
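
A rough sketch of that search-and-replace step, assuming you have already captured the two `git log` outputs as lists of full 40-character IDs in matching order. The function names and the sample IDs are invented for illustration; abbreviated references such as `f0ec467` are handled by prefix match and left alone if they are unknown or ambiguous.

```python
import re

# Runs of 7 to 40 lowercase hex digits: candidate (possibly abbreviated) commit IDs.
HEX_RUN = re.compile(r"\b[0-9a-f]{7,40}\b")


def build_mapping(old_ids: list, new_ids: list) -> dict:
    """Pair the two logs line by line; refuse if they cannot line up one-to-one."""
    if len(old_ids) != len(new_ids):
        raise ValueError("logs differ in length; a one-to-one mapping is not possible")
    return dict(zip(old_ids, new_ids))


def rewrite_references(text: str, mapping: dict) -> str:
    """Replace old commit IDs in text with new ones, keeping the abbreviation length."""

    def replace(match):
        token = match.group(0)
        candidates = [old for old in mapping if old.startswith(token)]
        if len(candidates) != 1:
            return token  # not a known commit, or an ambiguous abbreviation
        return mapping[candidates[0]][: len(token)]

    return HEX_RUN.sub(replace, text)


# Made-up IDs standing in for the output of the two `git log` runs.
old_ids = ["f0ec467" + "0" * 33, "1" * 40]
new_ids = ["abc1234" + "9" * 33, "2" * 40]
mapping = build_mapping(old_ids, new_ids)

updated = rewrite_references("see commit f0ec467", mapping)
```

Keeping the replacement at the same length as the original token is a design choice for readability; very short abbreviations may collide in a large repository, so emitting the full new ID instead is the safer variant.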


Update

If you really, really need those GitHub links to work, you could write an HTTP proxy which redirects https://github.com/your-org/your-repo/commit/oldcommitid to https://github.com/your-org/your-repo/commit/newcommitid.
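
As a minimal sketch of such a redirector, using only Python's standard `http.server`: the `your-org/your-repo` path is the placeholder from above, the mapping is assumed to come from an old-to-new table you built yourself, and how requests get routed to this server instead of GitHub is left open.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

COMMIT_PREFIX = "/your-org/your-repo/commit/"  # placeholder org/repo from the text


def redirect_target(path: str, mapping: dict):
    """Return the new commit URL for an old commit path, or None if it is unknown."""
    if not path.startswith(COMMIT_PREFIX):
        return None
    new_id = mapping.get(path[len(COMMIT_PREFIX):])
    if new_id is None:
        return None
    return f"https://github.com{COMMIT_PREFIX}{new_id}"


def make_handler(mapping: dict):
    """Build a request handler class bound to a given old-to-new commit mapping."""

    class RedirectHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            target = redirect_target(self.path, mapping)
            if target is None:
                self.send_error(404, "unknown commit")
                return
            self.send_response(301)  # permanent redirect to the rewritten commit
            self.send_header("Location", target)
            self.end_headers()

    return RedirectHandler


# To serve (blocks forever), something like:
#   HTTPServer(("", 8080), make_handler({"oldcommitid": "newcommitid"})).serve_forever()
```

The lookup is split out into `redirect_target` so the mapping logic can be checked without starting a server.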

Schwern
  • Those commits can also have discussions around them on GitHub; re-creating everything from scratch is just a very disruptive event if the bad commit happened far enough in the past (which is our case). I agree with all your points though. I know we'd be playing with fire. – Borek Bernard Oct 05 '20 at 08:58