Usually, I add multiple related repos as remotes to my local git folder, such as one for upstream and one for myfork.

I sometimes do some work on the myfork.feature1 branch and then do a cross-repo rebase onto the latest upstream.main branch.

To achieve the rebase, I think git should do this:

  1. Find the most recent common base commit of the myfork.feature1 and upstream.main branches.

  2. Pull in the latest commits from upstream.main onto myfork.feature1.

  3. And then apply my commits since the common base commit on top of the latest commits just pulled in from upstream.main.

So the common base commit is the critical node.
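The three steps above can be sketched on a toy commit graph. This is only an illustration of the mental model, not how Git itself is implemented; the commit names and the helper functions `ancestors`, `merge_base`, and `commits_to_replay` are hypothetical.

```python
# Toy commit graph: commit id -> list of parent ids (all names are made up).
PARENTS = {
    "A": [],
    "B": ["A"],
    "C": ["B"],    # upstream.main: A-B-C-D
    "D": ["C"],
    "F1": ["B"],   # myfork.feature1 forked at B: B-F1-F2
    "F2": ["F1"],
}

def ancestors(commit):
    """Return the commit plus everything reachable through parent links."""
    seen, stack = set(), [commit]
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(PARENTS[c])
    return seen

def merge_base(a, b):
    """Common ancestors that no other common ancestor descends from."""
    common = ancestors(a) & ancestors(b)
    return sorted(c for c in common
                  if not any(c != d and c in ancestors(d) for d in common))

def commits_to_replay(branch, base):
    """Commits after base on branch, oldest first (assumes linear history)."""
    chain, c = [], branch
    while c != base:
        chain.append(c)
        c = PARENTS[c][0]  # follow the first (only) parent
    return chain[::-1]

base = merge_base("F2", "D")[0]           # step 1: common base commit
replay = commits_to_replay("F2", base)    # step 3: commits to re-apply on top of D
print(base, replay)  # B ['F1', 'F2']
```

In a real repository, `git merge-base` reports the common base commit for two branches.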

My question is, if git finds 2 commits in both myfork.feature1 and upstream.main branches that have the same SHA hash, is it safe to assume that is the common base commit? Are all the histories before that guaranteed to be the same?

smwikipedia
  • "if git finds 2 commits in both myfork.feature1 and upstream.main branches that have the same SHA hash, can it safely assume that is the common base commit? Are all the histories before that guaranteed to be the same?" If two commits have the same hash, they are the same commit. That means they have the same "contents" and the same parent(s). I didn't understand any of the rest of what you said, but if that's the question, that's the answer. – matt Aug 23 '22 at 03:55
  • Yes, they are the same. If you worry about hash collision in git, take a look at https://stackoverflow.com/questions/10434326/hash-collision-in-git. – ElpieKay Aug 23 '22 at 07:36

1 Answer

There are two different questions we can answer here:

  1. Q: Does Git assume that two commits with the same hash ID are in fact the same commit (and therefore have the same history behind them)?

    A: Yes. Git makes this assumption.

  2. Q: Is this assumption always valid? That is, are there any cases where we can generate a hash collision?

    A: No, the assumption is not always valid. It is at least theoretically possible to have a hash collision. Furthermore, SHA-1 is now nominally broken (as in, it's now feasible to attack SHA-1). The chances of any two specific objects having a collision are minuscule (2^-160), but an attacker can craft particular files that will cause blob object hash ID collisions.
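To see why the assumption in question 1 is reasonable, here is a minimal Python sketch of how a commit's hash ID is derived. It assumes SHA-1 object naming; the simplified commit body, the placeholder identity and timestamp, and the helper names `git_object_id` and `commit_id` are made up for illustration.

```python
import hashlib

def git_object_id(obj_type, body):
    # Git names an object by SHA-1 over "<type> <size>" + NUL byte + body.
    header = f"{obj_type} {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

# Hash of the empty tree, used so every toy commit has identical content.
EMPTY_TREE = git_object_id("tree", b"")

def commit_id(tree, parents, message):
    # Simplified commit body; the identity and timestamp are placeholders.
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines.append("author A U Thor <author@example.com> 0 +0000")
    lines.append("committer A U Thor <author@example.com> 0 +0000")
    body = "\n".join(lines) + "\n\n" + message + "\n"
    return git_object_id("commit", body.encode())

root_1 = commit_id(EMPTY_TREE, [], "root")
root_2 = commit_id(EMPTY_TREE, [], "a different root")

# Same tree, same message, same identity, but different parents:
child_1 = commit_id(EMPTY_TREE, [root_1], "work")
child_2 = commit_id(EMPTY_TREE, [root_2], "work")

print(child_1 == child_2)                                   # False
print(child_1 == commit_id(EMPTY_TREE, [root_1], "work"))   # True
```

Because each parent's hash ID is part of the bytes being hashed, a different history behind a commit yields a different hash ID for that commit, barring a collision.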

For more on question 2, see also "Hash collision in git" and "How does the newly found SHA-1 collision affect Git?" Note that the birthday paradox means that as you add more objects to a pool of objects, the collision chance rises rather rapidly. On the other hand, commit objects have a known and specific format: while the log message can contain arbitrary bytes, generating a deliberate collision will leave clear tracks. To avoid accidental collisions, though, we must choose an acceptable error rate.[1]

Having done this, we can compute how many objects can be in the object store before we reach the chosen probability level. To achieve the same probability as the disk error rate in the footnote (about 10^-18) in a Git repository, we need to keep the number of keys (hash IDs) below about 1.71 x 10^15 (1.71 quadrillion). This is an enormously large repository. Each key is itself 20 bytes and each value is whatever the average object size is. Let's assume just 1000 bytes total, which is almost certainly far too small for a repository this large, but makes our calculations easier: the repository would be about 1.7E18 bytes. Counting in powers of 1024, that's about 1546141 terabytes, or 1510 petabytes, or 1.5 exabytes.
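The figures above can be reproduced with a short calculation. This is a sketch that assumes the usual birthday approximation p ≈ n^2 / (2 * 2^bits) and takes the footnote's 10^-18 as the target probability; the function and variable names are illustrative.

```python
import math

def max_objects(hash_bits, target_probability):
    # Birthday approximation: p ~ n^2 / (2 * 2^bits),
    # so n ~ sqrt(2 * 2^bits * p).
    return math.sqrt(2 * 2**hash_bits * target_probability)

TARGET = 1e-18            # the disk error rate quoted in the footnote
AVG_OBJECT_BYTES = 1000   # the deliberately small average assumed above

n_sha1 = max_objects(160, TARGET)
total_bytes = n_sha1 * AVG_OBJECT_BYTES

print(f"SHA-1 objects:   {n_sha1:.3g}")
print(f"repository size: {total_bytes / 2**60:.2f} exabytes (powers of 1024)")
print(f"SHA-256 objects: {max_objects(256, TARGET):.3g}")
```

The last line applies the same approximation to a 256-bit hash, which allows a far larger object count at the same probability.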

(This is all before going to SHA-256, which makes things even safer.)


[1] We already do this for storage media. For instance, physical disk drive manufacturers quote their undetected error rate at about 10^-18 or so.

torek