How many possible hash values does one need to avoid clashes among N items? If you recall the birthday paradox, the answer is much larger than N: on the order of N^2.

Let's reverse the question: for 16^10 (= 2^40) possible hash values, which corresponds to 10 hex digits of an abbreviated Git revision hash, at how many revisions does the probability of a hash coincidence rise to 50%? A direct calculation shows that with about 1,234,603 revisions, the probability that two of them share the same 10-digit hash is 50%.
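The figure above can be checked with the usual birthday approximation, P(collision) ≈ 1 - exp(-k(k-1)/(2H)) for k items and H equally likely hash values. The following is a minimal sketch under that assumption (uniformly distributed hashes); the function names are made up for illustration.

```python
import math


def collision_probability(k: int, n_values: int) -> float:
    """Approximate chance that at least two of k items share a hash value.

    Uses the birthday approximation 1 - exp(-k(k-1) / (2 * n_values)),
    which assumes hash values are uniformly distributed.
    """
    return -math.expm1(-k * (k - 1) / (2 * n_values))


def items_for_probability(p: float, n_values: int) -> float:
    """Approximate number of items at which the collision chance reaches p."""
    return math.sqrt(2 * n_values * math.log(1 / (1 - p)))


n_values = 16 ** 10  # 10 hex digits = 2**40 possible abbreviated hashes
k_half = items_for_probability(0.5, n_values)

print(f"items for a 50% collision chance: about {k_half:,.0f}")
print(f"collision chance at 1,234,603 items: {collision_probability(1_234_603, n_values):.4f}")
```

Both directions agree: roughly 1.23 million items give an even chance of at least one clash among 10-digit abbreviations.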

Now, a million or so revisions is not unheard of in large active repositories. Has anybody here experienced a Git hash clash in their work? Theoretically speaking, that ought to have happened.

Michael
  • You're right that as you grow the number of objects (in Git, this is commits+trees+blobs+annotated tags, not just the commit count) the number of bits needed for collision safety grows pretty rapidly. The standard trick is to guess at the 50%-collision risk by using the square root of the hash space, so if there are 2^n possible values (n bits) you get to ~50% at about 2^(n/2) objects. That's hardcoded into Git's automatic `--abbrev` length calculator but it's not good enough... – torek Jun 03 '19 at 22:55
  • The truth is that you can't guarantee no clashes with *any* number of bits. It's always possible that two items will clash. You can reduce the likelihood by adding more bits, but you're not going to eliminate the possibility entirely. – Jim Mischel Jun 04 '19 at 05:29

2 Answers

Git automatically scales the length of abbreviated hashes as the number of objects increases such that this is usually not an issue. In addition, if an abbreviated hash would be ambiguous at the normal length, Git will automatically produce a longer, unambiguous value. Some commands let you control the length of abbreviations with an option named --abbrev if you want a specific value, and the core.abbrev option can override the default.
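To illustrate how such scaling could work, here is a rough sketch of a square-root heuristic: with about 2^b objects, pick enough hex digits that the abbreviation space is about 2^(2b), then never go below a small floor. This is an illustration based on my understanding of the heuristic, not Git's actual code; the function name is hypothetical and the floor of 7 digits is an assumption.

```python
def estimated_abbrev_length(object_count: int, minimum: int = 7) -> int:
    """Estimate an abbreviation length (in hex digits) for object_count objects.

    Sketch of a square-root heuristic, not Git's implementation:
    - bits is the position of the highest set bit, so object_count < 2**bits;
    - a collision is expected near the square root of the hash space, so
      about 2 * bits bits are wanted, i.e. bits / 2 hex digits (4 bits
      each), rounded up;
    - minimum (assumed to be 7 here) is a floor for small repositories.
    """
    bits = max(object_count, 1).bit_length()
    hex_digits = (bits + 1) // 2  # ceil(bits / 2)
    return max(hex_digits, minimum)


for count in (100, 1_000_000, 2 ** 30):
    print(count, estimated_abbrev_length(count))
```

Under these assumptions a repository with about a million objects lands on 10 hex digits, which is right at the 50% point computed in the question; that is why the heuristic alone is not a guarantee and Git still lengthens an abbreviation that turns out to be ambiguous.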

However, these names are necessarily only unique at the moment they're created, so if you're producing tools that need to work with revisions, they should always operate on the full object IDs. Note also that there is work underway to switch to using SHA-256, so you should not assume anything about the length of a particular full object ID when writing tools.

bk2204

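As explained in "How much of a Git SHA is generally considered necessary to uniquely identify a change in a given codebase?", you can get the minimum required length with git rev-parse --short, which treats the requested length as a minimum and lengthens the result as needed to keep it unambiguous: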

 git rev-parse --short=4 HEAD

But if you want to be sure, and work only with the full length:

With Git 2.31 (Q1 2021), the configuration variable 'core.abbrev' can be set to 'no' to force no abbreviation regardless of the hash algorithm.

And that will be important when Git switches from SHA-1 to SHA-256 (a SHA-2 hash).

See commit a9ecaa0 (01 Sep 2020) by Eric Wong (ele828).
(Merged by Junio C Hamano -- gitster -- in commit 6dbbae1, 15 Jan 2021)

core.abbrev=no: disables abbreviations

Signed-off-by: Eric Wong

This allows users to write hash-agnostic scripts and configs by disabling abbreviations.

Using "-c core.abbrev=40" will be insufficient with SHA-256, and "-c core.abbrev=64" won't work with SHA-1 repos today.

[jc: tweaked implementation, added doc and a test]

git config now includes in its man page:

If set to "no", no abbreviation is made and the object names are shown in their full length.
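For a tool that shells out to Git, one way to apply this is to pass the setting per invocation, so the script does not depend on the user's configuration or on the hash length. The sketch below assumes Git 2.31 or later is on the PATH; both helper names are hypothetical.

```python
import subprocess
from typing import List


def git_full_id_argv(*git_args: str) -> List[str]:
    """Build a git command line that disables abbreviation for this call only."""
    # "-c core.abbrev=no" needs Git 2.31+; older versions reject the value.
    return ["git", "-c", "core.abbrev=no", *git_args]


def run_git_full_ids(*git_args: str, cwd: str = ".") -> str:
    """Run git in cwd with abbreviations disabled and return its stdout."""
    result = subprocess.run(
        git_full_id_argv(*git_args),
        cwd=cwd,
        check=True,           # raise CalledProcessError on a non-zero exit
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, run_git_full_ids("log", "--oneline", "-n", "5") should print full-length object names in both SHA-1 and SHA-256 repositories, without hardcoding 40 or 64.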

VonC