
I'm working on a new version of git-stats, a tool that draws graphs based on Git commits, authors, etc.

The current version tolerates identical commit ids because it namespaces them by project URL:

{
   "some-project-url": { "hash1": "date", ... },
   "some-project-url-fork": { "hash1": "date", ..., "commit-in-fork-id": "date" }
}

I want to remove the requirement of storing the project URL, which means identical hashes from different projects could no longer be told apart.

Now I'm wondering whether this is a good move.

When multiple projects are imported and each commit is stored once, what is the probability of getting two identical ids?

In real life, when does it actually happen that two different projects contain two identical ids?

Ionică Bizău
  • Same question as for SHA-1 Collision since the git hash is created with SHA-1 – Zelldon Jul 10 '15 at 07:12
  • Since Git commit IDs are SHA-1 hashes, you’re essentially looking for SHA-1 hash collisions. Those are of course possible (they are mathematically guaranteed :P), but so far, collisions have been difficult to find with SHA-1. In normal Git projects, it’s unlikely that you will ever run into a collision. Taking the Linux project as an example, the largest Git project to date, the hashes are so unique there that people can still refer to commits using only the first few characters, so there are not really collisions, no. – poke Jul 10 '15 at 07:20
  • Read [this one](http://www.quora.com/What-is-the-probability-that-two-two-commits-of-different-git-repositories-have-exact-same-SHA-hash) – ckruczek Jul 10 '15 at 07:21
  • @poke true, you can refer to abbreviated SHA1, but that has evolved over time: http://stackoverflow.com/a/21015031/6309 – VonC Jul 10 '15 at 07:24
  • @ckruczek Interesting -- so, my users would have to be lucky to find two identical ids in their projects. :) – Ionică Bizău Jul 10 '15 at 07:26
  • They won't find any :) – ckruczek Jul 10 '15 at 07:28
  • @Joost Ah, indeed! True, true, true! – Ionică Bizău Jul 10 '15 at 07:33
  • I've added it as an answer now, instead. – Joost Jul 10 '15 at 08:13

2 Answers


SHA-1 hashes consist of 160 bits, allowing for 2^160 ≈ 1.4615e+48 combinations. Because of the birthday paradox, it takes only about the square root of this number (roughly 2^80 hashes) to reach a roughly 50% chance of a collision, but that's still enormous. Note, however, that the input to the hash is not at all uniformly random, as the id is simply the hash over the commit data (see here).
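To put numbers on the birthday bound, here is a minimal sketch that uses the standard approximation p ≈ 1 - exp(-n(n-1) / (2 * 2^160)). It assumes SHA-1 output behaves like a uniformly random 160-bit value, which is an idealisation; the function name and the sample commit counts are made up for illustration.

```python
import math

HASH_BITS = 160  # SHA-1 output size in bits


def collision_probability(n_commits: int, bits: int = HASH_BITS) -> float:
    """Approximate chance that at least two of n_commits ids collide.

    Assumes ids are uniformly random, which real commit ids are not.
    """
    pairs = n_commits * (n_commits - 1) / 2
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-pairs / 2**bits)


if __name__ == "__main__":
    # Hypothetical sizes: a million commits, a billion commits, 2^80 commits.
    for n in (10**6, 10**9, 2**80):
        print(n, collision_probability(n))
```

Even a billion imported commits leave the probability far below anything a graphing tool needs to worry about; the value only becomes noticeable around 2^80 ids.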

I suppose the most likely reason for a collision is not a weakness in SHA-1, but an exact match in the input data. And that seems highly unlikely, given that author details and timestamps are in there as well.

All in all, commit hashes seem unique enough to identify commits across different projects without any real risk of trouble.
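To illustrate why matching input data is so unlikely, here is a rough sketch of how Git derives an object id: SHA-1 over a header of the form "<type> <size>\0" followed by the object body. The author name, email, message, and timestamps below are invented for the example, and the commit body is simplified (no parents, no signature).

```python
import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    # Git hashes "<type> <size>\0" followed by the raw object body.
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


# Id of Git's empty tree; everything else in the commit is hypothetical.
EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def commit_body(timestamp: int) -> bytes:
    who = f"Alice <alice@example.com> {timestamp} +0000"
    return (
        f"tree {EMPTY_TREE}\n"
        f"author {who}\n"
        f"committer {who}\n"
        "\n"
        "Initial commit\n"
    ).encode()


if __name__ == "__main__":
    first = git_object_id("commit", commit_body(1436512320))
    second = git_object_id("commit", commit_body(1436512321))
    print(first)
    print(second)
    print("different ids:", first != second)
```

The two commits differ only by one second in their timestamps, yet the ids are unrelated. Two projects would need byte-identical tree, parent, author, committer, and message data to produce the same id by accident.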

Joost

There are three cases to consider.

  1. Two different non-malicious commits that happen to have the same commit ID.
  2. Two different commits deliberately constructed to have the same commit ID.
  3. Two projects that contain the same commit.

Case 1 is extremely unlikely.

Case 2 is possible now that SHA-1 collision techniques have been demonstrated, but GitHub at least has put countermeasures in place to try to block such commits.

Case 3 is actually the most likely. Many projects are forks of other projects.
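For a tool that stores each commit once, Case 3 mostly means deduplication. The following is only a sketch, assuming the per-project shape from the question (project URL mapped to a dictionary of hash to date); `merge_projects` is a hypothetical helper, not part of git-stats.

```python
from typing import Dict


def merge_projects(projects: Dict[str, Dict[str, str]]) -> Dict[str, str]:
    """Flatten {project_url: {commit_id: date}} into {commit_id: date}.

    A commit shared by a project and its fork is stored once. The same id
    with a different date is reported, since that would point to Case 1
    or Case 2 rather than a shared commit.
    """
    commits: Dict[str, str] = {}
    for url, hashes in projects.items():
        for commit_id, date in hashes.items():
            known = commits.setdefault(commit_id, date)
            if known != date:
                raise ValueError(
                    f"{commit_id} in {url} has date {date}, expected {known}"
                )
    return commits


if __name__ == "__main__":
    # Placeholder data in the same shape as the question's example.
    projects = {
        "some-project-url": {"hash1": "2015-07-10"},
        "some-project-url-fork": {
            "hash1": "2015-07-10",
            "commit-in-fork-id": "2015-07-11",
        },
    }
    print(merge_projects(projects))
```

Comparing only the date is a weak check; comparing author and message as well would catch more mismatches, at the cost of storing them.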

plugwash
  • *Case 3* is not really a thing, because if they are a fork, it's the same commit. What I'm wondering is: do we have an example of Case 2? How do you construct two different projects that contain at least one identical commit hash (a commit id collision) but with different metadata (author, message, changes, etc.), and that are still valid repos? Thanks! – Ionică Bizău Jun 18 '17 at 02:54