
If I understand Git correctly, each commit is identified by a SHA-1 checksum. To generate this hash, Git also takes the previous commit's hash as part of the hash function's input. That is to say, barring hash collisions (whether accidental or the result of an attack), if the last commits of two repositories have the same hash, I can be confident that the two repositories are exactly the same.

Is this understanding correct?

  • Nope... not the repos. The _branches_ are the same.... or whatever objects you are comparing are the same. Repos hold many of those objects, but they do not have a single hash that you could use to compare them. – eftshift0 May 25 '21 at 17:17
  • @eftshift0 What do you mean by "or whatever objects you are comparing are the same"? –  May 25 '21 at 17:21
  • You might be comparing trees, blobs, branches/revisions. They all have a sha1 ID that could be used to compare them. If 2 blobs have the same ID, they are sure to have the same content. If 2 trees have the same ID, they all have the same structure and file contents.... if 2 revisions have the same ID, they have the same history, same content in all revisions. – eftshift0 May 25 '21 at 17:28
  • Emmm... I guess what you said is a little bit too advanced for my current knowledge. Basically I mean this: after executing `git log` command, I can see a list of commits. Now I have two folders on two computers and I believe that they store exactly the same project (perhaps I should use the term `branch` instead of `project` here?) but I am not 100% sure if any files are damaged/tampered. So I compare the SHA1 of their last commits, and I discover that their values are the same. So how should I draw the conclusion? I say "These two branches are exactly the same"? Is this conclusion accurate? –  May 25 '21 at 17:34
  • Those 2 revisions that you are comparing, they have the same id? They are the same. Same history, same content in each revision (keep in mind, as this is a very important tip: branches are nothing more than pointers to revisions in git). https://git-scm.com/book/en/v2/Git-Internals-Git-Objects – eftshift0 May 25 '21 at 17:38
  • @eftshift0 isn't commit ID just an SHA1 value? https://stackoverflow.com/questions/29106996/what-is-a-git-commit-id If commit ID is SHA1 value, then yes both commits have the same ID. So this is exactly what I asked: same ID == same repo? or same ID == same branch? –  May 25 '21 at 17:41
  • Why do you bring up the word _repo_? A _repo_ has no sha1 id to compare, that's what I told you from the very first comment. – eftshift0 May 25 '21 at 19:23
  • @eftshift0 A repo per se does not have an SHA1, but this is exactly what I asked. To rephrase my question, essentially I am asking if I can use the SHA1 of the last commit as the signature of a repo. The purpose of the question is to quickly verify if two repos are exactly the same. –  May 26 '21 at 04:28
  • And there is no such thing. Both repos might have the same branches pointing at exactly the same revisions, and you might say they are the same, but: Will you also consider stashed objects to compare? Also loose objects? Reflog? Configured remotes? Other things? See what I mean? – eftshift0 May 26 '21 at 09:22

2 Answers


Since the probability of a SHA1 collision is so small that we can neglect it, we can treat a SHA1 as a unique identifier of the content it represents. Therefore, if 2 commits from different repos have the same SHA1, then these commits are identical and their history is identical. It doesn't mean that those repos have the same list of commits, though.

By the way, this feature is extensively used by GitHub: internally they combine all forks of a repo into 1 big repo. This way they eliminate extra copying.
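To make the "same commit ID means same history" point concrete, here is a minimal Python sketch of how a SHA-1 object ID is derived. It assumes the object format described in Pro Git's Git Internals chapter (a `<type> <size>\0` header followed by the content). The helper names `git_object_id` and `commit_body`, and the author/committer identity, are made up for illustration and are not part of Git.

```python
from __future__ import annotations

import hashlib


def git_object_id(obj_type: str, body: bytes) -> str:
    """Hash '<type> <size>\\0' + body with SHA-1, as Git does for object IDs."""
    header = f"{obj_type} {len(body)}".encode() + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def commit_body(tree: str, parents: list[str], message: str) -> bytes:
    """Build the text of a commit object (hypothetical fixed identity)."""
    ident = "Example <example@example.com> 0 +0000"  # illustrative only
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]  # parent IDs are part of the input
    lines += [f"author {ident}", f"committer {ident}", "", message]
    return ("\n".join(lines) + "\n").encode()


EMPTY_TREE = git_object_id("tree", b"")

# Two histories whose latest commits look alike but whose roots differ.
root = git_object_id("commit", commit_body(EMPTY_TREE, [], "first"))
other_root = git_object_id("commit", commit_body(EMPTY_TREE, [], "first (tampered)"))

child_a = git_object_id("commit", commit_body(EMPTY_TREE, [root], "second"))
child_b = git_object_id("commit", commit_body(EMPTY_TREE, [other_root], "second"))

# The children differ only in their parent line, yet their IDs differ.
print(child_a == child_b)
```

Because the parent ID is hashed into every commit, a change anywhere in the ancestry propagates to the ID of every descendant. That is why matching IDs imply matching history, but say nothing about commits that are not reachable from the one being compared.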

Stanislav Bashkyrtsev
  • Thanks @Stanislav Bashkyrtsev. There are some important details which I am not very sure about without checking Git's source code... The first issue is, what exactly is being used as the input of the hash function? Does Git consolidate all files into one file and calculate the hash? Or does it only calculate the hash based on the changed part plus the hash value from the last commit? –  May 25 '21 at 17:23
  • @Mamsds, the hash is the hash of the object. Objects could be blobs (file content), trees (directories), commits (they reference a root tree which in turn references sub-trees and blobs as well as parent commits). So no - it's not a diff which gets a SHA1, it's the full content of the object - so called _loose_ object. – Stanislav Bashkyrtsev May 25 '21 at 17:25
  • The second issue is the one raised by @eftshift0, so does it mean that a commit is a branch-specific operation. That is, same hash value from two commits only means identical branches, not identical repository?... –  May 25 '21 at 17:26
  • @Mamsds, no, eftshift0 meant something different.. If you have 2 repos with the same commit, it doesn't mean their list of commits is the same. "master" branch in "repo1" may contain more commits than the same branch in "repo2" - therefore the repos wouldn't be identical even though part of their history may be identical. Commits _don't_ exist within branches. Branches are simply references to commits. – Stanislav Bashkyrtsev May 25 '21 at 17:29
  • Regarding your reply about the "loose object", I guess you address the question from another aspect. But what I am thinking is, in the end, all information of a repo/branch is saved on a file system, let's just say files are saved in a folder. When you say "it's the full content of the object", can I interpret it as "any change made to any file in the repo/branch folder will change the hash value, so the same hash value means exactly the same files in the directory"? –  May 25 '21 at 17:30
  • @Mamsds, even better, Git objects are immutable. Any change of an object will create a _new_ object. That object will have a different SHA1. The original object will also stay and it will have its old SHA1. – Stanislav Bashkyrtsev May 25 '21 at 18:14
  • @mamsds [Pro Git's section on Git Internals](https://git-scm.com/book/en/v2/Git-Internals-Plumbing-and-Porcelain) should provide the details you're looking for. In particular Git Objects and Git References. You can use `git cat-file -p ` to see what is hashed. – Schwern May 26 '21 at 22:39

When two commits in two separate repositories have the same object ID, they will refer to the same history, including all commits, trees, and blobs reachable from them, assuming no hash collisions have occurred.

Note that this does not mean that the repositories are completely identical. Those two repositories might have branches, tags, or other references pointing to different commits, and they may also have different sets of objects referred to by the reflog.
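As a rough illustration of the difference between "same commit" and "same repository", the following Python sketch compares every ref (branches, tags, and so on) of two local clones. It relies on `git for-each-ref` with a `--format` of `%(refname) %(objectname)`. The function names `read_refs` and `diff_refs` are hypothetical, and the comparison deliberately ignores reflogs, unreachable objects, and configuration.

```python
from __future__ import annotations

import subprocess


def read_refs(repo_path: str) -> dict[str, str]:
    """Map each ref name in the repository at repo_path to its object ID."""
    out = subprocess.run(
        ["git", "-C", repo_path, "for-each-ref",
         "--format=%(refname) %(objectname)"],
        check=True, capture_output=True, text=True,
    ).stdout
    refs = {}
    for line in out.splitlines():
        # Ref names cannot contain spaces, so the last field is the ID.
        name, oid = line.rsplit(" ", 1)
        refs[name] = oid
    return refs


def diff_refs(a: dict[str, str], b: dict[str, str]) -> dict[str, tuple]:
    """Return refs that are missing on one side or point at different IDs."""
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(a.keys() | b.keys())
        if a.get(name) != b.get(name)
    }
```

For example, `diff_refs(read_refs("/path/to/repo1"), read_refs("/path/to/repo2"))` (both paths are placeholders) returns an empty dict only when both clones have the same refs pointing at the same objects. Even then, the reflogs mentioned above may still differ.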

Note that if you are using a SHA-1 repository, it is not safe to rely on the absence of hash collisions. The cost to create a SHA-1 collision is approximately USD 11000, so any medium-sized company or government agency can afford to create collisions. While Git has measures to detect if colliding objects are pushed to a repository, that wouldn't have any effect if the repositories were separate. If you require integrity, you need to use a SHA-256 repository instead.

bk2204
  • _"The cost to create a SHA-1 collision is approximately USD 45000"_ - any source for this? – 1615903 May 26 '21 at 04:05
  • @1615903 same question: `The cost to create a SHA-1 collision is approximately USD 45000`. My understanding is that at this stage SHA-1 is not cracked. That is, no one, including NSA/FBI, can craft a collision within a reasonable time, no matter how much you are willing to pay. –  May 26 '21 at 04:25
  • My estimate was outdated. [According to these researchers](https://sha-mbles.github.io/), the cost is now USD 11000. Those researchers have indeed produced a collision (two, in fact), which you can download and verify for yourself. SHA-1 is insecure. – bk2204 May 26 '21 at 22:21
  • Understand that "produce a collision" involves simultaneously modifying two different objects until they produce the same unpredictable hash code. No one can produce a second-preimage collision of the kind needed to attack an existing history, even for MD5, a hash so "weak" there's probably a Fisher-Price toy that can engineer the random collisions. And there's a reason no engineered collisions are in plain text, the format used for source code: it's read directly by humans. The random bs needed to steer the hash engine into the collision looks like random bs to humans. – jthill May 26 '21 at 23:06
  • @bk2204 emmm...seems you are right! Perhaps I confused SHA1 with SHA256. –  May 28 '21 at 04:30