I always thought that every Git object has a unique SHA. Then, while listing a Git tree, I found this:

...
100644 blob fc47072354934eb062321af9d1c4897244562b67    exp2f-inputs
100644 blob fc47072354934eb062321af9d1c4897244562b67    expf-inputs
...
100644 blob 7eb7bda5e433f5df0fd6fec001c69cab7a08ebdb    fmaxf-inputs
...
100644 blob 7eb7bda5e433f5df0fd6fec001c69cab7a08ebdb    fminf-inputs
...
100644 blob 50a97394769447a692318ccefe333b494da7cc97    log2f-inputs
100644 blob 50a97394769447a692318ccefe333b494da7cc97    logf-inputs
...

Those files are from glibc.

My question is: aren't those SHAs supposed to be unique for every Git object?

Mas Bagol

1 Answer

Every single Git object does have a unique SHA. That tree object you're listing contains multiple references to the same blob object.

A blob object is, basically, the contents of a file. Those two files have the same contents, so Git stores them as the same blob.

$ echo 'basset hounds got long ears' > file1
$ cp file1 file2
$ git hash-object -t blob file1 file2
a55bd80950a2a5fc0b43b76ec1b3da190efcd212
a55bd80950a2a5fc0b43b76ec1b3da190efcd212

Here's an illustration of the relationship between tree and blob objects from the Git Objects chapter of the Pro Git book.

[Image: diagram of tree objects pointing to blob objects, from the Git Objects chapter of Pro Git]

That's how this file tree is stored...

new.txt       "new file"
test.txt      "version 2"
bak/
    test.txt  "version 1"

Incidentally, this is how Git can store complete snapshots of every file at each commit efficiently. Since each commit usually only changes a few files, commits mostly reference the same tree and blob objects.
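
The sharing described above can be illustrated with a toy content-addressed store. This is only a sketch under simplifying assumptions: `blob_id` and `store_snapshot` are hypothetical names, a "tree" here is a plain dict rather than a hashed Git object, and the file contents are made up to match the example tree.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Compute a Git-style SHA-1 blob ID from content alone."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def store_snapshot(objects: dict, files: dict) -> dict:
    """Store each file's content by ID and return a name -> ID mapping."""
    tree = {}
    for name, content in files.items():
        oid = blob_id(content)
        objects[oid] = content  # identical content lands on the same key
        tree[name] = oid
    return tree


objects = {}
first = store_snapshot(objects, {"new.txt": b"new file\n", "test.txt": b"version 1\n"})
second = store_snapshot(objects, {"new.txt": b"new file\n", "test.txt": b"version 2\n"})

# Two snapshots of two files each, but only three distinct blobs are stored.
print(len(objects))  # 3
```

The unchanged `new.txt` is stored once and referenced by both snapshots, which is the same effect that keeps full snapshots cheap in Git.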

Schwern
  • So, I can't rely on those SHAs to uniquely identify blobs? – Mas Bagol May 30 '18 at 20:45
  • @MasBagol You can! But you have to recognize what a blob is: it's just the content of the file. The filename and mode (permissions) are stored in the tree. This is analogous to how a filesystem works, with directories and inodes. – Schwern May 30 '18 at 20:47
  • "Unique" modulo collisions. – jub0bs May 30 '18 at 20:50
  • @Schwern I see. So a blob is not a file; it's just the content of one or more files, am I correct? If so, can those SHAs be used to uniquely identify files? – Mas Bagol May 30 '18 at 20:51
  • @MasBagol: you'll have to define what *you* mean by "file" here. This question is more subtle than it looks at first! Is `foo.ext` the same file as `bar.txt`? What if I've run `mv foo.ext bar.txt`? (Compare with the philosophical question of [the Ship of Theseus](https://en.wikipedia.org/wiki/Ship_of_Theseus).) – torek May 30 '18 at 20:53
  • @Jubobs Non-engineered collisions are astronomically unlikely and [how Git handles them is well understood](https://stackoverflow.com/questions/9392365/how-would-git-handle-a-sha-1-collision-on-a-blob#34599081). – Schwern May 30 '18 at 20:53
  • @MasBagol I think it would be best if you backed up and told us what you're trying to accomplish. – Schwern May 30 '18 at 20:54
  • What I'm trying to accomplish is to make a flat JSON map of a Git tree. I was using those SHAs as keys, which doesn't work if they're not unique per file. – Mas Bagol May 30 '18 at 20:58
  • @MasBagol In that case do what a tree object does. For a file, the key is the filename and the value is the blob ID. For a directory, the key is the directory name and the value is another tree. – Schwern May 30 '18 at 21:00
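
The mapping suggested in the last comment can be sketched as follows. This is a rough illustration, not a definitive implementation: `flat_tree_map` is a hypothetical helper, and it assumes input lines shaped like the listing in the question, with paths that Git has not quoted.

```python
import json


def flat_tree_map(listing: str) -> dict:
    """Map each path to its blob ID, given ls-tree-style lines."""
    mapping = {}
    for line in listing.splitlines():
        if not line.strip():
            continue
        # Each line is "<mode> <type> <id>", then whitespace, then the path.
        mode, obj_type, obj_id, path = line.split(None, 3)
        if obj_type == "blob":
            mapping[path] = obj_id  # the path is the key, so it stays unique
    return mapping


listing = """\
100644 blob fc47072354934eb062321af9d1c4897244562b67    exp2f-inputs
100644 blob fc47072354934eb062321af9d1c4897244562b67    expf-inputs
100644 blob 7eb7bda5e433f5df0fd6fec001c69cab7a08ebdb    fmaxf-inputs
"""

print(json.dumps(flat_tree_map(listing), indent=2))
```

Keying by path keeps every entry even when several paths share one blob ID, which is what went wrong when the SHA was the key.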