
I understand that Git uses SHA-1 to compute a hash from the contents of a file. However, I still cannot see how Git 'unpacks' this 40-character hash back into a full file, which could be very large. It seems like magic that it can store such a small amount of data (40 characters) and then use it to produce an arbitrarily large file.

Is there something I am missing here?

mars8
    You are probably missing that this hash is not used to "unpack" the data. It is only used to _reference_ the existing data in a git repo. – Mikhail Jan 25 '23 at 09:07
  • It's basically the same thing as looking up user information in a database when you know that you want to know stuff about the user with the id 1. Except that the ID isn't some automatically increasing number, but calculated from the actual data instead. – Joachim Sauer Jan 25 '23 at 09:37

1 Answer


It doesn't. The hash is only used as a key to look up the data. The full data is stored on disk (zlib compressed).

See e.g. the files .git/objects/xx/xxxx... – the file path is the hash, and the content of each file is the tag/commit/tree/blob content.
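To make the relationship concrete, here is a minimal Python sketch of how a loose blob object is stored. It assumes the standard loose-object layout (a `blob <size>\0` header prepended to the content, the whole thing SHA-1 hashed and zlib-compressed); the `content` bytes are just an illustrative example:

```python
import hashlib
import zlib

content = b"hello\n"  # the file's raw bytes (example data)

# Git hashes a type/size header plus the content, not the content alone
header = b"blob %d\x00" % len(content)
store = header + content
sha1 = hashlib.sha1(store).hexdigest()

# The 40-character hash is only the storage path inside .git/objects ...
path = f".git/objects/{sha1[:2]}/{sha1[2:]}"

# ... while the full data is written at that path, zlib-compressed
compressed = zlib.compress(store)

# Reading an object back: decompress, then strip the header
recovered = zlib.decompress(compressed).split(b"\x00", 1)[1]
assert recovered == content
```

So the hash never "unpacks" into anything; the complete content sits on disk, and the hash merely tells Git where to find it.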

The question How is the Git hash calculated? has very detailed explanations.

knittl
  • Thanks, I will check out these content files. I take it they are very aggressive/efficient with their compression, which is why large repos can be sent over the network quickly? Are there any useful links on sizing/compression? – mars8 Jan 25 '23 at 12:05
  • 2
    @mars8 loose objects use standard zlib. Most repository content is source code and text compresses well. Git additionally employs "delta compression" when creating packfiles containing many objects, which further helps in bringing the size down. – knittl Jan 25 '23 at 12:15