
I have an old legacy .NET project with tons of DLLs (and no package manager like NuGet).

The total size of the files is around 1.5 GB.

When I initialize a git repository with this project, the total size of .git is < 300 MB. How is it possible that git compacts binaries better than the best zip tool can?

UPDATE: After digging into @mvp's comment, I've found that some DLLs in this project are duplicated up to 20 times.

$ find . -name '*.dll' -exec basename {} \; > dlls
$ cat dlls | sort | uniq -c | sort -nr | awk '{ print $2, $1 }'
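
The commands above group files by name only, so two different builds of the same DLL would be counted together and a renamed copy would be missed. A content-based count is closer to what git does. The following is a minimal sketch, assuming GNU `sha1sum` is available (on macOS, `shasum` would be the substitute); the function name `count_dll_copies` is made up for illustration.

```shell
# Count *.dll files under a directory by content hash rather than by name.
# Assumes GNU coreutils (sha1sum); the function name is hypothetical.
count_dll_copies() {
    # Hash every DLL, keep only the hash column, then count identical hashes.
    find "${1:-.}" -type f -name '*.dll' -exec sha1sum {} + |
        awk '{ print $1 }' | sort | uniq -c | sort -nr
}
```

Running `count_dll_copies .` prints one line per distinct content, with the number of copies first and the most duplicated content at the top.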

I'm still waiting for an answer about how git identifies "duplicates" and manages them.

Tristan
  • Perhaps you have a lot of duplicate files? git is designed to never keep duplicate objects in object store. But internal git compression is pretty good too. – mvp Jan 28 '20 at 09:55
  • I'm running some bash commands to check this, but you may have found the explanation, because my "project" is made of multiple modules and these modules have common dependencies. Can you post this as an answer, and add a link to some explanation of how git stores duplicates? (based on name? hash? name + hash?) – Tristan Jan 28 '20 at 10:42
  • Git hashes all your content and merely compares hashes (not actually file contents) to detect changes. This is what makes git fast. As a side effect you get free duplicate detection (file hash matching) – slebetman Jan 28 '20 at 10:52
  • The hash is based on the file's content. It's a checksum of the internal "blob content", which is the file plus a little header that Git needs to know the internal object type ("blob") and size-before-compression. The hash is currently SHA-1 but Git is being converted to use other algorithms now. – torek Jan 28 '20 at 16:22
  • I think the question I ask (my use case) is clearly different from the 2 other questions supposed to be the same. It should be useful to keep it in SO as an entry point to this git structure knowledge. – Tristan Jan 29 '20 at 15:20

1 Answer


Since no one has posted an answer here:

git computes a hash of every file's content (not its name or path) and stores each distinct content only once. Files with the same hash share a single stored object, whether they come from different revisions or from different directories in the same revision.
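
To make the hashing concrete, here is a small sketch of how that ID can be reproduced by hand. As described in the comments, the hash covers a short header (the object type "blob" and the content size) followed by the file's bytes. It assumes the SHA-1 object format and GNU `sha1sum`; the function name `git_blob_id` is made up for illustration.

```shell
# Reproduce the ID git assigns to a file's content (SHA-1 object format).
# Assumes GNU coreutils (sha1sum); the function name is hypothetical.
git_blob_id() {
    # Size of the content in bytes (tr strips padding some wc versions add).
    size=$(wc -c < "$1" | tr -d ' ')
    # Hash "blob <size>" + NUL byte + raw content; the path is never included.
    { printf 'blob %s\0' "$size"; cat "$1"; } | sha1sum | cut -d' ' -f1
}
```

Two copies of the same DLL in different directories produce the same ID, which is why they end up as one object in `.git`. If git is installed, the result should match `git hash-object <file>`.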

Tristan