If I have several files in various directories, with different filenames but exactly the same content, will each duplicate increase the repo size, or will they be stored as "one" file?

For example, if a file is 100 kB and it's duplicated 10 times in the repository (same content, different directories, different file names), will the repository be 100 kB or 1000 kB?


Note: I could have semi-tested this myself, and it seems I could eventually have found the answer by reading through the long answers in the linked possible duplicates. But I want a quick, short, and clear answer from someone who knows what they're talking about, and I want it to be the first result in a Google search. I don't know if this will be that, but when I was searching for an answer to this question, there definitely weren't any immediately clear answers in my search results.

Svish
  • What prevents you from trying it yourself? – choroba Mar 22 '19 at 14:12
  • Nothing really, but git repos contain a lot of files, and I'm not sure what they all do or how they are connected, so I'm not sure exactly what to look for. So I figured it'd be easier to ask someone who might know for sure exactly how git handles duplicate files. – Svish Mar 22 '19 at 14:15
  • You can create a repo with only two files in it and see what happens. That way, you won't have a repo with "a lot of files". The first time, create it with two identical files. The second time, create it with two different files. – Raymond Chen Mar 22 '19 at 14:17
  • Note that *duplicate files* is different from *almost but not quite duplicates*, which is the focus of https://stackoverflow.com/questions/25661952/does-git-de-duplicate-between-files – torek Mar 22 '19 at 18:22

2 Answers

Nope, the duplicates won't increase the repo size. Git only saves content once and then points to it as many times as needed. So if you have the same content 100 times under different names or paths, the content will be saved once and you will have 100 pointers to it.
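
The reason is that git names a file's content (a "blob") by a hash of the content alone; the filename is recorded in the tree object, not in the blob. Here is a minimal Python sketch of how a blob ID is derived, assuming the default SHA-1 object format. The helper name `git_blob_id` and the example paths are made up for illustration:

```python
import hashlib

def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID git assigns to a blob with this content."""
    # Git hashes a small header ("blob <size>\0") followed by the raw bytes.
    # The filename is not part of the input.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()

# Two hypothetical files: different names and paths, identical content.
files = {
    "docs/a.txt": b"Hello, World!\n",
    "src/deep/b.txt": b"Hello, World!\n",
}

ids = {path: git_blob_id(data) for path, data in files.items()}
for path, oid in ids.items():
    print(oid, path)
```

Both lines print the same ID, which is why git has only one object to store. In a SHA-1 repository, running `git hash-object` on a file with the same bytes should give the same value.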

eftshift0
  • Cool, thanks. That's what I was hoping and expecting, but couldn't find it expressed that clearly. – Svish Mar 22 '19 at 14:15
  • So, for clarity: if I have fileA.txt that contains `Hello, World!` and fileB.txt that also contains `Hello, World!`, the size of the repo will not be the size of both files, but rather of just one of them? – dimwittedanimal Mar 22 '19 at 14:19
  • @dimwittedanimal At those scales, when the metadata (filename and a hash) is larger than the file itself, the repo will increase by more than the size of the file. But say you have the text of War and Peace in the file: the repo will only increase by the few bytes needed to add the file reference to the tree. – LightBender Mar 22 '19 at 14:24
  • Well, not exactly, because you still have the tree objects that make up the paths, especially if you put the file in different locations in different revisions. But the repo won't grow just because the same content appears multiple times, since content is saved only once. – eftshift0 Mar 22 '19 at 14:24
  • There is one hitch here having to do with pack files. Normally, you will have one pack file containing the one big file with its one unique hash ID. However, if you use "keep" files to *keep* a pack even when it gets "repacked" into a new pack, that one big file, with its one unique hash ID, could appear in *more than one pack file* (the new one, plus the retained old one). – torek Mar 22 '19 at 18:26

You can use `git rev-list --objects --all` to list all objects reachable from the refs in the repository. The duplicate file content will be shown only once if the files are part of the same pack.

For example, in a new repo with a.txt and b.txt, which are identical and were committed in two separate commits (a.txt first):

$ md5sum *.txt
3ac628079d9cf781d155c26dabaade91  a.txt
3ac628079d9cf781d155c26dabaade91  b.txt

$ git rev-list --objects --all
f0b4bdc93a65012069d6e96d54624a34ee1d1552
9f8a9ceb3b5f22e67b86b6d57837def070802baf
a19cc397dae6a39fc4f9fbdbd4bf9da05c00bef0 
d05accac53d462a927e7787edee5fb97db24c386 a.txt
d5bc7e22610744c7717f65d3ec60957583469857 
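
In the listing above, the blob appears once (under a.txt) even though b.txt has the same content. The same effect can be mimicked without git. What follows is a rough Python sketch of a content-addressed store applied to the question's "100 kB file, 10 copies" scenario. It is a toy model, not git's on-disk format: `ToyStore` and the paths are hypothetical, and real git also compresses objects and stores tree and commit objects, so actual sizes will differ:

```python
import hashlib

class ToyStore:
    """Toy content-addressed store: objects are keyed by a hash of their bytes."""

    def __init__(self):
        self.objects = {}  # object ID -> content, stored once per unique content
        self.paths = {}    # path -> object ID, standing in for tree entries

    def add(self, path: str, content: bytes) -> str:
        # Same ID scheme as a SHA-1 git blob: header plus raw bytes.
        header = b"blob " + str(len(content)).encode("ascii") + b"\0"
        oid = hashlib.sha1(header + content).hexdigest()
        self.objects.setdefault(oid, content)  # no-op if already stored
        self.paths[path] = oid
        return oid

store = ToyStore()
payload = b"x" * 100_000  # stand-in for a 100 kB file

# Add the same content under ten different directories and names.
for i in range(10):
    store.add(f"dir{i}/copy{i}.bin", payload)

stored_bytes = sum(len(c) for c in store.objects.values())
print(len(store.paths), "paths,", len(store.objects), "blob,", stored_bytes, "bytes of content")
```

Ten paths end up pointing at a single stored blob, so the content is counted once; each extra copy costs only its path-to-ID entry.
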
Karol Dowbecki