
As I understand it, .git contains all the blob and commit objects, so it should always be larger than the files in the working directory.

How can .git end up smaller than the working directory? Is it because the repo contains many small files (smaller than the block size) and git compresses them?

Can anyone explain this in more detail?

Update with more details:

The repo that confuses me is the CocoaPods master repo, which stores iOS library specifications. When a new version of a library is released, a new file is added to the repo (existing files are not edited). The new spec is usually very similar to the previous version's; often only the version number changes. Each release adds at least three objects to the repo: a blob, a tree and a commit.

Using du -d 1 -h, the sizes are:

1.1G ./Specs

729M ./.git
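
One way to check the block-size guess is to compare how many bytes the files actually contain with how much space the filesystem allocates for them, since du reports the latter by default. The following is only a minimal sketch, assuming a Unix-like system where `st_blocks` is available and counts 512-byte units; the file names and sizes are made up for illustration.

```python
import os
import tempfile


def sizes(root):
    """Return (apparent_bytes, allocated_bytes) for all regular files under root."""
    apparent = 0
    allocated = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            st = os.lstat(os.path.join(dirpath, name))
            apparent += st.st_size           # bytes of actual content
            allocated += st.st_blocks * 512  # assumption: st_blocks is in 512-byte units
    return apparent, allocated


def demo(n_files=200, payload=b"x" * 300):
    """Create n_files tiny files in a temporary directory and measure them both ways."""
    with tempfile.TemporaryDirectory() as root:
        for i in range(n_files):
            # Hypothetical file names, standing in for small spec files.
            with open(os.path.join(root, f"spec_{i}.json"), "wb") as fh:
                fh.write(payload)
        return sizes(root)


if __name__ == "__main__":
    apparent, allocated = demo()
    print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
```

On many filesystems the allocated figure comes out well above the apparent one for files this small, while a pack file inside .git holds many objects in a single file. The exact numbers depend on the filesystem, so none are quoted here.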

Karl

1 Answer


"So it always is larger than the file in the working directory."

Nope.

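To understand why, you need to know how git stores its data.
Git names every object by a hash of its content, so when git finds identical content (a whole file) it doesn't store it twice; it stores it once and every tree that contains it points to that same object. For content that is only similar (for example, two files that share most of their lines), git uses heuristics when it builds a pack file: it stores one object in full and the others as deltas against it. These pieces are loosely referred to here as hunks; strictly, a hunk is a block of changes in a diff, and the pieces inside a pack file are called deltas.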

Whenever you execute git add, git grabs the content, hashes it (SHA-1 by default, the same calculation git hash-object does), compresses it with zlib and stores it as a loose object inside your .git/objects folder. The hunks are "set up" later on, when git moves the loose objects into a pack file (for example during git gc).
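
To make the hashing step concrete, here is a small sketch of how a blob's object id is derived and why identical content ends up stored only once. `ToyObjectStore` is a hypothetical in-memory stand-in for .git/objects, not anything git ships; only the header layout, the SHA-1 hash, the zlib compression and the two-character directory split are taken from how git handles loose objects.

```python
import hashlib
import zlib


def blob_header(content: bytes) -> bytes:
    """Header git prepends before hashing a blob: 'blob <size>' plus a NUL byte."""
    return b"blob " + str(len(content)).encode("ascii") + b"\0"


def blob_id(content: bytes) -> str:
    """Object id of a blob: SHA-1 over the header followed by the content."""
    return hashlib.sha1(blob_header(content) + content).hexdigest()


class ToyObjectStore:
    """Hypothetical in-memory stand-in for .git/objects (loose objects only)."""

    def __init__(self):
        self.objects = {}  # object id -> zlib-compressed header + content

    def add(self, content: bytes) -> str:
        oid = blob_id(content)
        if oid not in self.objects:  # identical content is stored only once
            self.objects[oid] = zlib.compress(blob_header(content) + content)
        return oid

    def path(self, oid: str) -> str:
        # Loose objects live under objects/<first 2 hex chars>/<remaining chars>.
        return f".git/objects/{oid[:2]}/{oid[2:]}"


if __name__ == "__main__":
    store = ToyObjectStore()
    first = store.add(b"hello world\n")
    second = store.add(b"hello world\n")  # same bytes, e.g. a copied file
    print(first == second, len(store.objects))
    print(store.path(first))
```

Adding the same bytes twice yields the same id and a single stored object, which is the "store it once and point to it" behaviour described above. Deltas between merely similar objects are a separate step that only happens at pack time.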

The "real" content of your files (once git packs it later on) is a set of full objects plus smaller chunks (the hunks above), and git knows how to reassemble them into your original file.


What are hunks?

A hunk is one contiguous block of changes in a diff (patch). You can see hunks when you execute git add -p; if you have changes in several locations in a file, choose s and you will see the hunk split up.

These are the options available within add -p:

y - stage this hunk
n - do not stage this hunk
q - quit, do not stage this hunk nor any of the remaining ones
a - stage this and all the remaining hunks in the file
d - do not stage this hunk nor any of the remaining hunks in the file
g - select a hunk to go to
/ - search for a hunk matching the given regex
j - leave this hunk undecided, see next undecided hunk
J - leave this hunk undecided, see next hunk
k - leave this hunk undecided, see previous undecided hunk
K - leave this hunk undecided, see the previous hunk
s - split the current hunk into smaller hunks
e - manually edit the current hunk
? - print help

Once you use s, git picks out the chunks of code that can be considered standalone changes. If you want to split a hunk even further, you will have to use e to edit the hunk and then add it back to the staging area.
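
To see what a hunk looks like outside of git, the sketch below uses Python's standard `difflib` to build a unified diff and cut it at the `@@` headers. This only illustrates the diff-side notion of a hunk; the `a/file` and `b/file` labels and the sample lines are made up, and git's own diff machinery is not involved.

```python
import difflib


def hunks(old: str, new: str, context: int = 1):
    """Split the unified diff between two texts into hunks (lists of lines)."""
    diff = difflib.unified_diff(
        old.splitlines(), new.splitlines(),
        fromfile="a/file", tofile="b/file", n=context, lineterm="",
    )
    result = []
    for line in diff:
        if line.startswith("@@"):   # hunk header: starts a new hunk
            result.append([line])
        elif result:                # skips the ---/+++ file header lines
            result[-1].append(line)
    return result


if __name__ == "__main__":
    old_lines = [f"line {i}" for i in range(1, 21)]
    new_lines = list(old_lines)
    new_lines[1] = "line two"         # one change near the top
    new_lines[18] = "line nineteen"   # another change near the bottom
    for hunk in hunks("\n".join(old_lines), "\n".join(new_lines)):
        print("\n".join(hunk))
        print("-" * 20)
```

Two changes that are far apart come out as two separate hunks, each with its own `@@` header, which is the unit git add -p asks you about.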


It is tempting to say that git stores "patches" which are the delta of your changes, but it is not quite that simple. Every commit refers to complete snapshots of your files; on top of that, git reuses the same content once it "sees" it again, and when packing it uses heuristics to store similar objects as deltas, adding only the "new" bytes while pointing to the old ones.
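
The idea of "adding only the new bytes while pointing to the old ones" can be sketched as a list of copy and insert instructions against a base object. This is a toy illustration built on `difflib.SequenceMatcher`, not git's actual binary delta format or its base-selection heuristics; the spec content and the names `make_delta`, `apply_delta` and `literal_bytes` are hypothetical.

```python
import difflib


def make_delta(base: bytes, target: bytes):
    """Describe target as copy-from-base and insert-literal instructions."""
    matcher = difflib.SequenceMatcher(None, base, target, autojunk=False)
    delta = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            delta.append(("copy", i1, i2 - i1))      # reuse bytes already in base
        elif tag in ("replace", "insert"):
            delta.append(("insert", target[j1:j2]))  # only the new bytes are kept
        # "delete" needs no instruction: those base bytes are simply not copied
    return delta


def apply_delta(base: bytes, delta) -> bytes:
    """Rebuild the target from the base and the instruction list."""
    out = bytearray()
    for op in delta:
        if op[0] == "copy":
            _, offset, length = op
            out += base[offset:offset + length]
        else:
            out += op[1]
    return bytes(out)


def literal_bytes(delta) -> int:
    """Number of bytes the delta has to carry itself (not copied from base)."""
    return sum(len(op[1]) for op in delta if op[0] == "insert")


if __name__ == "__main__":
    # Made-up spec files that differ only in the version number.
    v1 = b'{"name": "ExampleLib", "version": "1.0.0", "tag": "1.0.0"}\n'
    v2 = v1.replace(b"1.0.0", b"1.0.1")
    delta = make_delta(v1, v2)
    assert apply_delta(v1, delta) == v2
    print(f"target is {len(v2)} bytes, delta carries {literal_bytes(delta)} literal bytes")
```

For two specs that differ only in a version string, almost everything is a copy instruction, which gives a feel for why many near-identical files can pack down to very little.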

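Later on git grabs the content and packs it, compressing it with zlib (the DEFLATE algorithm that ZIP also uses, not the ZIP archive format). That is why a repo full of near-identical small files, like the one in the question, can end up with a .git folder smaller than its working directory.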



CodeWizard
  • Thanks. How can I find these patches? – Karl Dec 08 '18 at 10:55
  • The best way is to use `grepdiff 'console' --output-matching=....`. Read about this tool; if you are on Windows it might be a problem. – CodeWizard Dec 08 '18 at 10:59
  • Hi, I have updated the question to provide more info; do you have time to take a look? I also tried add -p, and it seems to work only when I change the same file. If two files have similar content, it can't detect these changes. – Karl Dec 09 '18 at 08:51
  • "if two file has similar content, it can't detect these change": of course, git reuses the same content and does not duplicate it, even when you copy or rename a file. – CodeWizard Dec 09 '18 at 08:53
  • I mean similar: a little different, not the same. Sorry for not expressing that clearly. – Karl Dec 09 '18 at 08:59
  • Git does *not* store "patches which are the delta of your changes". Instead, it uses both "compressed versions of individual files" and "differences between similar files". The latter are not tied to changes made to files, but it often amounts to that. – j6t Dec 09 '18 at 09:06
  • @CodeWizard I have read that git stores complete snapshots, unlike centralized version control systems such as svn, and now I'm getting confused. Does git store a delta or a complete snapshot when a file is modified? – Aayush Neupane May 09 '21 at 17:10
  • It's more complicated than that. Git stores hunks: git "calculates" the differences and then stores them. Git searches for code similarity based upon the Histogram algorithm, checks for similarity of code based upon percentage, and more. To make it short: git compares the code, creates a patch (called a hunk) and then stores those hunks as differences. Search for an explanation of how git does its heuristics; as mentioned above, it is based upon the number of lines changed, percentage and more. – CodeWizard May 31 '21 at 07:45