2

Say I have a file with this content:

1111
2222
3333

Then I modify it to:

1111
2222
4444
3333

Does Git create a new file for the newer version? I'm confused: if it creates a new file, wouldn't the whole repository grow very quickly?

Another thought is that Git doesn't create a new file, but only stores where lines were added or deleted, along with the content of the new lines.

Which one is correct?

WoooHaaaa

3 Answers

4

Many older source control systems, such as RCS and CVS, specifically store differences between versions of files. For example, the information for a given source file might be stored in the repository in a form that includes the full text of the latest version, plus "instructions" for generating earlier versions.

Git, at least conceptually, stores the entire content of each version of every file in the repository. It saves some space by storing only one copy of identical files, since the name used to store it is determined by hashing the contents.
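To make the hashing point concrete, here is a minimal Python sketch of how a blob's name is derived from its content. It assumes the default SHA-1 object format (repositories can also be created with SHA-256), and `git_blob_id` is a hypothetical helper name, not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID Git assigns to a blob with this content."""
    # A blob is hashed as "blob <size in bytes>\0" followed by the raw content.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


old_version = b"1111\n2222\n3333\n"
new_version = b"1111\n2222\n4444\n3333\n"

# Different content gives a different ID, so the edited file is a new object.
print(git_blob_id(old_version) != git_blob_id(new_version))  # True
# Identical content always gives the same ID, so it is stored only once.
print(git_blob_id(old_version) == git_blob_id(b"1111\n2222\n3333\n"))  # True
```

If Git is installed, the result for a given file should match what `git hash-object <file>` prints in a SHA-1 repository.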

Obviously if that were the whole story, Git repositories would become very large very quickly. But Git automatically packs, or compresses, stored objects. I frankly don't know all the details, but it does a good job of both minimizing storage space and permitting arbitrary versions to be recreated quickly.

For example, the Git sources are themselves stored in a Git repository, which contains probably thousands of distinct objects. All the versions of all the files are stored under the directory .git/objects/pack, which currently contains the following (the listing is of a clone on my system):

$ ls -l .git/objects/pack
total 48900
-r--r--r-- 1 kst kst  4196172 Mar 20 15:44 pack-0e69de7b7728ad0fde80423ded259dbff7760016.idx
-r--r--r-- 1 kst kst 36698393 Mar 20 15:44 pack-0e69de7b7728ad0fde80423ded259dbff7760016.pack
-r--r--r-- 1 kst kst   125896 Jun 30 22:17 pack-2848a675d3c196391f06cc7cdd6cebf67fb7119e.idx
-r--r--r-- 1 kst kst  3570770 Jun 30 22:17 pack-2848a675d3c196391f06cc7cdd6cebf67fb7119e.pack
-r--r--r-- 1 kst kst   178452 May 16 08:22 pack-bfd75de39dff6ac03adcc775f7b5715480b54637.idx
-r--r--r-- 1 kst kst  5292998 May 16 08:22 pack-bfd75de39dff6ac03adcc775f7b5715480b54637.pack

What's different about Git compared to earlier systems (at least to the earlier systems I've used) is that, on a high level, all versions of all files in the repository are stored in full, but the compression is provided by a separate layer.
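The two layers can be sketched in a few lines of Python. This is only an illustration of a "loose" (unpacked) blob, where the full content plus a small header is compressed with zlib. The function names are hypothetical, and pack files add further delta compression that is not shown here.

```python
import zlib


def encode_loose_blob(content: bytes) -> bytes:
    """Build the stored bytes of a loose blob: header + full content, zlib-compressed."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return zlib.compress(header + content)


def decode_loose_blob(stored: bytes) -> bytes:
    """Recover the full file content from the compressed object bytes."""
    raw = zlib.decompress(stored)
    header, _, content = raw.partition(b"\0")
    kind, size = header.split(b" ")
    if kind != b"blob" or int(size) != len(content):
        raise ValueError("not a well-formed blob object")
    return content


version = b"1111\n2222\n4444\n3333\n"
stored = encode_loose_blob(version)

# The whole version comes back without consulting any other version.
print(decode_loose_blob(stored) == version)  # True
```

The object model (every version in full) and the storage encoding (compression) stay independent, which is the separation described above.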

Keith Thompson
  • "Git stores the entire content of each version of every file in the repository". Could you provide any links about this? Many thanks! – WoooHaaaa Jul 01 '13 at 06:44
  • Some details of the compression scheme are described in the [git book](http://git-scm.com/book), but the short version is: individual "files" (this includes tree and commit entries!) are compressed with zlib, and then packs can have delta-compression applied to objects within them. Gory details here: http://stackoverflow.com/questions/9478023/is-the-git-binary-diff-algorithm-delta-storage-standardized – torek Jul 01 '13 at 07:30
  • "Git stores the entire content of each version of every file in the repository" Not exactly. Git doesn't store duplicates of the same object. – Charles Lew Jul 01 '13 at 09:36
  • @Charles Lew, so how does Git define `object`? One line or a single file? – WoooHaaaa Jul 02 '13 at 03:05
  • @MrROY Git has four kinds of `object`s: `blob`, `tree`, `commit`, `tag`. A specific version of a single file is a `blob`. Git never stores single lines. – Charles Lew Jul 02 '13 at 08:19
  • @CharlesLew: Yes, and I mentioned that in the next sentence. Each version of every file is stored; some of them may be stored in the same place. – Keith Thompson Jul 02 '13 at 16:02
1

Git stores only what changes from one version of a project to the next: an incremental difference at the level of whole files, not lines. At any point in time, a file that is identical to some earlier file is recorded as a pointer to the object holding that earlier file's contents. Git hashes file contents to detect when a file has changed and to find matches with earlier versions, so it never has to store the same content more than once.

It also has a simple database that describes all the changes and their relationship.

Here is some documentation on how the repository is organized:

https://www.kernel.org/pub/software/scm/git/docs/gitrepository-layout.html


Additional note about space savings: Git's big space saving comes from not storing the same file twice. Other content managers don't use pointers to a file version the way Git does, and this adds up to a huge saving over the lifetime of a project's versions, since only a few files change from one version of a project to the next.
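A toy model may help show why unchanged files cost nothing extra. The sketch below is not Git's implementation: `ToyStore` and its methods are hypothetical names, and it hashes the raw content directly rather than using Git's object format. It only illustrates snapshots that point at shared content.

```python
import hashlib
from typing import Dict


class ToyStore:
    """Hypothetical content-addressed store: one entry per distinct content."""

    def __init__(self) -> None:
        self.objects: Dict[str, bytes] = {}

    def put(self, content: bytes) -> str:
        key = hashlib.sha1(content).hexdigest()
        # Same content gives the same key, so no second copy is created.
        self.objects[key] = content
        return key

    def snapshot(self, files: Dict[str, bytes]) -> Dict[str, str]:
        """Record a project version as a mapping from path to content key."""
        return {path: self.put(data) for path, data in files.items()}


store = ToyStore()
v1 = store.snapshot({"a.txt": b"1111\n2222\n3333\n", "b.txt": b"unchanged\n"})
v2 = store.snapshot({"a.txt": b"1111\n2222\n4444\n3333\n", "b.txt": b"unchanged\n"})

# Two versions of a.txt plus one shared b.txt: three objects, not four.
print(len(store.objects))          # 3
print(v1["b.txt"] == v2["b.txt"])  # True
```

Each snapshot still describes the whole project, but only the changed file adds a new object.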

Hogan
  • I understand that a file identical to a prior file is recorded only once, but if just a few lines of a file change, its hash will be completely different. Does Git store only the new lines in that case? – WoooHaaaa Jul 01 '13 at 04:26
  • Nope whole file -- But when you transfer (eg when upload to github) it does a compress. The space saving is on a project level compared to CVS and others. – Hogan Jul 01 '13 at 04:32
  • Sorry ... I'm not a native speaker ... "Nope whole file" means, "not whole file" ? :D – WoooHaaaa Jul 01 '13 at 04:57
  • @MrROY: "Nope" is an informal version of "No". I believe what he means is "No, it stores the whole file". – Keith Thompson Jul 01 '13 at 05:13
1

Git stores ordinary files as blob objects, and each version of a file is stored as a separate blob. This has the advantage that you can check out any commit very quickly (instead of backtracking and applying a series of patches).

As for repository size, Git provides an object-packing mechanism and compresses data automatically (or on demand). Size is not a big problem in most cases.
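The idea behind packing can be sketched as a copy/insert delta: one version is described as pieces copied from a similar object plus the bytes that are new. The Python below is a rough illustration only. Git's pack files use their own binary encoding and their own heuristics for choosing a base object, and `make_delta` and `apply_delta` are hypothetical names built on the standard `difflib` module.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, target: bytes) -> list:
    """Describe target as copies from base plus literal inserts."""
    ops = []
    matcher = SequenceMatcher(None, base, target, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            ops.append(("copy", i1, i2 - i1))      # reuse bytes from base
        elif j2 > j1:
            ops.append(("insert", target[j1:j2]))  # store only the new bytes
        # Pure deletions need no instruction: those base bytes are never copied.
    return ops


def apply_delta(base: bytes, ops: list) -> bytes:
    """Rebuild the target version from the base and the delta instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            _, start, length = op
            out += base[start:start + length]
        else:
            out += op[1]
    return bytes(out)


base = b"1111\n2222\n3333\n"
target = b"1111\n2222\n4444\n3333\n"
delta = make_delta(base, target)

print(apply_delta(base, delta) == target)  # True
```

Conceptually each blob is still a complete version. The delta is only a storage detail that is undone when the object is read.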

Charles Lew
  • With the help of `git ls-files` and `git cat-file` you're able to actually SEE the content of each individual object, and you'll see this is true. – Charles Lew Jul 01 '13 at 09:41