19

As I understand it, some VCSs store differences between revisions, because, well, the differences are sometimes small: one line of source code is changed, or a comment is added in a subsequent revision. Git, on the other hand, stores compressed "snapshots" for each revision.

If only a small change has been made (one line in a large text file), how does Git treat this? Does it store two copies that are almost identical? This would be an inefficient use of space, I'd think.

flow2k
  • 1
    Not really a duplicate, but this question explains the answer to your question and then goes a step further: [Are Git's pack files deltas rather than snapshots?](https://stackoverflow.com/questions/5176225/are-gits-pack-files-deltas-rather-than-snapshots) – Greg Hewgill Apr 12 '17 at 03:14
  • 3
    Git does, initially, store two copies that are almost identical. In practice it's not much of a problem. The objects eventually—there is no precise time bound, but almost always before transmission to another Git—get compressed into *pack files* that *do* use delta encoding; see @GregHewgill's link. – torek Apr 12 '17 at 08:17

3 Answers

34

Does it store two copies that are almost identical? This would be an inefficient use of space, I'd think.

Yes, Git does exactly this, at least at first. When you add and commit a file, Git writes a zlib-compressed copy of it under the .git/objects/ tree, with a name based on the SHA-1 of the contents (these are called "loose" objects). You can go look at these files, and it's worthwhile to do so if you are curious about the format.
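To make the naming concrete, here is a minimal Python sketch of how a blob's object ID can be computed, assuming the default SHA-1 object format. The function names are mine, not Git's. Note that Git hashes a small header ("blob", the size, and a NUL byte) ahead of the file contents, so the ID is not the plain SHA-1 of the file.

```python
import hashlib
import zlib


def git_blob_id(content: bytes) -> str:
    """Return the object ID Git assigns to a blob with this content."""
    # Header is "blob <size in bytes>\0", followed by the raw file bytes.
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_bytes(content: bytes) -> bytes:
    """Return what a loose blob file would hold: zlib(header + content)."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return zlib.compress(header + content)


print(git_blob_id(b"hello world\n"))
```

The printed ID should match what `git hash-object` reports for a file with the same bytes.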

The point to remember is that Git is built for speed, and doesn't care very much about the size of the repository data. When Git wants to get an old revision to look at it, all it has to do is read the file as-is from the .git/objects/ tree. No application of deltas, just reading the raw bytes and decompressing them with zlib (which is very fast).
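As a rough illustration of how little work that read involves, here is a sketch that decodes a loose object file by hand. `read_loose_object` is a hypothetical helper, not part of Git, and it assumes the file holds a zlib stream of "<type> <size>\0<body>".

```python
import zlib
from pathlib import Path


def read_loose_object(path: Path) -> tuple[str, bytes]:
    """Decompress a loose object file and return (type, body)."""
    raw = zlib.decompress(path.read_bytes())
    # Everything before the first NUL byte is the header, e.g. b"blob 12".
    header, _, body = raw.partition(b"\0")
    kind, size = header.split(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind.decode("ascii"), body
```

You could point it at any file under .git/objects/ (outside the pack/ and info/ directories) in a repository of your own.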

Now, you would be correct to observe that after you use a repository for a while, the files in .git/objects/ would contain a great many copies of your source files, all just a little bit different. That's where "pack" files come in. When you create a pack file (either automatically or manually), Git collects all the file objects together, sorts them in a way that will compress well, and compresses them into a pack file using a number of different techniques.

One of the techniques used when creating pack files is, indeed, delta compression. Git will notice that two objects look very similar, and store one of the objects in full and the other as a delta against it. Note that this is done purely on a per-object basis, on the raw data, without regard to the order in which things were committed or how your branches are arranged. The low-level pack file format is an implementation detail as far as the rest of Git is concerned.
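The idea of a binary delta can be sketched in a few lines of Python. This is only an illustration: the tuple-based instruction list below is an assumption of mine, not Git's on-disk encoding, but the two kinds of instruction (copy a range from the base, or insert literal bytes) are the general shape of such deltas.

```python
def apply_delta(base: bytes, ops: list[tuple]) -> bytes:
    """Rebuild a target object from a base object and a list of instructions."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            # ("copy", offset, length): reuse bytes already present in the base.
            _, offset, length = op
            out += base[offset:offset + length]
        elif op[0] == "insert":
            # ("insert", data): bytes that exist only in the target.
            out += op[1]
        else:
            raise ValueError(f"unknown instruction: {op[0]!r}")
    return bytes(out)


base = b"line one\nline two\nline three\n"
ops = [
    ("copy", 0, 9),            # "line one\n"
    ("insert", b"line 2\n"),   # the changed middle line
    ("copy", 18, 11),          # "line three\n"
]
print(apply_delta(base, ops).decode())
```

For a one-line change in a large file, the instruction list stays tiny even though the reconstructed object is the full file.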

Remember, Git is still built for speed, so pack files are not necessarily the absolute best compression you can possibly get. There are a lot of heuristics in pack file creation related to tradeoffs between speed and size.

When Git wants to read an object and it's not a "loose" object, it will look in the pack files (which are in .git/objects/pack/) to see if it can be found there. When Git finds the right pack file, it extracts the object from the pack file, applying whatever algorithm (delta resolution, decompression, etc) is needed to reconstruct the original file object. The higher level parts of Git do not care how the pack file stores the data, which is a good separation of concerns and simplifies the application code.

If you want to learn more about this, I suggest reading the Pro Git book, specifically the sections

Greg Hewgill
  • 1
    Greg Hewgill, thank you for the answer. I am indeed working through the Pro Git book, and discussion on the question you linked to earlier. But if I may ask one thing now, that would be: what triggers an automatic creation of pack files? Certainly, Git is not running in the background, so I assume it is done as a side effect (automatically) when I issue some command (not the explicit, manual creation of pack files). – flow2k Apr 12 '17 at 08:59
  • 1
    I don't know the exact rules but there is something like "if you get 1500 or more loose objects, Git will do a garbage collection that *may* create pack files". Obviously the exact rules are that, more exact, but `git gc` may do it. – Lasse V. Karlsen Apr 12 '17 at 09:00
  • 3
    @flow2k: Yes, automatic packing is a side effect of running other commands. See the [`git-gc` documentation](https://git-scm.com/docs/git-gc) (specifically the `--auto` switch) for further information. – Greg Hewgill Apr 12 '17 at 09:01
  • Can you elaborate on the following two points together? You said “No application of deltas, just raw reading bytes with zlib decompression (which is very fast).”, but the Packfiles link says: “but that 033b4 only takes up 9 bytes. What is also interesting is that the second version of the file is the one that is stored intact, whereas the original version is stored as a **delta**”. If it’s stored as a delta, doesn’t this refute the claim that there is no application of deltas? – mfaani Feb 07 '21 at 11:48
8

How Git stores the actual committed files varies over the lifetime of your repository, but let's begin with the basics.

When you commit a file to your repository, a new file containing a complete copy of it is made. The SHA1 is calculated from its contents, and this is the "object id" of this file.

You can find this file under .git\objects\SH\A1-hash

The SH\A1-hash there is my way of indicating that the first two characters of the SHA1 are used as a folder name and the remaining 38 are used as the filename inside that directory.
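In code, that split looks roughly like the following. `loose_object_path` is a made-up helper name, and I use forward slashes so the result prints the same on every platform.

```python
from pathlib import PurePosixPath


def loose_object_path(object_id: str, git_dir: str = ".git") -> PurePosixPath:
    """Map a 40-character SHA-1 hex string to its loose object path."""
    if len(object_id) != 40:
        raise ValueError("expected a 40-character SHA-1 hex string")
    # First two hex characters name the directory, the other 38 the file.
    return PurePosixPath(git_dir) / "objects" / object_id[:2] / object_id[2:]


print(loose_object_path("3b5d02884e6a17f20ed7938bf9e534f1bd0d195e"))
```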

Then you modify this file, add it to the index, and commit it.

This is again stored as a completely new file indexed the exact same way as above.

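This is very easy to test, but bear in mind that whenever you make a commit that changes one file you get at least 3 new Git objects (more if the file sits in a subdirectory, since each directory level has its own tree):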

  • The new version of the file
  • A "tree" object, indicating which version of every file in your index to use for this particular commit
  • The commit object, storing references to its parent(s) and the tree.
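Here is a small sketch of how those three objects relate, assuming the default SHA-1 object format. The helper names and the author identity are invented for illustration, and the tree sorting is simplified (plain sort by name, which is only right when all entries are files).

```python
import hashlib


def object_id(kind: str, body: bytes) -> str:
    """Hash "<kind> <size>\\0<body>", the way Git names every object."""
    header = f"{kind} {len(body)}".encode("ascii") + b"\0"
    return hashlib.sha1(header + body).hexdigest()


def tree_body(entries: list[tuple[str, str, str]]) -> bytes:
    """Build a tree body from (mode, name, hex object id) entries."""
    out = b""
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1]):
        # Each entry is "<mode> <name>\0" followed by the 20 raw ID bytes.
        out += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\0"
        out += bytes.fromhex(hex_id)
    return out


def commit_body(tree: str, parents: list[str], ident: str, message: str) -> bytes:
    """Build a commit body that points at a tree and its parent commits."""
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {ident}", f"committer {ident}", "", message]
    return ("\n".join(lines) + "\n").encode("utf-8")


ident = "A U Thor <author@example.com> 1491986419 +0200"  # hypothetical identity
blob = object_id("blob", b"hello world\n")
tree = object_id("tree", tree_body([("100644", "hello.txt", blob)]))
commit = object_id("commit", commit_body(tree, [], ident, "First copy"))
print(blob, tree, commit, sep="\n")
```

The commit only names the tree, and the tree only names the blob, so changing one file produces a new blob, a new tree, and a new commit while every untouched blob is reused.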

So yes, Git stores files as complete snapshots. Note that these files are zlib-compressed, so two versions take up less space than two raw copies of the file, but they still take up as much space as two complete compressed copies.

If the file being added doesn't lend itself to compression very well (think jpg, png or zip files), then yes, this will take up a lot of space.
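You can get a feel for the difference with zlib directly. In this sketch, random bytes stand in for an already-compressed file such as a jpg or zip; that substitution is my assumption, but it behaves the same way as far as zlib is concerned.

```python
import os
import zlib

# Repetitive source-like text compresses very well.
text = b"int main(void) { return 0; }\n" * 2000
# Random bytes behave like already-compressed data: nothing left to squeeze.
noise = os.urandom(len(text))

for label, data in (("repetitive text", text), ("random bytes", noise)):
    packed = zlib.compress(data)
    print(f"{label}: {len(data)} -> {len(packed)} bytes")
```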

At some point Git may decide to pack your repository, and here Git may decide to use delta-compression (compress and store the differences between files) inside this packfile. However, the rest of Git doesn't see this, as this is an abstraction on top of the underlying file access inside Git. The implementations of the various Git commands will still see the "un-deltified" (if there is such a word) files.

Now, various commands will invariably hide this from you, because most of the Git commands you use, if implemented well, hide all the underlying abstractions and optimizations from you, the developer, and instead focus on what you probably want to see.

So if you look at these files, some of the commands will show diffs, where the underlying files aren't stored as diffs, simply because a diff makes more sense to you, the developer.

If you instead go and use the plumbing commands, you will see more of the blobs.

If you want to see how all this works out in practice, there is just one command you need to know, and that is git cat-file -p SHA1.

Here's a way to test this:

  1. Initialize a new repository
  2. Add a file and commit it
  3. Execute git log and copy the SHA1 of the commit
  4. Execute git cat-file -p SHA1-of-commit and you will see something like this:

    tree d7d68c5b2ecc58da225c953e35b0797a4805b844
    author Lasse Vågsæther Karlsen <lassevagsaether.karlsen@visma.com> 1491986419 +0200
    committer Lasse Vågsæther Karlsen <lassevagsaether.karlsen@visma.com> 1491986419 +0200
    
    First copy
    
  5. Now copy the SHA1 id after tree; this is the object id of the tree object. Then execute git cat-file -p SHA1-of-tree-object, and you will see something like this:

    100644 blob 3b5d02884e6a17f20ed7938bf9e534f1bd0d195e    Temp.7z
    

    This tells you that the tree contains 1 file (1 line), with the filename Temp.7z, and it tells you its SHA1 id. Copy this id.

  6. Execute git cat-file -p SHA1-of-blob and you will see the contents of the file you added.
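If you are curious what step 5 decodes, here is a sketch that turns the raw body of a tree object into lines shaped like the output above. `format_tree` is a hypothetical helper of mine; it assumes SHA-1 object IDs (20 raw bytes per entry) and infers the object type from the mode.

```python
def format_tree(body: bytes) -> list[str]:
    """Render a raw tree body as "<mode> <type> <id>\\t<name>" lines."""
    lines = []
    pos = 0
    while pos < len(body):
        # Each entry is "<mode> <name>\0" followed by 20 raw bytes of object ID.
        space = body.index(b" ", pos)
        nul = body.index(b"\0", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        hex_id = body[nul + 1:nul + 21].hex()
        pos = nul + 21
        if mode == "40000":
            kind = "tree"      # subdirectory
        elif mode == "160000":
            kind = "commit"    # submodule link
        else:
            kind = "blob"      # regular file, executable, or symlink
        lines.append(f"{mode.zfill(6)} {kind} {hex_id}\t{name}")
    return lines


raw = b"100644 Temp.7z\0" + bytes.fromhex("3b5d02884e6a17f20ed7938bf9e534f1bd0d195e")
print("\n".join(format_tree(raw)))
```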

The storage model of Git is not magical or complex at all, but there are a lot of optimizations and abstractions in there for avoiding wasted space, de-duplication, and so on.

Lasse V. Karlsen
0

Git uses patches or hunks. It calculates the diff introduced between the two versions and stores it.

store two copies that are almost identical? This would be an inefficient use of space, I'd think.

Git scans your code (heuristics) and stores only the differences. If Git finds the same code in multiple files, it generates a hunk for the similar code and stores a pointer to it in the original location.

To keep it simple: it's much more complicated than the explanation below, which is simplified so that it is easier to understand.

Once your code is scanned, Git searches for changes from the previous commit; if a change is found, Git splits the old change into a hunk.
If you added code in the middle of a file, it will be split into 3 hunks (top: old code, middle: new code, bottom: old code). The next time Git scans your code, it will use those 3 hunks to search for changes.

For example, let's say that you have a bunch of files with the license agreement at the top of each file, and it is identical in all of your files.
Git will scan the files and the first hunk will be stored as a patch; in all other files Git will place a pointer to this hunk.

This way Git stores the information in a very efficient way.


If you want to see it in action, use git add -p and select s for split.

[Screenshot: output of git add -p]


The patch itself looks like: [image]


As explained above, a hunk is a diff, and here is a little bit about that. Hunk is a term related to diff, and here is how Git displays it visually (as a patch):

The format starts with the same two-line header as the context format, except that the original file is preceded by --- and the new file is preceded by +++.

Following this are one or more change hunks that contain the line differences in the file.
The unchanged, contextual lines are preceded by a space character, addition lines are preceded by a plus sign, and deletion lines are preceded by a minus sign.
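For a quick look at that format without Git, Python's difflib can produce a unified diff. This only illustrates how a patch is displayed; it says nothing about how objects are stored in the repository. The file names and contents are made up.

```python
import difflib

old = ["first line\n", "second line\n", "third line\n"]
new = ["first line\n", "second line, edited\n", "third line\n"]

# unified_diff yields the ---/+++ header, then one or more @@ hunks.
patch = list(difflib.unified_diff(old, new, fromfile="a/file.txt", tofile="b/file.txt"))
print("".join(patch), end="")
```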


More info:

https://github.com/mirage/ocaml-git/blob/master/doc/pack-heuristics.txt

Community
CodeWizard
  • Thanks CodeWizard. "here is how git display it visually (patch):" is the visual display your screenshot above? – flow2k Apr 12 '17 at 03:45
  • 2
    This is the output of the `git add -p` and what you see is a formatted patch. If you wish to see that patch itself use this: `git commit --verbose` on your next commit and you will see the exact patch – CodeWizard Apr 12 '17 at 03:49
  • 7
    This is technically incorrect in two ways: (1) Each snapshot is *logically* independent, and initially, the modified large text file *is* stored separately (which *is* somewhat inefficient). Git does not store diffs. (2) When Git *does* choose to compress objects into pack files and converts the objects into delta-compressed versions, the instructions for modifying one version to produce another are not text diff hunks; they are a customized variant of Xdelta (https://en.wikipedia.org/wiki/Xdelta). – torek Apr 12 '17 at 08:14
  • 3
    I'm afraid this is not at all how Git stores data in the repository. The user interface provided by `git add -p` does not relate to the actual on-disk repository storage format. Your description of "pointers to hunks" does not, to my knowledge, actually occur in Git. For these reasons I must downvote your answer. – Greg Hewgill Apr 12 '17 at 08:22
  • 1
    If you want to see the actual storage objects, don't use `git show`; use `git cat-file -p` instead. It doesn't mangle the output by trying to figure out what you probably want to see, but shows the actual underlying file (it does decompress it, but it doesn't do diffs or similar). – Lasse V. Karlsen Apr 12 '17 at 08:42
  • 1
    And yes, this answer is completely wrong if talking about the *storage model*. The different git commands will *show* diffs and things like that but the underlying storage model does not store diffs "as such". A packfile might use delta-compression but this is somewhat the same but also very different as this is on the binary level. – Lasse V. Karlsen Apr 12 '17 at 08:55
  • 1
    The same disconnect will happen if you come across discussions that conclude that "Git tracks content, not files". Git actually tracks files, but the various Git commands **on top of** the stored files will **analyze** across files. So for instance, if `git blame` shows lines that come from a different file, then this is entirely `git blame`, **not** stored in the repository in any way. – Lasse V. Karlsen Apr 12 '17 at 08:59
  • 1
    Thank you all for clarifying. Appreciated. – CodeWizard Apr 12 '17 at 08:59