How Git stores the actual committed files varies over the lifetime of your repository, but let's begin with the basics.
When you commit a file to your repository, a new file is created that holds a complete copy of it. The SHA1 is calculated from its contents, and this is the "object id" of this file.
You can find this file under .git\objects\SH\A1-hash. The SH\A1-hash there is my way of indicating that the first two characters of the SHA1 are used as a folder name and the remaining 38 are used as the filename inside that directory.
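To make the naming scheme concrete, here is a minimal Python sketch of how an object id and its on-disk path can be derived. One detail not spelled out above: Git hashes a short header ("blob", the size, and a NUL byte) together with the contents, not the bare contents alone. The function names blob_object_id and loose_object_path are made up for this illustration.

```python
import hashlib


def blob_object_id(content: bytes) -> str:
    """SHA-1 over the header 'blob <size>\\0' followed by the raw content."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """First 2 hex characters name the folder, the remaining 38 the file."""
    return ".git/objects/{}/{}".format(object_id[:2], object_id[2:])


if __name__ == "__main__":
    oid = blob_object_id(b"test content\n")
    print(oid)                     # d670460b4b4aece5915caf5c68d12f560a9fe3e4
    print(loose_object_path(oid))  # .git/objects/d6/70460b4b4aece5915caf5c68d12f560a9fe3e4
```

Because the id depends only on the contents, two files with identical contents end up with the same id and are stored once.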
Then you modify this file, add it to the index, and commit it.
This is again stored as a completely new file indexed the exact same way as above.
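This is very easy to test, but bear in mind that whenever you make a commit that changes 1 file you get at least 3 Git objects (more if the file lives in a subdirectory, since every directory on the path gets its own tree):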
- The new version of the file
- A "tree" object, indicating which version of every file in your index to use for this particular commit
- The commit object, storing references to its parent(s) and the tree.
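The relationship between the three objects can be sketched in Python without touching a repository. This is a simplified illustration, not Git's own code: it handles only a flat list of files, the helper names are invented, and the author line is a placeholder.

```python
import hashlib


def object_id(kind: bytes, body: bytes) -> str:
    """Every object is hashed as '<kind> <size>\\0' + body."""
    return hashlib.sha1(kind + b" %d\0" % len(body) + body).hexdigest()


def tree_body(entries):
    """entries: list of (mode, filename, blob id as hex). Flat trees only."""
    body = b""
    for mode, name, oid in sorted(entries, key=lambda e: e[1].encode()):
        # Each entry is '<mode> <name>\0' followed by the 20 raw id bytes.
        body += mode.encode() + b" " + name.encode() + b"\0" + bytes.fromhex(oid)
    return body


def commit_body(tree_id, parent_ids, who, message):
    """The commit stores references to its tree and its parent(s)."""
    lines = ["tree " + tree_id]
    lines += ["parent " + p for p in parent_ids]
    lines += ["author " + who, "committer " + who, "", message]
    return ("\n".join(lines) + "\n").encode()


if __name__ == "__main__":
    who = "A U Thor <author@example.com> 1491986419 +0200"  # placeholder identity
    blob = object_id(b"blob", b"hello\n")
    tree = object_id(b"tree", tree_body([("100644", "hello.txt", blob)]))
    commit = object_id(b"commit", commit_body(tree, [], who, "First copy"))
    print("blob  ", blob)
    print("tree  ", tree)
    print("commit", commit)
```

Changing the file changes the blob id, which changes the tree body and therefore the tree id, which in turn changes the commit id. That chain is why one modified file produces three new objects.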
So yes, Git stores files as complete snapshots. Note that these files are compressed (with zlib), so two versions don't take up quite as much space as two complete copies of the file, but they do take up as much space as two complete compressed copies of it.
If the file being added doesn't lend itself to compression very well (think jpg, png or zip files), then yes, this will take up a lot of space.
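A rough way to see the difference is to run the same kind of compression on both sorts of data. The sketch below uses Python's zlib module with default settings, and random bytes stand in for an already-compressed file such as a jpg or zip. The exact numbers will differ from what Git produces, so treat it as an illustration only.

```python
import os
import zlib


def compressed_size(data: bytes) -> int:
    """Size of the data after zlib compression at the default level."""
    return len(zlib.compress(data))


# Repetitive text compresses very well.
text = b"the quick brown fox jumps over the lazy dog\n" * 1000

# Random bytes behave like already-compressed data: nothing left to squeeze.
noise = os.urandom(len(text))

print("text :", len(text), "->", compressed_size(text))
print("noise:", len(noise), "->", compressed_size(noise))
```

Two versions of the text file cost very little, while two versions of the "noise" file cost roughly twice its full size.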
At some point Git may decide to pack your repository, and here Git may decide to use delta-compression (compress and store the differences between files) inside this packfile. However, the rest of Git doesn't see this, as it is an abstraction on top of the underlying file access inside Git. The implementations of the various Git commands will still see the "un-deltified" (if there is such a word) files.
Now, various commands will invariably hide this from you, because most of the Git commands you use, if implemented well, hide all the underlying abstractions and optimizations from you, the developer, and instead focus on what you probably want to see.
So if you look at these files, some of the commands will show diffs, even though the underlying files aren't stored as diffs, simply because a diff makes more sense to you, the developer.
If you instead go and use the plumbing commands, you will see more of the blobs.
If you want to see how all this works out in practice, there is just 1 command you need to know, and that is git cat-file -p SHA1.
Here's a way to test this:
- Initialize a new repository
- Add a file and commit it
- Execute git log and copy the SHA1 of the commit
- Execute git cat-file -p SHA1-of-commit and you will see something like this:
tree d7d68c5b2ecc58da225c953e35b0797a4805b844
author Lasse Vågsæther Karlsen <lassevagsaether.karlsen@visma.com> 1491986419 +0200
committer Lasse Vågsæther Karlsen <lassevagsaether.karlsen@visma.com> 1491986419 +0200
First copy
- Copy the SHA1 id after tree (this is the object id of the tree object), then execute git cat-file -p SHA1-of-tree-object and you will see something like this:
100644 blob 3b5d02884e6a17f20ed7938bf9e534f1bd0d195e Temp.7z
This tells you that the tree object, the snapshot of your index for this commit, contains 1 file (1 line) with the filename Temp.7z, and it tells you its SHA1 id. Copy this id.
- Execute git cat-file -p SHA1-of-blob and you will see the contents of the file you added.
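If you are curious what cat-file has to do for an object that has not been packed yet, the following Python sketch reads and writes such "loose" object files directly. It is an approximation under stated assumptions: it only understands loose objects, not packfiles, and write_loose_object and read_loose_object are names invented for this example. The demo writes into a temporary directory, so it never touches a real repository.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(git_dir: str, kind: bytes, body: bytes) -> str:
    """Store '<kind> <size>\\0' + body, zlib-compressed, under objects/xx/yyyy..."""
    store = kind + b" %d\0" % len(body) + body
    oid = hashlib.sha1(store).hexdigest()
    folder = os.path.join(git_dir, "objects", oid[:2])
    os.makedirs(folder, exist_ok=True)
    with open(os.path.join(folder, oid[2:]), "wb") as f:
        f.write(zlib.compress(store))
    return oid


def read_loose_object(git_dir: str, oid: str):
    """Return (kind, body) for a loose object, roughly what cat-file decodes."""
    path = os.path.join(git_dir, "objects", oid[:2], oid[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    header, _, body = store.partition(b"\0")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("size in header does not match body length")
    return kind, body


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as git_dir:  # stand-in for a .git folder
        oid = write_loose_object(git_dir, b"blob", b"contents of my file\n")
        print(oid)
        print(read_loose_object(git_dir, oid))
```

Pointing read_loose_object at the .git folder of the test repository from the steps above, with the blob id you copied, should give back the same contents as long as the object has not been moved into a packfile.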
The storage model of Git is not magical or complex at all, but there are a lot of optimizations and abstractions in there to avoid wasting space, de-duplicate content, and so on.