OP: What does snapshot in Git mean? Is it true that Git makes a copy of all the files in each commit?
What does snapshot in Git mean?
In Git, all commits are immutable snapshots of your project (ignored files excluded) at a specific point in time. This means that each and every commit contains a complete representation of your entire project at the time of the commit, not just the files that were modified or added (deltas). Apart from references to the actual files, each commit also carries relevant metadata such as the commit message, author (incl. timestamp), committer (incl. timestamp), and references to parent commit(s); all of which are immutable!
Since the commit (or commit object, as it is formally called) is immutable in its entirety, none of its content can be modified once it has been created. Commands that appear to edit a commit, such as git commit --amend, actually create a new commit with a new hash and leave the original object untouched.
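To make the immutability point concrete, here is a minimal Python sketch of how a commit's key depends on everything inside it. It assumes the plain commit layout (no signature or encoding headers); the function name commit_id and the author string are made up for illustration, and the tree key used is Git's well-known empty tree.

```python
import hashlib

def commit_id(tree: str, parents: list[str], author: str,
              committer: str, message: str) -> str:
    # Body layout: tree, parent(s), author, committer, blank line, message.
    lines = [f"tree {tree}"]
    lines += [f"parent {p}" for p in parents]
    lines += [f"author {author}", f"committer {committer}", "", message]
    body = "\n".join(lines).encode()
    # Git hashes "commit <size>\0" followed by the body.
    header = f"commit {len(body)}\0".encode()
    return hashlib.sha1(header + body).hexdigest()

EMPTY_TREE = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"  # Git's empty tree
WHO = "Jane Doe <jane@example.com> 1700000000 +0000"     # hypothetical author

original = commit_id(EMPTY_TREE, [], WHO, WHO, "Initial commit\n")
reworded = commit_id(EMPTY_TREE, [], WHO, WHO, "Initial commit!\n")

print(original)
print(reworded)  # a different key, i.e. a different commit object
```

Because the message, the timestamps, the tree, and the parents all feed into the hash, changing any one of them necessarily yields a different commit rather than an edited one.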
How Git stores files internally
From the Pro Git book we learn that:
Git is a content-addressable filesystem. Great. What does that mean? It means that at the core of Git is a simple key-value data store. What this means is that you can insert any kind of content into a Git repository, for which Git will hand you back a unique key you can use later to retrieve that content.
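As a rough illustration of that key-value idea, here is a toy in-memory store in Python. The ObjectStore class is hypothetical and only mimics how Git derives the key (a SHA-1 over a small header plus the content); real Git also zlib-compresses objects and writes them under .git/objects, which this sketch leaves out.

```python
import hashlib

class ObjectStore:
    """Toy in-memory stand-in for Git's object database (hypothetical)."""

    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, content: bytes, kind: str = "blob") -> str:
        # Git hashes "<type> <size>\0" followed by the raw content.
        data = f"{kind} {len(content)}\0".encode() + content
        key = hashlib.sha1(data).hexdigest()
        self._objects[key] = content
        return key

    def get(self, key: str) -> bytes:
        return self._objects[key]

store = ObjectStore()
key = store.put(b"console.log('hi');\n")
print(store.get(key))   # b"console.log('hi');\n"
print(store.put(b""))   # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The key is derived purely from the content, so the same content always maps to the same key no matter which file it came from.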
So let's look at the illustration below to figure out what the Pro Git statement really means, and how Git stores data (and particularly files) internally.
A simple commit history containing three commits, including an overview of how the actual data (files and directories) is stored inside Git. On the left-hand side the actual snapshot is displayed, with the "delta change" compared to the previous commit highlighted in green. On the far right are the internal objects used for storage.
Git makes use of three main objects in its internal storage:
- Commit object (High-level snapshot container)
- Tree object (Low-level filename/directory container)
- Blob object (Low-level file content container)
To store a file inside Git in a general sense (i.e. content + filename/directory), one blob and one tree are needed: the blob stores just the file content, and the tree stores the filename/directory and references the blob. To construct nested directories, multiple trees are used; a tree can hence reference both blobs and other trees. From a high-level perspective you don't have to worry about blobs and trees, as Git creates them automatically as part of the commit process.
Note: Git computes all hashes (keys) bottom up, starting with the blobs, moving past any sub trees, and ultimately arriving at the root tree, feeding each key as input to its direct parent. This process produces the structure visualized above, which is known in mathematics and computer science as a Directed Acyclic Graph (DAG), i.e. all references move in one direction only, without any cyclical dependencies.
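The bottom-up computation can be sketched in a few lines of Python. This is an illustration under assumptions, not Git's actual code: hash_node is a made-up helper, directories are modelled as nested dicts, every file is given the regular-file mode 100644, and the file contents are invented.

```python
import hashlib

def git_hash(kind: str, body: bytes) -> str:
    # Key = SHA-1 over "<type> <size>\0" + body.
    return hashlib.sha1(f"{kind} {len(body)}\0".encode() + body).hexdigest()

def hash_node(node) -> tuple[str, str]:
    """Return (mode, key) for a file (bytes) or a directory (dict)."""
    if isinstance(node, bytes):
        return "100644", git_hash("blob", node)
    entries = []
    for name, child in node.items():
        mode, key = hash_node(child)  # children are hashed before the parent
        # Git sorts directory names as if they ended with "/".
        sort_key = (name + "/" if mode == "40000" else name).encode()
        entry = f"{mode} {name}\0".encode() + bytes.fromhex(key)
        entries.append((sort_key, entry))
    body = b"".join(entry for _, entry in sorted(entries))
    return "40000", git_hash("tree", body)

# Hypothetical snapshots: only .gitignore differs between them.
before = {".gitignore": b"", "src": {"index.js": b""}}
after = {".gitignore": b"node_modules\n", "src": {"index.js": b""}}

print(hash_node(before["src"]) == hash_node(after["src"]))  # True
print(hash_node(before) == hash_node(after))                # False
```

Since each tree's body embeds the keys of its children, a change in any blob ripples upward to the root tree, while untouched sub trees keep their old key.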
Analyzing the visualized example a bit further
Scrutinizing the history above, we see that the initial commit C0 added two empty files, src/index.js and .gitignore, but only one blob got created! That's because Git only stores unique content: the two empty files have identical content and therefore produce the same hash (e69de), so only one entry was needed. However, to keep track of the two different filenames and paths, two trees were created: a sub tree for the src directory and the root tree. Each tree gets a unique hash (key) computed from the names and object keys it references.
Continuing upwards to the second commit C1, we see that only the .gitignore file was updated, producing a new blob (e51ac) containing that data. The root tree still uses the same sub tree reference for the src/index.js file. However, the root tree itself is a brand new object with a new hash (key), simply because the underlying .gitignore reference changed.
In the final commit C2, only the src/index.js file was updated and a new blob (257cc) emerged, forcing the creation of a new sub tree (5de32) and ultimately a new root tree (07eff).
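The walkthrough of C0, C1 and C2 can be replayed with a small Python simulation that counts how many new objects each snapshot adds. Treat it as a sketch: the file contents for C1 and C2 are invented (so the keys will not match the ones in the illustration), entries are sorted by plain name, and commit objects are left out so that only blobs and trees are counted.

```python
import hashlib

objects: dict[str, bytes] = {}  # toy object database: key -> raw body

def store(kind: str, body: bytes) -> str:
    key = hashlib.sha1(f"{kind} {len(body)}\0".encode() + body).hexdigest()
    objects.setdefault(key, body)  # identical content is stored only once
    return key

def write_tree(directory: dict) -> str:
    entries = []
    for name in sorted(directory):
        child = directory[name]
        if isinstance(child, dict):
            mode, key = "40000", write_tree(child)
        else:
            mode, key = "100644", store("blob", child)
        entries.append(f"{mode} {name}\0".encode() + bytes.fromhex(key))
    return store("tree", b"".join(entries))

snapshots = [
    {".gitignore": b"", "src": {"index.js": b""}},                      # C0
    {".gitignore": b"node_modules\n", "src": {"index.js": b""}},         # C1
    {".gitignore": b"node_modules\n", "src": {"index.js": b"// hi\n"}},  # C2
]

new_per_commit = []
for snapshot in snapshots:
    count_before = len(objects)
    write_tree(snapshot)
    new_per_commit.append(len(objects) - count_before)

print(new_per_commit)  # [3, 2, 3]
```

The counts line up with the description: C0 adds one blob and two trees, C1 adds one blob and a new root tree, and C2 adds one blob, a new sub tree and a new root tree. Everything else is reused by reference.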
In summary
Every time a new commit is created, a snapshot of your entire project is recorded and stored in the internal database as part of a DAG data structure. Whenever a commit is checked out, your Working Tree is reconstructed to match the state that the commit's snapshot references through its root tree.
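For the checkout direction, a simplified Python sketch of walking a root tree back into files could look like this. It is deliberately not Git's format: the keys are made-up labels rather than real hashes, trees are plain dicts, the file contents are invented, and the result is an in-memory {path: content} mapping instead of files on disk.

```python
# Toy object database. Keys are made-up labels, not real Git hashes,
# and trees are plain dicts instead of Git's binary tree format.
objects = {
    "blob-a": b"node_modules\n",
    "blob-b": b"console.log('hi');\n",
    "tree-src": {"index.js": ("blob", "blob-b")},
    "tree-root": {
        ".gitignore": ("blob", "blob-a"),
        "src": ("tree", "tree-src"),
    },
}

def checkout(tree_key: str, prefix: str = "") -> dict[str, bytes]:
    """Rebuild a flat {path: content} view of the snapshot under tree_key."""
    files: dict[str, bytes] = {}
    for name, (kind, key) in objects[tree_key].items():
        path = prefix + name
        if kind == "tree":
            files.update(checkout(key, path + "/"))  # descend into sub tree
        else:
            files[path] = objects[key]               # blob: file content
    return files

working_tree = checkout("tree-root")
for path in sorted(working_tree):
    print(path)
```

Starting from the root tree alone is enough to recover every path and its content, which is why a commit only needs to point at a single root tree.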
Source: The above excerpt is taken from this full-length post on the subject: Immutable Snapshots - One of Git's Core Concepts