git exile
is not part of Git. It's pretty clear from ElpieKay's link that it is similar in some ways to Git-LFS (which is also not part of Git), and that it does roughly what you described in your "added" section:
I assume that git exile push
takes big files, copy their content to a location that is suited for holding larger files and then it replace the content of the original files by the link to their copies. So, in other words, the content of the files will be replaced by the link to the copy of their content.
This is correct in terms of goals, but not in terms of underlying mechanism.
For Git-LFS the goal is based on file size, and Git-LFS has a lot of code in it that makes this work. For Git-Exile (which I have not used, nor examined in fine detail; I only took a quick look at the code) the goal is based on "binary-ness" rather than size, and you must choose which files to claim are binary by name-pattern. That is, you might say *.jpg
and/or *.exe
are to be treated as binary.
Now let's take on the details.
Your work-tree, your commits, and your branch names
You already know that Git's commits store files ("snapshots"). If you don't already know this, go read something that describes how that part works. To keep things small-ish, Git stores the files in a special, Git-only form that only Git can deal with. You need to have the files in a non-Gitty form so that you can work with them. So Git copies the files out of the snapshot into a work-tree, which is the area where you do your work.
But now consider this rather stark fact: Commits are entirely read-only. You can never change the contents of any existing commit. You can read them out any time you like. You can make a new (and different) commit, leaving the existing commits alone. You can't change a commit, ever.
Each commit is identified by a big, ugly, apparently-random hash ID like e3a80781f5932f5fea12a49eb06f3ade4ed8945c
(this is a commit in the Git repository for Git itself). These IDs are basically unusable by humans, so we pick some important commit, such as the most recent commit on a branch, and give it a name like master
. The name-to-commit-hash mapping will change over time: every time we add a new commit to a repository, Git will assign it a new, unique hash ID. If we just added that new commit to the master
branch, Git will store the new ID into the name master
, so that the name always identifies the latest commit!
Each commit, once made, is fixed forever. It also stores the hash ID of the previous commit (and stores that forever since nothing can change the commit). So using the most recent commit, which we find by the name master
, we can work backwards to find an earlier commit:
<-C <--master
We just follow the arrow (the hash ID) coming out of commit C
to find the earlier commit:
<-B <-C <--master
Now there's an arrow (a parent hash ID, really) coming out of B
too, so we find the earlier commit:
A <-B <-C <--master
and in our tiny example repository, there are only three commits: A
is the first one ever made, so it has no parent arrow / hash-ID, and we know we can stop chasing parent links.
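The backwards walk can be sketched in a few lines of Python. This is only a toy model, not how Git stores anything: the single-letter IDs stand in for real hash IDs, and the names commits, branches, and history are made up for the illustration.

```python
# Toy model (not real Git): each commit records its parent's ID,
# and a branch name records the ID of the latest commit.
commits = {
    "A": {"parent": None, "message": "first commit"},
    "B": {"parent": "A", "message": "second commit"},
    "C": {"parent": "B", "message": "third commit"},
}
branches = {"master": "C"}


def history(branch):
    """Return commit IDs from the branch tip back to the root commit."""
    result = []
    commit_id = branches[branch]
    while commit_id is not None:  # the root commit has no parent, so we stop there
        result.append(commit_id)
        commit_id = commits[commit_id]["parent"]
    return result


print(history("master"))  # ['C', 'B', 'A']
```

Note that nothing in the walk needs to know how many commits there are: the name finds the last one, and each commit finds the one before it.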
The work-tree is pretty straightforward, but it's not a commit, and a commit is not the work-tree. Git can extract a commit into the work-tree, and—eventually, sort of—save a work-tree into a new commit, but to do so, Git insists on going through its index. Other version control systems don't have an index, or if they do have something that works like the index, they keep it completely hidden and you don't have to know about it. Git goes the opposite direction.
The index
This all means that whenever you work with Git, you must be aware of, and use, what Git calls the index, or—depending on who is doing the calling and what they want to emphasize—the staging area or the cache. These are three names for one single thing. That one thing is so important that it winds up with these three names! Well, that, or the first one, "index", is such a terrible name... :-) Seriously, though, the index is constantly getting in your face and making you understand that it stands between you and your commits.
To put it as simply as possible, Git's index contains the files that will go into the next commit you make. This means that the index starts out holding all the files that are in the current commit.
When you run git commit
, Git packages up whatever is in the index right now, and makes a new commit from those files, however they appear in the index right now. The index might have different stuff in it later, but at the time you run git commit
, Git takes what's in it, packages it up, and makes a new commit.
The new commit points back to the current commit. So if we have our simple three-commit repository as above:
A--B--C <-- master (HEAD)
and we make a new commit D
while our HEAD
is attached to branch master
so that the current commit is commit C
, the new commit will point back to C
, and Git will make the name master
point to D
:
A--B--C--D <-- master (HEAD)
and that's how branches grow.
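Here is a rough Python sketch of that commit step, under heavy simplifying assumptions: the index is a plain dictionary, the commit ID is a hash of a made-up payload (real Git hashes a differently structured commit object), and every name in it is hypothetical.

```python
import hashlib

commits = {}                  # commit ID -> {"parent", "files", "message"}
branches = {"master": None}   # branch name -> ID of its latest commit
head = "master"               # HEAD is attached to this branch
index = {}                    # file name -> content proposed for the next commit


def commit(message):
    """Freeze the index into a new commit and move the current branch to it."""
    parent = branches[head]
    snapshot = dict(index)    # copy, so later index changes cannot alter this commit
    # Toy ID only: real Git hashes a commit object with its own format.
    payload = repr((parent, sorted(snapshot.items()), message)).encode()
    commit_id = hashlib.sha1(payload).hexdigest()
    commits[commit_id] = {"parent": parent, "files": snapshot, "message": message}
    branches[head] = commit_id
    return commit_id


index["README.txt"] = "hello\n"
first_id = commit("add README")
index["README.txt"] = "hello, world\n"
second_id = commit("update README")

print(commits[second_id]["parent"] == first_id)  # True
print(branches["master"] == second_id)           # True
```

The first commit still holds the old README.txt content even though the index has moved on, which is the "whatever is in the index right now" behavior described above.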
So how do you get files into the index?
Since this index-aka-staging-area is so important, you need to know how to get files into the index. Sure, it starts out with files from the current commit, courtesy of git checkout
, but then what?
The what part is mostly git add
. Running:
git add README.txt
tells Git to package up the contents of README.txt
from your work-tree, turn it into special Git-only format, and stuff that into the index under the name README.txt
.
This means that the file-flow, in Git, goes like this:
commit —> index <—> work-tree
Using git checkout
, you copy files from some commit—usually the current commit—into the index, where they keep their special Git-only format but now become write-able; and then from the index to the work-tree, where they turn into normal format. Using git add
, you copy files from the work-tree into the index, overwriting the copy that was there before and turning the file back into the special Git-only format.
Eventually, you run git commit
to package up the index into a commit. The commit saves whatever is in the index, which is already converted into a Git-only format, so this part is really easy. Git just makes sure that the file sticks around forever as part of the commit, i.e., that a future git add
that overwrites the index version doesn't overwrite or throw out the committed version. The underlying mechanism used for Git-only format (hashing with "garbage collection") makes this trivial.
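To make "Git-only format" a bit more concrete, here is a small Python sketch of how a file's contents become a blob object. It assumes a repository using the default SHA-1 object format and shows only the loose-object form (pack files are different); the function names are mine, not Git's.

```python
import hashlib
import zlib


def git_blob_header(content: bytes) -> bytes:
    """Git prefixes blob data with 'blob <size>' and a NUL byte."""
    return b"blob " + str(len(content)).encode() + b"\0"


def git_blob_id(content: bytes) -> str:
    """Hash ID of a blob: SHA-1 over the header plus the raw contents."""
    return hashlib.sha1(git_blob_header(content) + content).hexdigest()


def git_loose_object(content: bytes) -> bytes:
    """Bytes of a loose object: header plus contents, zlib-compressed."""
    return zlib.compress(git_blob_header(content) + content)


print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the ID is computed from the contents, identical contents always get the same ID, and an object that some commit still refers to is simply never collected.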
Smudge and clean filters
There's an interesting point hidden in all of the above: Git has to copy files from the index into the work-tree, expanding out the Git-only format to normal format. And, Git has to copy files from the work-tree into the index, compressing them down into Git-only format. What if we did something sneaky during the copying?
Git provides its own internal filters here, such as converting between CRLF and LF-only line endings, or expanding $Id$
to contain a hash ID. These filters mean that what's in the index and what's in the work-tree no longer actually match up. The index version of the file isn't just a compressed version of the work-tree file. It's a modified version, or a replacement version.
This is how both Git-LFS and Git-Exile work. They add filters that operate during the "extract from index to work-tree" step, and that operate during the "compress from work-tree into index" step. These filters, rather than just swapping CRLF and LF-only endings or expanding or compressing away $Id$
strings, actually swap the entire file contents.
During git add
, the large or binary file never goes into the index at all. The LFS or Exile filter saves the real file somewhere else, and puts a link into Git instead. Git calls this a clean filter: it cleans up the icky work-tree file into a nice clean index version.
During git checkout
, the large or binary file isn't in the index, but the LFS or Exile filter takes the link and finds the real file from somewhere else, and puts that file into the work-tree for you. Git calls this a smudge filter: it takes the nice clean committed version out of the index and dirties it up to make the icky work-tree file.
The mechanism for invoking smudge and clean filters is that you put file name glob patterns into a .gitattributes
file, along with a filter=
directive. This is described in the gitattributes
documentation under the filter
section. Git-LFS works by matching the file name patterns you register with git lfs track (it does not check each file's size itself), using Git's long running filter process trick to reduce overhead. Git-Exile also works by matching just the interesting files, using the much simpler per-file filter method.
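As a minimal sketch of the idea, and not the actual code of Git-LFS or Git-Exile, here is a clean/smudge pair in Python. The pointer format, the store location, and the function names are all invented for the illustration. A real per-file filter would read the file from standard input and write the result to standard output, and would be hooked up with a .gitattributes line such as "*.jpg filter=demo" plus the filter.demo.clean and filter.demo.smudge configuration settings ("demo" being a made-up driver name).

```python
import hashlib
import os
import tempfile

# Hypothetical side store for the real contents; real tools use their own locations.
STORE = os.path.join(tempfile.gettempdir(), "demo-filter-store")
PREFIX = b"demo-pointer sha256:"  # made-up pointer format


def clean(data: bytes) -> bytes:
    """Work-tree to index: save the real contents aside, return a small link."""
    digest = hashlib.sha256(data).hexdigest()
    os.makedirs(STORE, exist_ok=True)
    with open(os.path.join(STORE, digest), "wb") as f:
        f.write(data)
    return PREFIX + digest.encode() + b"\n"


def smudge(pointer: bytes) -> bytes:
    """Index to work-tree: follow the link and return the real contents."""
    if not pointer.startswith(PREFIX):
        return pointer  # not one of our links, so pass it through unchanged
    digest = pointer[len(PREFIX):].strip().decode()
    with open(os.path.join(STORE, digest), "rb") as f:
        return f.read()


original = b"\x89PNG pretend these are image bytes"
link = clean(original)          # this small link is what Git would store
restored = smudge(link)         # this is what would land in the work-tree
print(restored == original)     # True
```

Git itself only ever sees the short link, so the repository stays small no matter how big the real file is. The price is that the smudge step fails if the side store does not have the contents.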
When (and where) should the moved files be saved?
Usually we commit and then push. Why do we do it differently here (first push and then commit)?
With Git-LFS, the large files that aren't in the index are kept in a local cache inside the repository and uploaded to the LFS server when you run git push. With Git-Exile, the large files are stuffed into a secondary repository (if I read the code and description correctly).
The git exile push
step pushes the moved files to the associated secondary repository. You don't necessarily have to do this first; it's just a good idea, in case someone grabs your linking objects before you get a chance to push the moved files. (That someone could even be you. The work-tree files are still there, but if you invoke your smudge filter on the index entry that has only the link, it will look for the moved files.)
Summary
Now we can see how this is right in terms of idea, but wrong in terms of execution:
I assume that git exile push
takes big files, copy their content to a location that is suited for holding larger files and then it replace the content of the original files by the link to their copies. So, in other words, the content of the files will be replaced by the link to the copy of their content.
The replacement actually happens at git add
time! The replacement in the other direction, swapping the link-only version for the real file contents, happens during git checkout
.