I have a git repository and I am instructed to perform the following sequence of actions:

  1. Copy a given set of files from a folder to the above mentioned git repository (the "source folder" is not a part of the repository).
  2. Execute git add .
  3. Execute git exile push folder_name/
  4. Execute git commit -m 'Commit message'

Now I want to understand what I am actually doing. To be more specific, the first two steps are clear to me (I change something in the repo and then I add these changes to the "staging area", so they are ready for git commit). However, the last two steps (3 and 4) are confusing and I have the following questions about them:

  1. Usually we commit and then push. Why do we do it differently here (first push and then commit)?
  2. Instead of git push we use git exile push. What is the difference between these two? Where does it push to? What does it push?

I heard that it has something to do with large files. Instead of using them "explicitly" we work with their "references" (or "links" to them). But what exactly does that mean?

ADDED

I assume that git exile push takes big files, copies their content to a location that is suited for holding larger files, and then replaces the content of the original files with a link to the copies. So, in other words, the content of the files will be replaced by a link to the copy of their content. After that, git exile push executes git add. So it changes the files and adds them to the staging area, and the only thing that I need to do is git commit.

Is my interpretation correct and complete?

antzshrek
Roman
  • https://github.com/patstam/git-exile#how-it-works – ElpieKay Mar 17 '18 at 10:11
  • Refer to [this](https://github.com/patstam/git-exile#push) and [this](https://stackoverflow.com/questions/2745076/what-are-the-differences-between-git-commit-and-git-push). – amanb Mar 17 '18 at 10:11
  • @ElpieKay, the linked document assumes a quite advanced reader who knows all the terminology. I would like to have a basic understanding of what happens, expressed in simple terms (on the same level of simplicity as the explanation that I put after the ADDED). – Roman Mar 17 '18 at 10:57

1 Answer

git exile is not part of Git. ElpieKay's link makes it pretty clear that it is similar in some ways to Git-LFS (which is also not part of Git). Both do roughly what you described in your "added" section:

I assume that git exile push takes big files, copies their content to a location that is suited for holding larger files, and then replaces the content of the original files with a link to the copies. So, in other words, the content of the files will be replaced by a link to the copy of their content.

This is correct in terms of goals, but not in terms of underlying mechanism.

For Git-LFS the goal is based on file size, and Git-LFS has a lot of code in it that makes this work. For Git-Exile (which I have not used, nor examined in fine detail; I did a quick eyeball of the code) the goal is based on "binary-ness" rather than size, and you must choose which files to claim are binary by name-pattern. That is, you might say *.jpg and/or *.exe are to be treated as binary.

Now let's take on the details.

Your work-tree, your commits, and your branch names

You already know that Git's commits store files ("snapshots"). If you don't already know this, go read something that describes how that part works. To keep things small-ish, Git stores the files in a special, Git-only form that only Git can deal with. You need to have the files in a non-Gitty form so that you can work with them. So Git copies the files out of the snapshot into a work-tree, which is the area where you do your work.

But now consider this rather stark fact: Commits are entirely read-only. You can never change the contents of any existing commit. You can read them out any time you like. You can make a new (and different) commit, leaving the existing commits alone. You can't change a commit, ever.

Each commit is identified by a big, ugly, apparently-random hash ID like e3a80781f5932f5fea12a49eb06f3ade4ed8945c (this is a commit in the Git repository for Git itself). These IDs are basically unusable by humans, so we pick some important commit, such as the most recent commit on a branch, and give it a name like master. The name-to-commit-hash mapping will change over time: every time we add a new commit to a repository, Git will assign it a new, unique hash ID. If we just added that new commit to the master branch, Git will store the new ID into the name master, so that the name always identifies the latest commit!

Each commit, once made, is fixed forever. It also stores the hash ID of the previous commit (and stores that forever since nothing can change the commit). So using the most recent commit, which we find by the name master, we can work backwards to find an earlier commit:

      <-C   <--master

We just follow the arrow (the hash ID) coming out of commit C to find the earlier commit:

  <-B <-C   <--master

Now there's an arrow (a parent hash ID, really) coming out of B too, so we find the earlier commit:

A <-B <-C   <--master

and in our tiny example repository, there are only three commits: A is the first one ever made, so it has no parent arrow / hash-ID, and we know we can stop chasing parent links.
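
The backwards walk can be sketched in a few lines of Python. This is only a toy model: the single-letter IDs stand in for real hash IDs, and the parents and branches dictionaries are names made up for the illustration, not Git data structures.

```python
# Toy model: each commit records its parent's ID (None for the root commit).
parents = {"A": None, "B": "A", "C": "B"}

# A branch name simply stores the ID of the most recent commit.
branches = {"master": "C"}


def history(branch):
    """Walk from the branch tip back to the root, following parent links."""
    commit = branches[branch]
    chain = []
    while commit is not None:
        chain.append(commit)
        commit = parents[commit]
    return chain


print(history("master"))  # ['C', 'B', 'A']
```

The loop stops at A because A has no parent, which is exactly the "we can stop chasing parent links" condition.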

The work-tree is pretty straightforward, but it's not a commit, and a commit is not the work-tree. Git can extract a commit into the work-tree, and—eventually, sort of—save a work-tree into a new commit, but to do so, Git insists on going through its index. Other version control systems don't have an index, or if they do have something that works like the index, they keep it completely hidden and you don't have to know about it. Git goes the opposite direction.

The index

This all means that whenever you work with Git, you must be aware of, and use, what Git calls the index, or—depending on who is doing the calling and what they want to emphasize—the staging area or the cache. These are three names for one single thing. That one thing is so important that it winds up with these three names! Well, that, or the first one, "index", is such a terrible name... :-) Seriously, though, the index is constantly getting in your face and making you understand that it stands between you and your commits.

To put it as simply as possible, Git's index contains the files that will go into the next commit you make. This means that the index starts out holding all the files that are in the current commit.

When you run git commit, Git packages up whatever is in the index right now, and makes a new commit from those files, however they appear in the index right now. The index might have different stuff in it later, but at the time you run git commit, Git takes what's in it, packages it up, and makes a new commit.

The new commit points back to the current commit. So if we have our simple three-commit repository as above:

A--B--C   <-- master (HEAD)

and we make a new commit D while our HEAD is attached to branch master so that the current commit is commit C, the new commit will point back to C, and Git will make the name master point to D:

A--B--C--D   <-- master (HEAD)

and that's how branches grow.
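
As a rough illustration of "commit packages up the index", here is a toy Python model. ToyRepo, its method names, and the c1, c2, ... IDs are all hypothetical; real Git uses hash IDs and a far more elaborate on-disk format.

```python
import itertools

_ids = itertools.count(1)  # hypothetical ID source; Git uses hash IDs instead


class ToyRepo:
    def __init__(self):
        self.commits = {}   # commit ID -> (parent ID, frozen snapshot)
        self.master = None  # the branch name: ID of the latest commit
        self.index = {}     # path -> content proposed for the next commit

    def add(self, path, content):
        """Copy a file's content into the index, replacing any earlier copy."""
        self.index[path] = content

    def commit(self):
        """Freeze the index as a new commit whose parent is the current tip."""
        commit_id = f"c{next(_ids)}"
        self.commits[commit_id] = (self.master, dict(self.index))
        self.master = commit_id  # the branch name now points at the new commit
        return commit_id


repo = ToyRepo()
repo.add("README.txt", "v1")
first = repo.commit()
repo.add("README.txt", "v2")
second = repo.commit()
print(repo.commits[second][0] == first)  # True: the new commit points back
print(repo.commits[first][1])            # {'README.txt': 'v1'}
```

Note that the second add changed only the index; the snapshot stored in the first commit is untouched.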

So how do you get files into the index?

Since this index-aka-staging-area is so important, you need to know how to get files into the index. Sure, it starts out with files from the current commit, courtesy of git checkout, but then what?

The what part is mostly git add. Running:

git add README.txt

tells Git to package up the contents of README.txt from your work-tree, turn it into special Git-only format, and stuff that into the index under the name README.txt.

This means that the file-flow, in Git, goes like this:

    commit  —>  index  <—>  work-tree

Using git checkout, you copy files from some commit—usually the current commit—into the index, where they keep their special Git-only format but now become write-able; and then from the index to the work-tree, where they turn into normal format. Using git add, you copy files from the work-tree into the index, overwriting the copy that was there before and turning the file back into the special Git-only format.

Eventually, you run git commit to package up the index into a commit. The commit saves whatever is in the index, which is already converted into a Git-only format, so this part is really easy. Git just makes sure that the file sticks around forever as part of the commit, i.e., that a future git add that overwrites the index version doesn't overwrite or throw out the committed version. The underlying mechanism used for Git-only format (hashing with "garbage collection") makes this trivial.
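
The hashing part can be illustrated with Python's standard library. In a SHA-1 repository, Git names a file's content by hashing a short header followed by the bytes; the function names blob_id and stored_form are made up for this sketch, and packing and SHA-256 repositories are ignored.

```python
import hashlib
import zlib


def blob_id(content: bytes) -> str:
    """Return the ID Git assigns to file content in a SHA-1 repository."""
    header = b"blob %d\0" % len(content)
    return hashlib.sha1(header + content).hexdigest()


def stored_form(content: bytes) -> bytes:
    """A loose object is the same header-plus-content bytes, zlib-compressed."""
    return zlib.compress(b"blob %d\0" % len(content) + content)


print(blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

Because the ID depends only on the content, a later git add of different content produces a different object and cannot overwrite the one a commit already refers to.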

Smudge and clean filters

There's an interesting point hidden in all of the above: Git has to copy files from the index into the work-tree, expanding out the Git-only format to normal format. And, Git has to copy files from the work-tree into the index, compressing them down into Git-only format. What if we did something sneaky during the copying?

Git provides its own internal filters here, such as doing CR-LF line endings instead of LF-only line endings, or expanding $Id$ to contain a hash ID. These filters mean that what's in the index and what's in the work-tree no longer actually match up. The index version of the file isn't just a compressed version of the work-tree file. It's a modified version, or a replacement version.

This is how both Git-LFS and Git-Exile work. They add filters that operate during the "extract from index to work-tree" step, and that operate during the "compress from work-tree into index" step. These filters, rather than just swapping CRLF and LF-only endings or expanding or compressing away $Id$ strings, actually swap the entire file contents.

During git add, the large or binary file never goes into the index at all. The LFS or Exile filter saves the real file somewhere else, and puts a link into Git instead. Git calls this a clean filter: it cleans up the icky work-tree file into a nice clean index version.

During git checkout, the large or binary file isn't in the index, but the LFS or Exile filter takes the link and finds the real file from somewhere else, and puts that file into the work-tree for you. Git calls this a smudge filter: it takes the nice clean committed version out of the index and dirties it up to make the icky work-tree file.
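
A minimal sketch of such a filter pair, in Python, might look like the following. The "pointer <hash>" format and the in-memory external_store are assumptions chosen for the illustration; they are not the actual formats used by Git-LFS or Git-Exile.

```python
import hashlib

external_store = {}  # stands in for wherever the filter keeps the real files


def clean(work_tree_bytes: bytes) -> bytes:
    """Work-tree -> index: stash the real content, hand Git a small pointer."""
    key = hashlib.sha256(work_tree_bytes).hexdigest()
    external_store[key] = work_tree_bytes
    return f"pointer {key}\n".encode()


def smudge(index_bytes: bytes) -> bytes:
    """Index -> work-tree: resolve the pointer back to the real content."""
    key = index_bytes.decode().split()[1]
    return external_store[key]


photo = b"\x89PNG" + bytes(1000)  # pretend this is a large binary file
link = clean(photo)               # what git add would put in the index
assert len(link) < len(photo)     # Git only ever sees the small pointer
assert smudge(link) == photo      # what git checkout would write out
```

Git runs a real filter as an external command that reads the file on standard input and writes the replacement on standard output; the two functions above only model that transformation.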

The mechanism for invoking smudge and clean filters is that you put file name glob patterns into a .gitattributes file, with a filter= directive. This is described in the gitattributes documentation under the filter section. Git-LFS is also driven by .gitattributes patterns (git lfs track writes them for you) rather than by measuring each file, and it uses Git's long running filter process trick to reduce overhead. Git-Exile works by matching just the interesting files, using the much simpler per-file filter method.

When (and where) should the moved files be saved?

Usually we commit and then push. Why do we do it differently here (first push and then commit)?

With Git-LFS, the large files that aren't in the index are kept in a local cache and uploaded to the LFS server automatically when you run git push (via a pre-push hook), so there is no separate step. With Git-Exile, the large files are stuffed into a secondary repository (if I read the code and description correctly).

The git exile push step pushes the moved files to the associated secondary repository. You don't necessarily have to do this first; it's just a good idea in case someone grabs your linking objects before you get a chance to do it. (That someone could even be you. The work-tree files are still there, but if you invoke your smudge filter on the index entry that has only the link, it will look for the moved files.)
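
The ordering concern can be shown with another toy model. Everything here (the two dictionaries, exile_push, smudge_elsewhere, the key abc123) is hypothetical and only mimics the idea; it is not Git-Exile's code.

```python
local_store = {"abc123": b"...real file bytes..."}  # filled in by the clean filter
remote_store = {}                                   # shared location others read from


def exile_push():
    """Model of the push step: copy locally stored content to the shared store."""
    remote_store.update(local_store)


def smudge_elsewhere(key):
    """What a colleague's smudge filter would do after fetching your commit."""
    if key not in remote_store:
        raise LookupError(f"content for {key} has not been pushed yet")
    return remote_store[key]


try:
    smudge_elsewhere("abc123")  # the link was shared before the push step
except LookupError as err:
    print(err)

exile_push()
print(smudge_elsewhere("abc123") == local_store["abc123"])  # True
```

Pushing the real content before the commit becomes visible to anyone else closes the window in which a link exists without the file behind it.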

Summary

Now we can see how this is right in terms of idea, but wrong in terms of execution:

I assume that git exile push takes big files, copies their content to a location that is suited for holding larger files, and then replaces the content of the original files with a link to the copies. So, in other words, the content of the files will be replaced by a link to the copy of their content.

The replacement actually happens at git add time! The replacement of the link-only version, in the other direction, happens during git checkout.

torek