When Git core.autocrlf is set to input, what do we really have?

Question

I have several shell scripts in a project. On Windows, every time I convert one to LF and commit, the Git client turns the local file back into CRLF, so I believe the local files and the Git repository both hold the same CRLF version. Then I changed core.autocrlf to input and committed again: what is in the local files, and what is in the Git repository? I am asking because of what I have observed. When core.autocrlf is not configured:

Change the CRLF Windows script to LF. git status shows that these, and only these, files are modified.
Run `git add .`: Git gives a warning, and the local files are in CRLF again. But git status shows that the local branch is up to date with the remote branch.

Then I have core.autocrlf configured to input:

Change the CRLF Windows script to LF. git status shows that these, and only these, files are modified.
Run `git add .`: there is no warning, and the local files are still in LF. But git status shows that the local branch is up to date with the remote branch.

So the question is: in both cases the local branch is up to date with the remote branch. WHAT IS IN THE REMOTE BRANCH, LF or CRLF?

core.autocrlf is not configured:

core.autocrlf = input:

TLDR; `git config --global core.autocrlf false`: https://stackoverflow.com/a/1250133/6309 — VonC, Mar 27 '19 at 08:16

score 3 · Answer 1 · answered Mar 27 '19 at 06:59

(You can tell what's in the commits, but not the way you're going about it. You'll have to look directly into the commits, using low level tools. In general—but not always—what's in the commits is LF-only.)

You're mixing together some concepts that you need to keep separate. These concepts are commits, which is what Git is really for, and the work-tree and the index, which is how you go about having Git make commits. I'm going to go through all of these pretty fast, because we have to have a lot of shared terminology and understanding before we can get into the details of how CRLF vs LF-only line endings really work.

Commits, branches like `master`, and remote-tracking names like `origin/master`

Remember that Git is all about commits. Each commit has its own unique hash ID. That hash ID is, in effect, the true name of the commit. The commit itself represents a permanent and immutable¹ snapshot of a set of files, along with some metadata, such as the name and email address of whoever made the commit, the reason they made it (their log message), and the raw hash ID of the commit's parent commit.

Because each commit records the hash ID of its parent, we can, from any commit, work backwards to its parent. We say that this commit points to its parent. We can draw this situation. If we let a single uppercase letter stand in for a real hash ID (because real hash IDs are too big and ugly for humans to remember and use), we can draw a small simple three-commit repository like this:

A <-B <-C

Here commit C is the last commit we made. It records the hash ID of its parent commit B, so that C points to B. That allows Git to use the hash ID to find the actual commit B itself, and B contains the hash ID of—or points to—commit A. That allows Git to extract A. A is a special case: it's the very first commit, so it has no parent. This lets Git stop working backwards from commit to commit.

Note, though, that we need to save the actual hash ID of C somewhere. We don't need to save the hash ID of B because C is saving it for us, but we have to find C. Actual hash IDs seem random (even though they're not) so we have to write the hash ID of C somewhere. We could jot it down on paper, or on a whiteboard, but that's silly: why not have Git save it for us? So that's just what we do. That's what a branch name is: it's a place to save one (1) hash ID.

When we save C's hash ID in the name master, we say that master points to C:

A <-B <-C   <-- master (HEAD)
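The idea that a branch name stores one hash ID, and that each commit stores its parent's hash ID, can be checked with a small sketch. It assumes a throwaway repository in a temporary directory; the identity values and the empty commits named A, B and C are placeholders for illustration only.

```shell
# Build a throwaway repository with three empty commits: A, B, C.
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master      # use the branch name from the text
git config user.name "Example User"          # placeholder identity
git config user.email "user@example.com"
for msg in A B C; do
    git commit -q --allow-empty -m "$msg"
done

# A branch name resolves to exactly one hash ID: the tip commit C.
branch_tip=$(git rev-parse master)
head_tip=$(git rev-parse HEAD)
echo "master -> $branch_tip"

# C records its parent B; the root commit A records no parent at all.
parent_subject=$(git log -1 --format=%s master~1)
root_parents=$(git log -1 --format=%P master~2)
echo "parent of C: $parent_subject; parents of A: '$root_parents'"
```

The real hash IDs will differ on every run, because the commit timestamps are part of what gets hashed.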

We can share these commits with another Git. Our Git and their Git will always use the same hash IDs (see footnote 1), so they have the exact same three commits. But they have their own branch names. Their master is theirs. At the moment, theirs also points to (shared) commit C:

A--B--C   <-- master (HEAD) [in their Git]

Our Git calls up their Git and has a conversation. Our Git and their Git realize we both have the same three commits. Then our Git reads their name master and saves it in our own Git repository, but changes it so that it doesn't interfere with our master:

A--B--C   <-- master (HEAD), origin/master

Now let's make a new commit in our own repository. The new commit gets some big ugly hash ID, which is unique to our new commit; we'll call this D. The special thing about branch names is that when we make a new commit while on some branch, Git writes the new commit's hash ID into the branch name, so that the branch name automatically points to the new commit:

A--B--C   <-- origin/master
       \
        D   <-- master (HEAD)

(This HEAD that I'm drawing in is how Git knows which branch name to update. As long as we only have one branch, we don't really need it, but as soon as we have more than one branch, we will need it.)

Now suppose that someone controlling the other Git repository adds a new commit to their master. This new commit will have a different hash ID from every other commit, so we'll call it E. Their master will now point to their E:

A--B--C--E   <-- master (HEAD) [in their Git]

Now we'll have our Git call up their Git and obtain any commits they have that we don't—which in this case is just commit E—and update our origin/master, which our Git is using to remember their master, to point to E:

A--B--C--E   <-- origin/master
       \
        D   <-- master (HEAD)

Let's make two more commits in our own repository now and call them F and G:

A--B--C--E   <-- origin/master
       \
        D--F--G   <-- master (HEAD)

When git status tells you that your branch is ahead 3, this is what it means: we have three commits on our master that they don't have on their master (that we're remembering as our origin/master). When git status tells you that your branch is behind 1, this is what it means too: they have one commit on their master (our origin/master) that we don't have on our master.

This is all that git status means by ahead or behind: that we have commits that they don't, or vice versa, or both.
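To reproduce the ahead 3, behind 1 situation drawn above, here is a sketch that assumes two throwaway repositories named theirs and ours (hypothetical names) in a temporary directory. It counts the commits on each side with git rev-list, which is one way to get at the same numbers that git status reports.

```shell
work=$(mktemp -d)
cd "$work"

# "Their" repository: commits A, B, C on master.
git init -q theirs
cd theirs
git symbolic-ref HEAD refs/heads/master
git config user.name "Them"; git config user.email "them@example.com"
for msg in A B C; do git commit -q --allow-empty -m "$msg"; done

# "Our" repository: a clone, so our origin/master remembers their master.
cd "$work"
git clone -q theirs ours
cd ours
git config user.name "Us"; git config user.email "us@example.com"

# They add E; we add D, F, G; then we fetch to update origin/master.
git -C ../theirs commit -q --allow-empty -m E
for msg in D F G; do git commit -q --allow-empty -m "$msg"; done
git fetch -q origin

# Left count: commits only on master. Right count: only on origin/master.
counts=$(git rev-list --left-right --count master...origin/master)
ahead=$(echo "$counts" | cut -f1)
behind=$(echo "$counts" | cut -f2)
echo "ahead $ahead, behind $behind"
```

Nothing here looks at file contents or line endings. The counts depend only on which commits each name can reach.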

¹Commits can, in some cases, be forgotten, and eventually they will go away and that hash ID will no longer have any meaning. But until they do go away, the commit is effectively permanent. It's entirely immutable, for the simple reason that the hash ID is a cryptographic checksum of the contents of that commit. If you attempt to change anything, even a single bit, what you get is a new, different commit with a different hash ID. The original commit remains unchanged. So all commits are quite literally immutable.

The index and the work-tree

Commits are immutable. They're frozen forever in time: the snapshots inside each commit can never be changed, not one bit. They're also stored in a special compressed Git-only form, sort of freeze-dried as it were, so as to take less space. That's fine for archiving—it lets you go back and see what you had yesterday, or last week, or whenever—but it's of no use at all in getting any new work done. If you can't change any files, what good is Git? Moreover, if they're all Git-only, how will you ever use them?

Of course, Git lets you make new commits—but to make new commits, you still need to change some files. Well, that, or remove some, or add some new ones, or any combination of these. So Git has to have a way to let you take an existing commit and rehydrate it, getting all its files out into useful form where you can see them and work on them.

The place where you can see and work on your files is the work-tree. When you run git checkout master, you're telling Git: Get all the files out of the commit to which the name master points. (This also attaches HEAD to the name master, so that Git knows which name to update when you make the new commits.) The extracted files go into your work-tree, where you can see them, use them, change them, and so on.

Git could stop here, and other systems do stop here. The current commit and the work-tree are all you really need. But Git doesn't quite stop here. Instead, in between the current commit, which is read-only and has freeze-dried Git-only files in it, and the work-tree, Git inserts a sort of halfway point that Git calls, variously, the index, or the staging area, or the cache. All three names mean the same thing. Which name gets used depends on who or which part of Git is doing the calling.

What's in the index is, at least initially, all the files from the commit. That is, Git effectively copies the freeze-dried files from the commit, to the index, before copying them on to your work-tree. Then it rehydrates the files, copying from the index to the work-tree.

If you have modified the work-tree copy of a file, you must copy it back into the index in order to commit the result. You do this with git add, which dehydrates (compresses and Git-ifies) that file and overwrites the previous index copy. When you later run git commit, Git takes whatever is in the index at that time and puts that into the new commit.

Again, this is all critically important: Git extracts any existing commit into the index and builds a new commit from the index. Git does not build the commit from what's in your work-tree: the work-tree is for you, not for Git. The committed copies of files are in the special Git-only format: freeze-dried, as it were. The index copies of files are also in this special Git-only format. (This is what makes git commit so fast: it doesn't have to freeze-dry every file; every file is already freeze-dried, ready to go!) The work-tree copies ... well, this is where CRLF and LF-only line endings come in!
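The three copies (commit, index, work-tree) can be told apart by their blob hash IDs. The following is a rough sketch, assuming a throwaway repository, a hypothetical file A.txt, and core.autocrlf set to false so that line-ending conversion stays out of the picture for now.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"          # placeholder identity
git config user.email "user@example.com"
git config core.autocrlf false               # no line-ending conversion here

printf 'echo one\n' > A.txt
git add A.txt
git commit -q -m "first version"

# Right after a commit, the index copy and the HEAD copy are the same blob.
head_blob=$(git rev-parse HEAD:A.txt)
index_blob=$(git rev-parse :A.txt)

# Editing the work-tree file changes neither of them...
printf 'echo two\n' > A.txt
index_after_edit=$(git rev-parse :A.txt)

# ...until git add copies the file into the index.
git add A.txt
index_after_add=$(git rev-parse :A.txt)
worktree_blob=$(git hash-object A.txt)       # hash the work-tree file would get

echo "HEAD:  $head_blob"
echo "index: $index_after_add"
echo "work:  $worktree_blob"
```

Comparing `git rev-parse :A.txt` against `git hash-object A.txt` is also a way to check whether the index copy matches the local file without relying on git status. When given a file path, git hash-object applies the same conversions git add would, unless `--no-filters` is passed.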

We finally get to talk about line endings

Because internal (committed and index) files are in a different format, Git has an opportunity to make special changes. Whenever Git is copying a file from the index to the work-tree, Git can replace the LF-only line endings that Linux prefers with the CRLF line endings that Windows prefers. Whenever Git is copying a file from the work-tree to the index, it can do the reverse. This is precisely how it all works. Nothing happens to any committed file. Nothing can happen to such a file, because commits are immutable. But by changing the conversion settings, you can make what goes into the index, or what comes out of the index, be or look different from what you get to see and work with in your work-tree.

Telling Git: File A.txt should have CRLF endings in the work-tree tells Git to change LF-only to CRLF on the way out of the index, and CRLF to LF-only on the way from the work-tree into the index. So when git checkout copies the file to the work-tree (from the index), LF becomes CRLF, and when git add copies the file from the work-tree (to the index), CRLF becomes LF.
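One way to ask for this behavior is core.autocrlf set to true. The sketch below assumes that setting, a throwaway repository, and a hypothetical two-line A.txt; it commits an LF-only file, then forces Git to re-extract it and counts carriage returns on both sides.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"          # placeholder identity
git config user.email "user@example.com"

# Commit an LF-only file with no conversion in effect.
git config core.autocrlf false
printf 'line one\nline two\n' > A.txt
git add A.txt
git commit -q -m "LF-only file"

# Ask for CRLF in the work-tree, then make Git copy index -> work-tree again.
git config core.autocrlf true
rm A.txt
git checkout -q -- A.txt

# Count carriage returns in the work-tree copy and in the index copy.
worktree_cr=$(tr -cd '\r' < A.txt | wc -c | tr -d ' ')
index_cr=$(git cat-file -p :A.txt | tr -cd '\r' | wc -c | tr -d ' ')
echo "work-tree CRs: $worktree_cr, index CRs: $index_cr"
```

The index copy is untouched by the checkout. Only the bytes written into the work-tree change.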

You can tell Git: Don't change A.txt when copying from index to work-tree, but when copying from work-tree to index, do replace CRLF with LF-only. This is the mode called input. When git checkout does the index -> work-tree conversion, it doesn't do anything special, but when git add does the work-tree -> index conversion, it replaces CRLF with LF-only.
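The input mode can be tried out the same way. This is a minimal sketch, assuming core.autocrlf is set to input in a throwaway repository and that a hypothetical A.txt starts out with CRLF endings in the work-tree. Git may print a warning on standard error during git add.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"          # placeholder identity
git config user.email "user@example.com"
git config core.autocrlf input

# A CRLF file in the work-tree, as a Windows editor might leave it.
printf 'line one\r\nline two\r\n' > A.txt
git add A.txt                                # work-tree -> index: CRLF becomes LF
git commit -q -m "added with autocrlf=input"

before_cr=$(tr -cd '\r' < A.txt | wc -c | tr -d ' ')
index_cr=$(git cat-file -p :A.txt | tr -cd '\r' | wc -c | tr -d ' ')
head_cr=$(git cat-file -p HEAD:A.txt | tr -cd '\r' | wc -c | tr -d ' ')

# index -> work-tree: input mode does no conversion, so LF stays LF.
rm A.txt
git checkout -q -- A.txt
after_cr=$(tr -cd '\r' < A.txt | wc -c | tr -d ' ')

echo "work-tree before: $before_cr, index: $index_cr, HEAD: $head_cr, work-tree after: $after_cr"
```

Note that git add does not rewrite the work-tree file. It keeps its CRLF endings until something else, such as a fresh checkout, replaces it.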

There's a hitch

There is one big problem with this technique. It does work, and that really is how Git does things. But Git was originally built for Linux, where you never want any of this fiddling. Your files are all just data; Git has no business changing them; and Git was designed to work this way. The part of git status that tells you:

Changes not staged for commit

works by comparing what's in the index and what's in the work-tree. If you're having Git fuss with line endings, those copies won't match up. Git has to pretend that they do match up, as long as it's Git that did the line-ending fiddling, and that's still the only actual difference.

Hence, git status deliberately lies. If Git made the index and work-tree different due to line-ending settings, git status will try to tell you that the index and work-tree are the same. This automatic lying does not work in every case. In particular, if you change the conversion settings, Git may, or may not, notice.² If you change other things—including some of the system time data of the files—Git will think that the files are changed.

In this case, you're seeing the latter effect. You have touched the files in some way, so that Git doesn't just lie and say they are the same. Then you run:

git add .

Git carefully copies the work-tree files back into the index, doing the CRLF-to-LF-only conversion if required. The result is a freeze-dried index copy that matches the HEAD copy. Git now updates the cached system data (stat data as in footnote 2) in the index, so that git status knows to print the correct lies, or—if the work-tree copies really are LF-only now—the truth: that the HEAD copy, the index copy, and the work-tree copy of the file all match.
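The following sketch shows git status reporting a clean state while the bytes differ. It assumes core.autocrlf set to true, a throwaway repository, a hypothetical A.txt, and a Git recent enough to have `git ls-files --eol` (added in Git 2.8, as far as I know).

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"          # placeholder identity
git config user.email "user@example.com"
git config core.autocrlf true

# CRLF in the work-tree; git add stores LF-only in the index.
printf 'line one\r\nline two\r\n' > A.txt
git add A.txt
git commit -q -m "stored as LF"

# Status has nothing to report even though the two copies differ byte for byte.
status_output=$(git status --porcelain)
worktree_cr=$(tr -cd '\r' < A.txt | wc -c | tr -d ' ')
index_cr=$(git cat-file -p :A.txt | tr -cd '\r' | wc -c | tr -d ' ')

# ls-files --eol reports the line endings of the index (i/) and work-tree (w/) copies.
eol_info=$(git ls-files --eol A.txt)
echo "status: '$status_output'"
echo "$eol_info"
```

The `i/lf` and `w/crlf` fields in the last line of output are the honest answer that git status is smoothing over.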

²The details depend on the internal details of the index, in its cache aspect: it saves the stat data from the file in the index, and if the stat data is unchanged since the last index-update, Git assumes the file is unchanged from the way Git set it up.

How can you see what's really in the commit?

There are several ways to see the original data unmolested by any LF-to-CRLF transformation. The most direct is to use git cat-file -p, which will pretty-print the internal storage form of a file (or of an index freeze-dried file for that matter). For instance:

git cat-file -p HEAD:A.txt

extracts what's really in A.txt in the current commit.
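The same command can look at the index copy (`:A.txt`) or at what a remote-tracking name points to (`origin/master:A.txt`). Here is a sketch, assuming two throwaway repositories with the hypothetical names upstream and local, and a one-line A.txt that was added under core.autocrlf set to input.

```shell
work=$(mktemp -d)
cd "$work"

# An upstream repository whose commit was made from a CRLF work-tree file.
git init -q upstream
cd upstream
git symbolic-ref HEAD refs/heads/master
git config user.name "Example User"; git config user.email "user@example.com"
git config core.autocrlf input
printf 'echo hi\r\n' > A.txt
git add A.txt
git commit -q -m "add script"

# A clone, so that origin/master exists as a remote-tracking name.
cd "$work"
git clone -q upstream local
cd local

# Count carriage returns in the committed, indexed, and remote-tracking copies.
head_cr=$(git cat-file -p HEAD:A.txt | tr -cd '\r' | wc -c | tr -d ' ')
index_cr=$(git cat-file -p :A.txt | tr -cd '\r' | wc -c | tr -d ' ')
remote_cr=$(git cat-file -p origin/master:A.txt | tr -cd '\r' | wc -c | tr -d ' ')
echo "HEAD: $head_cr, index: $index_cr, origin/master: $remote_cr"
```

Counting bytes with tr and wc sidesteps any terminal or editor that might hide the carriage returns.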

Note, however, that even your own computer's programs that transcribe this data into a window, so that you can see it, may modify the data. (In a similar vein, on a Linux system, using vim on a file with CRLF line endings hides the fact that it has CRLF endings from the Linux user. You won't see them, but they'll still be there when you write the file out again!)

You may need a special viewing program that deliberately doesn't make the data "more user friendly", but instead makes it programmer-friendly. For instance, Linux has hexdump -C:

$ echo foo | hexdump -C
00000000  66 6f 6f 0a                                       |foo.|
00000004

Running the output of git cat-file -p on a Git internal blob (blobs are how Git freeze-dries files) through hexdump -C can be useful here. What the Windows equivalent of hexdump -C might be, I have no idea.
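As a sketch of that pipeline, the example below uses `od -c` instead of hexdump -C, on the assumption that od is available wherever a POSIX-style shell is (including Git Bash on Windows, I believe). The file names are hypothetical, and core.autocrlf is set to false so that the committed bytes are exactly the bytes written.

```shell
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name "Example User"          # placeholder identity
git config user.email "user@example.com"
git config core.autocrlf false               # commit the bytes exactly as written

printf 'foo\r\n' > crlf.txt
printf 'foo\n' > lf.txt
git add crlf.txt lf.txt
git commit -q -m "one CRLF file, one LF file"

# Dump the committed blobs character by character; \r shows up explicitly.
crlf_dump=$(git cat-file -p HEAD:crlf.txt | od -An -c)
lf_dump=$(git cat-file -p HEAD:lf.txt | od -An -c)
echo "crlf.txt:$crlf_dump"
echo "lf.txt:  $lf_dump"
```

With conversion turned off, the CRLF file is committed with its carriage returns intact, which is one of the "but not always" cases mentioned at the top.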

I was thinking of a way to see a file's md5 in the index, so I could compare it with that of the local file (work-tree). — Tiina, Mar 28 '19 at 08:30