
Cloning git@github.com:erocarrera/pydot (commit 35a8d858b) on Debian with git config core.autocrlf input leaves git status reporting these files as modified right away:

modified:   test/graphs/b545.dot
modified:   test/graphs/b993.dot
modified:   test/graphs/cairo.dot

These files have CRLF line endings, for example:

$ file test/graphs/cairo.dot
test/graphs/cairo.dot: UTF-8 Unicode text, with CRLF line terminators

The .gitattributes file contains:

*.py eol=lf
*.dot eol=lf
*.txt eol=lf
*.md eol=lf
*.yml eol=lf

*.png binary
*.ps binary

Changing core.autocrlf has no effect on the status of these files, and neither does deleting .gitattributes. Converting the files with dos2unix does not change their status (as expected), and converting them back with unix2dos yields files that diff reports as identical to an older copy. File permissions look unchanged in ls -lsa. As far as I can tell with vi -b, the files have uniform line endings, so this is not a case of unix2dos or dos2unix turning mixed line endings into uniform ones, which could otherwise have explained this strange behavior. I'm using git version 2.11.0.

What does git think has changed?

Somewhat relevant:

  1. Git status shows files as changed even though contents are the same
  2. Files showing as modified directly after git clone
  3. Cloning a git repo, and it already has a dirty working directory... Whaaaaa?

I searched several discussions but didn't find an answer that explains this behavior. This issue arose from pydot #163.

In more detail:

git status

On branch master
Your branch is up-to-date with 'origin/master'.
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

    modified:   test/graphs/b545.dot
    modified:   test/graphs/b993.dot
    modified:   test/graphs/cairo.dot

no changes added to commit (use "git add" and/or "git commit -a")

git diff test/graphs/b993.dot

warning: CRLF will be replaced by LF in test/graphs/b993.dot.
The file will have its original line endings in your working directory.
diff --git a/test/graphs/b993.dot b/test/graphs/b993.dot
index e87e112..8aa0872 100644
--- a/test/graphs/b993.dot
+++ b/test/graphs/b993.dot
@@ -1,10 +1,10 @@
-diGraph G{
-graph [charset="utf8"]
-1[label="Umlaut"];
-2[label="ü"];
-3[label="ä"];
-4[label="ö"];
-1->2;
-1->3;
-1->4;
-}
+diGraph G{
+graph [charset="utf8"]
+1[label="Umlaut"];
+2[label="ü"];
+3[label="ä"];
+4[label="ö"];
+1->2;
+1->3;
+1->4;
+}

UPDATE:

Out of curiosity, I committed one of these files, dumped git log -1 -p > diff, and vi -b diff shows that git normalized the line endings:

commit 2021d6adc1bc8978fa08d729b3f4d565f9b89651
Author:
Date:

    DRAFT: experiment to see what changed

diff --git a/test/graphs/b545.dot b/test/graphs/b545.dot
index ebd3e8f..2c33f91 100644
--- a/test/graphs/b545.dot
+++ b/test/graphs/b545.dot
@@ -1,9 +1,9 @@
-digraph g {^M
-^M
-"N11" ^M
-  [^M
-  shape = record^M
-  label = "<p0>WFSt|1571       as Ref: 1338    D"^M
-]^M
-N11ne -> N11:p0^M
-}^M
+digraph g {
+
+"N11" 
+  [
+  shape = record
+  label = "<p0>WFSt|1571       as Ref: 1338    D"
+]
+N11ne -> N11:p0
+}

Other weird observations: running git checkout on any of these files after cloning has no effect. After the above commit, the file b545.dot continued to have CRLF line endings in the working directory. Applying dos2unix followed by unix2dos no longer made git think that the file had changed (whereas before the commit it did, probably because the committed file had CRLF line endings).

0 _

3 Answers


This happens precisely because those files are committed with CRLF endings, yet the .gitattributes file says to commit them with LF-only endings.

Git can and will do CRLF-vs-LF-only conversion in two places:

  • During extraction from index to work-tree. A file stored in a commit or in the index is always assumed to be in a "clean" state, but when extracting that file from the index, to the work-tree, Git should apply any conversions directed by .gitattributes in the form of "change LF-only to CRLF", for instance, and also in the form of what Git calls smudge filters.

  • During the copy of a file from work-tree back to index. A file stored in the work-tree is in the "smudged" state, so at this point, Git should apply any "cleaning" conversions: for instance, change CR-LF to LF-only, and apply clean filters.

Note that there are two points at which these conversions can occur. This does not mean that they will occur at both points, just that these are the two possible places. As the .gitattributes documentation notes, the actual conversions are:

  • eol=lf: none on index -> work-tree; CR-LF to LF-only on work-tree -> index
  • eol=crlf: LF-only to CR-LF on index -> work-tree; CR-LF to LF-only on work-tree -> index (setting eol also marks the path as text, so it is normalized on the way in)
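
To make the eol=lf case concrete, here is a minimal Python sketch of the two conversion points for a path marked eol=lf. It is only a model of the behavior described above, not Git's code; the function names and the sample bytes are made up for illustration.

```python
def to_index_eol_lf(worktree_bytes: bytes) -> bytes:
    """Work-tree -> index under eol=lf: every CR-LF pair becomes a bare LF."""
    return worktree_bytes.replace(b"\r\n", b"\n")


def to_worktree_eol_lf(index_bytes: bytes) -> bytes:
    """Index -> work-tree under eol=lf: no conversion at all."""
    return index_bytes


# Hypothetical file contents, once with CRLF and once with LF line endings.
crlf_text = b"digraph g {\r\n}\r\n"
lf_text = b"digraph g {\n}\n"

added_from_crlf = to_index_eol_lf(crlf_text)    # carriage returns are stripped
added_from_lf = to_index_eol_lf(lf_text)        # already LF-only, unchanged
checked_out = to_worktree_eol_lf(crlf_text)     # CRLF blob is written out as-is

print(added_from_crlf == lf_text)   # True
print(added_from_lf == lf_text)     # True
print(checked_out == crlf_text)     # True
```

The asymmetry is the point: a CRLF blob passes through checkout untouched, but the same bytes do not survive the trip back into the index.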

Now, a file that's actually in the repository, stored in a commit, is purely read-only. It can never change inside that commit. More precisely, the commit identifies (by hash ID) a tree that identifies (by hash ID) a blob that has whatever contents it has. These hash IDs are themselves cryptographic checksums of the object contents, so they are naturally all read-only: if we try to change the contents, what we get is instead a new, different object with a new, different hash ID.
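
As an illustration of why changing the line endings necessarily yields a different object, the following sketch computes a blob ID the way Git does for a repository using the default SHA-1 object format (an assumption here): the hash covers a short header plus the raw bytes. The helper name and sample contents are hypothetical.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 blob ID: hash of b'blob <size>\\0' followed by the content."""
    header = b"blob %d\x00" % len(content)
    return hashlib.sha1(header + content).hexdigest()


# Hypothetical file contents that differ only in their line endings.
crlf_blob = b"digraph g {\r\n}\r\n"
lf_blob = crlf_blob.replace(b"\r\n", b"\n")

print(git_blob_id(b""))                                  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (the empty blob)
print(git_blob_id(crlf_blob) == git_blob_id(lf_blob))    # False
```

This is consistent with the `index e87e112..8aa0872` line in the question's diff: the same text with different line endings is a different blob.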

Because git checkout actually works by copying the raw hash IDs from the commit's tree(s) to the index, the versions of files stored in the index are necessarily identical to those stored in the commit.

Hence, if somehow—regardless of the how—the committed files are in a form that disagrees with what .gitattributes directs Git to do, the files will become "dirty" in the work-tree regardless of the fact that you haven't done anything to them! If you were to git add the three files in question, that would copy them from work-tree to index, and hence delete the carriage-returns from their line endings. Hence they are, in git status terms, modified but not yet staged for commit.

Stripping out the carriage returns in the work-tree versions leaves them in the same state: they're modified with respect to what's in the index, because git add will now leave their LF-only line endings unchanged, producing blobs that differ from the CR-LF versions currently in the index.
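
The status check described here can be sketched as a comparison between what git add would store and what the index already holds. This is a simplified model under stated assumptions: it ignores the time-stamp cache, it covers only eol=lf, and the names and sample bytes are invented for the example.

```python
def would_be_staged(worktree_bytes: bytes) -> bytes:
    """What `git add` would store for an eol=lf path: CR-LF pairs become LF."""
    return worktree_bytes.replace(b"\r\n", b"\n")


def is_modified(index_blob: bytes, worktree_bytes: bytes) -> bool:
    """A file counts as modified when re-adding it would change the index entry."""
    return would_be_staged(worktree_bytes) != index_blob


# Hypothetical blob that was committed with CRLF line endings.
index_blob = b"1->2;\r\n1->3;\r\n"

as_cloned = index_blob                                # checkout does not convert
after_dos2unix = as_cloned.replace(b"\r\n", b"\n")    # carriage returns stripped by hand

print(is_modified(index_blob, as_cloned))        # True
print(is_modified(index_blob, after_dos2unix))   # True
```

Both work-tree variants would be staged as the same LF-only content, which differs from the CRLF blob in the index, so neither dos2unix nor unix2dos can make the file look clean.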

A more interesting question is: How did they get into the commit(s) in the wrong state? This is not something we can answer: only those who made those commits can produce that answer. We can only speculate. One way to achieve this is to add and commit the files without a .gitattributes in effect, then to set the .gitattributes into effect without git add-ing the files again. This way, the CR-LF endings get into someone's index and hence get into that user's commits, even though the .gitattributes file now says (but did not earlier say) that any new git add should strip away the carriage returns.

torek
  • To answer how this happened: indeed the test files were committed [in 2010](https://github.com/erocarrera/pydot/tree/2b3f0885530ecfa8391c65583b2153a64dbc3e58) in absence of a `.gitattributes`, and I added a `.gitattributes` [in 2016](https://github.com/erocarrera/pydot/commit/b1aef4db0af2543c8987e9985f8c4d2a1b455c0b). I was likely aware of the files having CRLF line endings, but at that point did not have any reason to not preserve the repository. – 0 _ Nov 26 '17 at 04:16
  • Interestingly, (I think) `git` did not show them as changed in the working directory of the repo where I first added `.gitattributes`, until they were touched by `unix2dos` (with their content remaining unchanged). I suspect that this may have been due to `git` somehow rescanning them, now taking `.gitattributes` into account (in that case I first cloned, then added `.gitattributes`, so perhaps `git` did not "reconsider" those untouched files in light of the newly added `.gitattributes`). – 0 _ Nov 26 '17 at 04:18
  • 1
    Yes: there is a complicated dance that Git does with file modification time stamps to avoid an expensive comparison. This is the "cache" aspect of the index. Touching the files updates their time stamps, invalidating the cache, leading to comparing the work-tree file data with what *would* be saved in the index if the files were added, leading to the "modified" status. – torek Nov 26 '17 at 06:47

Changing core.autocrlf has no effect on the status of these files

It should, but only after cloning again:

git config --global core.autocrlf false

git clone git@github.com:erocarrera/pydot pydot2
cd pydot2
git status

That deactivates core.autocrlf globally, but this is just for testing here.

VonC
  • Good point (I was forgetting to pass `--global`), although after the above, `git` still says that the files have changed (I did confirm that `core.autocrlf` changed in my `~/.gitconfig`). – 0 _ Nov 25 '17 at 23:39
  • 1
    @IoannisFilippidis sure, but that setting needs to be followed by a git clone, to see its effect in the new cloned repo. – VonC Nov 25 '17 at 23:41
  • Indeed, I did clone again after changing the configuration, but the files are shown as changed. – 0 _ Nov 25 '17 at 23:43
  • I have updated the question with the output of `git diff` and `git status`. – 0 _ Nov 25 '17 at 23:48
  • @IoannisFilippidis Then adding a .gitattributes directive for *.dot, as suggested in https://github.com/erocarrera/pydot/issues/163#issuecomment-346963988, seems the right course of action. – VonC Nov 25 '17 at 23:51
  • Actually, that `.gitattributes` directive for `*.dot` already exists in the commit I am checking out when cloning: https://github.com/erocarrera/pydot/blob/master/.gitattributes (the "would be" in that comment was conjecturing about what was happening on Travis CI, which turned out to be checking out CRLF). – 0 _ Nov 25 '17 at 23:54
  • @IoannisFilippidis so it appears this (CRLF) is intentional. Instead of trying to change them, can you try and resolve https://github.com/erocarrera/pydot/issues/163#issuecomment-346966843, namely re-transform those files, in order for your build to succeed? – VonC Nov 26 '17 at 00:03
  • The tests of `pydot` pass with both LF and CRLF. I do not know what motivated https://github.com/erocarrera/pydot/issues/164, and no specific error has yet been reported there. I was not the committer of those files, so not sure whether the line endings were intentional, but I think so. Nevertheless, the present issue is orthogonal, in that it is strange for `git` to report changes where there aren't any. My conjecture is that `git` warns it will normalize. I added an experiment to the question. – 0 _ Nov 26 '17 at 00:20

Thanks to @torek for the explanation (which agrees with my conjecture).

In summary, the asymmetric git configuration leads to commit(checkout(Index)) not being the identity mapping. With CRLF in the index, this particular configuration checked out CRLF, but after the input transformations in effect (eol=lf), git would commit LF instead of CRLF.
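
A small sketch of this round trip, assuming only the eol=lf rules discussed above (checkout leaves the bytes alone, commit turns CR-LF into LF). The function names mirror the notation commit(checkout(Index)) and are not Git APIs; the sample bytes are hypothetical.

```python
def checkout(index_blob: bytes) -> bytes:
    """Index -> working directory under eol=lf: bytes are written unchanged."""
    return index_blob


def commit(worktree_bytes: bytes) -> bytes:
    """Working directory -> index under eol=lf: CR-LF pairs become LF."""
    return worktree_bytes.replace(b"\r\n", b"\n")


def round_trip(index_blob: bytes) -> bytes:
    return commit(checkout(index_blob))


# Hypothetical blob committed with CRLF before .gitattributes existed.
crlf_blob = b"N11ne -> N11:p0\r\n}\r\n"
lf_blob = round_trip(crlf_blob)

# With CRLF in the index the round trip is not the identity ...
print(round_trip(crlf_blob) == crlf_blob)   # False
# ... but once LF has been committed, the blob is a fixed point,
print(round_trip(lf_blob) == lf_blob)       # True
# even if the working copy is converted back to CRLF with unix2dos.
print(commit(lf_blob.replace(b"\n", b"\r\n")) == lf_blob)   # True
```

The last line matches the observation in the question's update: after the experimental commit, dos2unix followed by unix2dos no longer made the file show up as changed.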

The root cause of this confusion was comparing the:

  • file I see in the working directory, with the
  • committed file.

This doesn't show whether the file has changed. What one should compare is what git will commit after applying the input transformations with what is already committed. Clearly, if those two items differ, then the file has changed.

Following this reasoning, one could declare the repository "unstable", in that it regards itself as modified in absence of interaction with the world. This supports avoiding this state by changing the committed files to LF, or changing the .gitattributes (I prefer committing LF).

In this situation, git would commit LF whether the working directory holds LF or CRLF, so dos2unix and unix2dos have no effect on what would be committed, and therefore none on the file's status.

0 _
  • Nice feedback, more complete than my answer. +1 – VonC Nov 26 '17 at 08:54
  • Yet another reason why the repository cannot remain in this state: one cannot rebase while there are unstaged changes for commit. – 0 _ Dec 25 '17 at 13:56