Git status: Show certain files as not modified so that people do not add/commit them

Question

Scenario/goal:

we use a 3rd party visual programming tool for some tasks
this tool modifies a lot of files even if we do not make any regular change (e.g. it refreshes a timestamp attribute inside the file when you open a file)
we do not want to commit files where only the timestamps attributes inside the xml files are changed
we use both Windows and Linux as development environments and different tools to handle the interaction with the git repository

Idea:

We have already written a small diff tool which can decide if the change was relevant or not relevant for us
We know how to configure "git diff" so that it uses our diff tool (via .gitattributes)

Problem:

is it possible to manipulate "git status" so that it uses the difftool and does not show files as modified - ideally independent of the operating system and the used git client/UI?

"*we do not want to commit files where only the timestamps are changed*" [If modification date is changed `git` does not decide the file is modified — it checks content](https://stackoverflow.com/a/60578554/7976758). — phd, Mar 16 '20 at 13:32
The timestamps are xml attributes inside the files. I updated the question. — herrjeh42, Mar 16 '20 at 13:52

torek · Answer 1 · 2020-03-16T19:11:12.783

There are several methods you can try here, but clean filters are probably the way to go.

Using a clean filter, you can strip out, or replace with fixed constants, or whatever you like, these "noise" timestamps.

Clean and smudge filters operate when files are copied between the index and your work-tree. In particular, the clean filters apply to every copy that moves from the work-tree to the index. The current or HEAD commit is available at all times if you wish to extract information from it (see the detailed explanation for the exception to this rule).

Essentially, what your clean filter would do is un-do the change made by your badly-behaved tool.
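As a rough sketch of such a filter: the script below replaces every timestamp attribute value with a fixed constant. The attribute name lastModified, the constant, and the --clean flag are all assumptions made up for illustration; the real tool's files will need a different pattern. It works on bytes so that line endings and encodings pass through untouched on both Windows and Linux.

```python
#!/usr/bin/env python3
"""Clean filter sketch: replace volatile timestamp attributes with a constant."""
import re
import sys

# Assumption: the tool writes attributes such as lastModified="2020-03-16T13:32:00".
# The attribute name is hypothetical; adjust the pattern to the real files.
TIMESTAMP_ATTR = re.compile(rb'(\blastModified=")[^"]*(")')
FIXED_VALUE = b"1970-01-01T00:00:00"


def clean(data: bytes) -> bytes:
    """Return data with every timestamp attribute value set to FIXED_VALUE."""
    return TIMESTAMP_ATTR.sub(lambda m: m.group(1) + FIXED_VALUE + m.group(2), data)


# Git pipes the work-tree content to stdin and stores what comes back on stdout.
# The hypothetical --clean flag only keeps the script from waiting on stdin
# when it is imported or started by accident.
if __name__ == "__main__" and sys.argv[1:] == ["--clean"]:
    sys.stdout.buffer.write(clean(sys.stdin.buffer.read()))
```

A regular expression is the simplest thing that could work here. If the XML is irregular, reusing the parsing logic from your existing diff tool is probably the safer choice.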

Background

Note that every commit contains a full and complete snapshot of all of your files. Commits are not changesets, they are snapshots. What you're asking for, in other words, is to commit something other than the files in your work-tree. This is always possible, because Git builds new commits from the index, not from your work-tree.

You have a rather ill-mannered tool, that changes some bytes inside each file every time you view the file. Let's take this out of the picture for a moment and just look at Git's commit and git checkout mechanism. Imagine you have a very small, newly-created repository with just three commits in it. These three commits will have some big ugly random-looking hash IDs, but for simplicity, we'll call them A, B, and C. For concreteness, we'll say that commit A has just a README.md file in it, and your files first go in at commit B; your files consist of f1 and f2, and what's in f2 is different in commits B and C.

Here's a drawing of the state in this tiny repository with the three commits and one branch named master:

A <-B <-C   <--master

The branch name master contains the actual hash ID of commit C.

Commit C contains some metadata, such as your name and the date-and-time-stamp for when you made commit C. The metadata also include your log message (from git commit -m or your editor, however you supplied it). Crucially, the metadata also contain the actual hash ID of earlier commit B.

Commit B contains metadata too: your name, when you made it, a log message, and the hash ID of commit A.

Commit A, being the very first commit, simply omits mention of any earlier commit. There isn't any earlier commit!

We say that the name master points to C, C points to B, B points to A, and A doesn't point anywhere.
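If it helps, here is a toy model of that pointing, not Git's real storage: single letters stand in for hash IDs, and the Commit class and history function are names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Commit:
    name: str               # stands in for the real hash ID
    parent: Optional[str]   # hash ID of the earlier commit, None for the first one


commits: Dict[str, Commit] = {
    "A": Commit("A", None),
    "B": Commit("B", "A"),
    "C": Commit("C", "B"),
}
branches = {"master": "C"}  # a branch name holds one commit's hash ID


def history(branch: str) -> List[str]:
    """Follow parent pointers from the branch tip back to the first commit."""
    names: List[str] = []
    current: Optional[str] = branches[branch]
    while current is not None:
        names.append(current)
        current = commits[current].parent
    return names


print(history("master"))  # ['C', 'B', 'A']
```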

We've said that A has just a README.md file in it. The file inside commit A is stored in a special, read-only, Git-only, frozen and compressed format. I like to call these frozen snapshots the freeze-dried versions of files.¹

Commit B shares its copy of README.md with commit A, since it's unchanged. Commit B has two more files in it, though: f1 and f2, your files.

Finally, commit C shares its copy of README.md with A and B, and shares its copy of f1 with B. Once Git has a frozen snapshot of some data, Git can just keep re-using the old snapshot. That's what new commits do: they all refer to frozen snapshots, and if the frozen snapshot is one that has never appeared in any other commit, well, that frozen snapshot is new to the repository now, but it can and will be shared later.

Just for completeness of the example, let's make a new branch dev right now, also pointing to existing commit C. We'll get this graph:

A--B--C   <-- master, dev

Both names identify existing commit C. We now need one more thing in our picture, which is a way to know which branch name we're using. Both identify commit C, so whether we git checkout master or git checkout dev we'll get commit C, but that's only true right now. So we'll run git checkout dev to have Git attach the special name HEAD to one branch, like this:

A--B--C   <-- master, dev (HEAD)

Now we'll make a new commit that we call D, by creating or modifying some file, running git add on that file, and running git commit. Let's say we update f1 this time and run git add f1 and git commit to get D:

A--B--C   <-- master
       \
        D   <-- dev (HEAD)

The name dev now identifies commit D. Commit D points back to existing commit C—the one that you had out just a moment ago—and has for its snapshots, frozen copies of README.md, f1, and f2. The frozen README.md is still the same as in all earlier commits, and the frozen f2 is shared with C, while the frozen f1 is new to the repository—it doesn't match that in B or C.

Note that these frozen copies of files in commits are great for archival, but useless for getting any actual work done, because they literally can't be changed. So Git has to extract them somewhere, for you to work on. That somewhere is your work-tree, which contains ordinary files in ordinary (non-Git) format. These are files you can see and work with.

¹In a technical sense, it's not even inside commit A, to make it easier to share. Commit A simply refers to the frozen snapshot. The commit itself refers to a Git tree object that holds the file's name, and then the tree object refers to a Git blob object that holds the snapshot, in the freeze-dried format.
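The sharing falls out of how blob objects are named. The sketch below computes a blob's ID the way Git does in a repository that uses SHA-1, the default object format: identical content always yields the identical ID, so there is only one object to refer to. The function name blob_id is made up for this example.

```python
import hashlib


def blob_id(content: bytes) -> str:
    """Compute the SHA-1 object ID Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


readme_in_a = b"hello\n"
readme_in_b = b"hello\n"

# Same bytes, same ID: commits A and B can refer to one stored blob.
print(blob_id(readme_in_a) == blob_id(readme_in_b))  # True
```

This is also why a file in which only a timestamp attribute changed counts as a different file: one changed byte gives a different ID.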

The frozen snapshots come from the index, not the work-tree

The reason for all of the above background is to illustrate where these frozen snapshots come from. When you run git commit, Git builds a new commit, but it does so from copies of your files that are in Git's index. You cannot see the index contents, not directly anyway,² but git status uses them to describe things.

At all times, Git has three copies of each file available, or more precisely, up to three copies. One is the frozen copy in the current or HEAD commit—the commit you extracted with git checkout. One is the regular file you see and work with. But between those two, Git keeps a third copy, in Git's index. (The index is also called the staging area, but I will stick with the term index here. See also the technical note in footnote 2.)

When you run git commit, Git doesn't look at your work-tree. Instead, Git just takes all the files that are in the index (which are already stored in the special read-only, Git-only, freeze-dried format) and puts those into the new commit it makes. This means that if you have changed the work-tree copy in some way, and want the updated one in the new snapshot, you must first copy the work-tree copy into the index. This replaces the old index copy with a new one.

That's what git add does: git add means copy the work-tree file into the index. That's why you have to git add a file, even if it's not new. You have to update the index copy.

When you run git checkout <name>, Git turns the branch name into a commit hash, to find the actual commit. Then, if that's not the current commit, Git has to switch from whatever the current commit is, to that commit. Git must remove files from the index and your work-tree as needed—if the current commit has f3 for instance and the target doesn't—and copy the target commit's files into the index and your work-tree.

Hence, when you first switch to some branch, or first check out some branch, you get one particular commit—the one selected by the branch name—as the HEAD commit. That commit's freeze-dried files go into the index, so that the index matches that commit. Those files get rehydrated to make your work-tree, so that you can see and work with those files. All three copies of each file now match.³

When you run git status, Git does two separate git diff operations:

The first one compares the HEAD commit to the index. If the files here match, git status says nothing. If they differ, git status prints the names of the differing files, calling these staged for commit.
The second one compares the index to the work-tree. If the files here match, git status says nothing. If they differ, git status prints the names of the differing files, calling these not staged for commit.
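To make the two comparisons concrete, here is a toy model with plain dictionaries standing in for the three copies. It is a simplification under stated assumptions: real Git compares object IDs and cached file metadata rather than raw bytes, and it reports untracked files separately, which this sketch ignores by only looking at names that are in the index.

```python
from typing import Dict, List, Tuple

Files = Dict[str, bytes]


def status(head: Files, index: Files, work_tree: Files) -> Tuple[List[str], List[str]]:
    """Return (staged, not_staged) file names, mimicking the two comparisons."""
    # First comparison: HEAD commit versus index.
    staged = sorted(n for n in head.keys() | index.keys() if head.get(n) != index.get(n))
    # Second comparison: index versus work-tree, tracked files only.
    not_staged = sorted(n for n in index if index[n] != work_tree.get(n))
    return staged, not_staged


head = {"README.md": b"readme\n", "f1": b"one\n"}
index = dict(head)        # after checkout, the index matches HEAD
work_tree = dict(head)    # and the work-tree matches the index

work_tree["f1"] = b"one, edited\n"
print(status(head, index, work_tree))   # ([], ['f1'])

index["f1"] = work_tree["f1"]           # what `git add f1` does
print(status(head, index, work_tree))   # (['f1'], [])
```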

If all this makes sense so far, you're now ready to understand—and write—clean and smudge filters.

²You can use git ls-files, and in particular git ls-files --stage, to dump out a summary of what's in the index. If you do, you'll see that, like commits, the index actually stores just file names and references to Git's blob objects.

However, even though the index is storing a reference rather than an actual copy, you can think about it as a copy. It functions the same way, it's just that if you make a blob you end up not using in the end, Git has to garbage-collect it later. This is not normally a problem: Git generates garbage objects on its own, all the time, and cleans them up on its own.

³There are some rules to let them not all match on every git checkout: see Checkout another branch when there are uncommitted changes on the current branch.

Clean and smudge filters

Given the pictures above, we now see that files move from the HEAD commit to the index and then to the work-tree:

  HEAD          index         work-tree
---------      ---------      ---------
README.md  ->  README.md  ->  README.md

That happens on git checkout and also on git reset --hard, for instance. In Git 2.23 and later, git restore can do this as well.

For speed and other reasons, the index copy is always in the freeze-dried Git-only format. The work-tree copy is an ordinary, everyday file, in ordinary, everyday format. So the process that copies from Git-ified format to everyday work-tree file must do a bunch of work to de-compress ("rehydrate") the file. This can change the file. What if Git let you insert your own "change the file" operations here?

Meanwhile, git add does this:

  HEAD          index         work-tree
---------      ---------      ---------
README.md      README.md  <-  README.md

This replaces the index copy with a whole new README.md. The git add command must compress and Git-ify the file—freeze-dry it, as it were. This can change what gets added. What if Git let you insert your own changes here?

This is precisely what clean and smudge filters are. You get to insert your own filtering, in either direction:

A smudge filter takes a compressed, Git-ified file (that Git has just de-compressed and de-Git-ified) that's on its way to the work-tree. You can make whatever changes you want to that file's data.
A clean filter takes a work-tree file—or rather, its content—that's on its way to the index. You can make whatever changes you want to that data.

Hence, in your clean filter, you could strip XML attributes, or replace them, programmatically.
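Here is a small illustration of why that helps with the original problem, assuming a hypothetical lastModified attribute. Once the index copy was produced by the clean filter, a work-tree copy that differs only in the timestamp cleans to the very same bytes, so adding it changes nothing. A real edit still shows up.

```python
import re

STAMP = re.compile(rb'(\blastModified=")[^"]*(")')  # hypothetical attribute name


def clean(data: bytes) -> bytes:
    """Replace every timestamp value with a fixed constant."""
    return STAMP.sub(lambda m: m.group(1) + b"1970-01-01T00:00:00" + m.group(2), data)


def would_change_index(index_copy: bytes, work_tree_copy: bytes) -> bool:
    """True if adding the work-tree copy through the clean filter alters the index copy."""
    return clean(work_tree_copy) != index_copy


committed = clean(b'<step name="load" lastModified="2020-03-01T09:00:00"/>\n')
reopened = b'<step name="load" lastModified="2020-03-16T13:32:00"/>\n'
edited = b'<step name="fetch" lastModified="2020-03-16T13:32:00"/>\n'

print(would_change_index(committed, reopened))  # False
print(would_change_index(committed, edited))    # True
```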

The catch here is that you must write this code yourself. You've already done something very similar for your text-diff, though. The clean and smudge filters work the same way as the text-diff filters; they're just used in a different portion of the Git pipelines. You set up these filters with .gitattributes and .git/config, just as you did with the text-diff ones.
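For the setup itself, something like the following helper could be run once per clone. The filter name stripstamp, the script name strip_stamp.py with its --clean flag, and the *.xml pattern are placeholders chosen for this sketch. The filter=<name> attribute and the filter.<name>.clean configuration key are the standard Git mechanism.

```python
import subprocess
from pathlib import Path
from typing import List

FILTER_NAME = "stripstamp"                        # hypothetical filter name
CLEAN_COMMAND = "python3 strip_stamp.py --clean"  # hypothetical script and flag


def attributes_line(pattern: str) -> str:
    """Line for .gitattributes that routes matching paths through the filter."""
    return f"{pattern} filter={FILTER_NAME}"


def config_command() -> List[str]:
    """argv that registers the clean command in the repository's .git/config."""
    return ["git", "config", f"filter.{FILTER_NAME}.clean", CLEAN_COMMAND]


def install(repo: Path, pattern: str = "*.xml") -> None:
    """Append the attribute line and register the filter for one repository."""
    with open(repo / ".gitattributes", "a", encoding="utf-8") as fh:
        fh.write(attributes_line(pattern) + "\n")
    subprocess.run(config_command(), cwd=repo, check=True)
```

The .gitattributes file can be committed, so everyone gets it. The .git/config entry is not part of any commit, so every clone, on Windows or Linux, has to register the command itself, for instance by calling install(Path(".")) once.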

Note that git status may sometimes be fooled a bit. It becomes hard to tell whether a file is modified just by looking at date-and-time-stamps. Running git checkout or git add will, if Git thinks the files are modified, force the data to pass through the various filters and update cache information stored in the index, after which Git will again assume that the work-tree file "matches" the index copy despite some difference that was produced or eliminated by a smudge or clean filter.

Special cases to consider

If you want to set certain XML data in the index copy to match that in the HEAD copy, you'll need to extract the HEAD copy of the file. This means your clean and smudge filters will need to coordinate, because extracting the HEAD copy will run its content through the smudge filter.
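Assuming the HEAD copy has already been extracted somehow, the carrying-over step could look like the sketch below. The lastModified attribute name is again hypothetical, and matching timestamps by their position in the file is a simplification that only holds while no such attributes are added or removed.

```python
import re
from typing import List

STAMP = re.compile(rb'(\blastModified=")([^"]*)(")')  # hypothetical attribute name


def restore_stamps(head_copy: bytes, work_tree_copy: bytes) -> bytes:
    """Put the HEAD copy's timestamp values into the work-tree copy, by position."""
    old_values: List[bytes] = [m.group(2) for m in STAMP.finditer(head_copy)]
    new_count = len(STAMP.findall(work_tree_copy))
    if new_count != len(old_values):
        # Attributes were added or removed, so positions no longer line up;
        # leave the data alone and let the change show.
        return work_tree_copy
    values = iter(old_values)
    return STAMP.sub(lambda m: m.group(1) + next(values) + m.group(3), work_tree_copy)
```

Compared with writing a fixed constant, this keeps the committed timestamps plausible, at the cost of needing the HEAD copy on every run.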

Renamed files get problematic here. You can find the name of the file using %f but that's one name. If the file has been renamed, what was the old name?

One remaining slightly-sticky situation occurs in a new, totally-empty repository. Here, there is no HEAD commit yet because there are no commits at all. That's easily avoided: don't set up filters until there is an initial commit, or if there is no HEAD commit, have your filters do nothing. Note that this same situation occurs when using git checkout --orphan: that puts your Git into a state in which HEAD holds the name of a branch that does not yet exist, so attempting to resolve HEAD to a commit hash ID, or extract a file from the commit named by HEAD, will fail. You could handle this the same way, or just forbid the use of git checkout --orphan.
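One way a filter might test for this, sketched with a made-up function name: ask git rev-parse --verify --quiet HEAD and treat any failure as "no HEAD commit". Treating a missing git executable the same way is a choice made for this sketch, on the theory that a filter which cannot look anything up should do nothing.

```python
import subprocess


def head_commit_exists(repo_dir: str = ".") -> bool:
    """True if HEAD resolves; False in an empty repository or on an orphan branch."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--verify", "--quiet", "HEAD"],
            cwd=repo_dir,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except OSError:
        # git could not be started at all; behave as if there were no HEAD commit.
        return False
    return result.returncode == 0
```

A filter would call this first and pass its input through unchanged when it returns False.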