Git - max depth in gitignore/exclude

Question

A short story: I have recently made a clean install of Arch Linux on my PC because my old install got very bloated with unnecessary packages and config directories. Now I want to keep my home directory clean and simple. I decided to use git to supervise every file and folder there but I can't just exclude every log(or any other constantly updating dir/file) as it is too much of a hassle.
The idea is to include only the first level of files and directories in $HOME/, $HOME/.config/, and $HOME/.local/share/. For instance, include .config/foo/ and exclude its contents i.e. .config/foo/* so I could check the git log when I uninstall a package what directory(es) did it create and remove them manually(of course, if I won't use it anymore)

I tried to accomplish this by adding this to my .git/info/exclude

*/*
*/*/*
*/*/*/*
*/*/*/*/*
.local/share/*/*
.local/share/*/*/*
.local/share/*/*/*/*
.local/share/*/*/*/*/*
.config/*/*
.config/*/*/*
.config/*/*/*/*
.config/*/*/*/*/*

because I read that git needs a separate wildcard for every directory level. As you probably have already understood - it didn't work.

So, the question is - how can I monitor only the files and directories in $HOME/, $HOME/.config/, and $HOME/.local/share/ without monitoring their contents. Thanks!

torek · Accepted Answer · 2018-01-04T01:54:18.357

TL;DR

What you'll want is to use .gitignore to specifically ignore certain files and subdirectories:

*/
!.config
!.config/*
.config/*/
!.local
!.local/*
.local/*/

To see how this works, and what it does (and doesn't do) for you, read the long version. (The !.config/* is almost certainly unnecessary; I put it in when I had * as part of not saving any top level files, which isn't quite what you asked for. The same holds for !.local/*. Without actually testing it, though, I'm not sure if .config/afile matches the .config rule.)

(But note that you probably do want to source-control additional .config files. I also recommend doing this an entirely different way, using symlinks for the .foorc type files—that's what I do.)

Long

There isn't any maximum depth, other than any system-imposed maximum (which varies depending on your OS). But there's a big problem here: Git doesn't store directories.¹

What Git does store, underneath its top level storage item which is the commit, are files (which Git calls blobs), with associated path names. If you ask Git to extract commit #1234567..., Git looks inside that commit, finds the path names of the various blobs, and creates directories (new, empty ones) if and when necessary to hold the specific blobs (i.e., files) that Git is extracting from that commit with the names they have as stored in that commit.

This doesn't mean that your idea is doomed, just that you're starting with a misconception. Git won't save the directory .config at all, for instance. It will just save the file .config/Trolltech.conf, for instance. If Git has saved that file in some commit, and you git checkout that specific commit, Git will create a new, empty .config if required. If the directory already exists, Git won't do anything about that. In some cases, such as moving from a commit in which that file exists to one in which it does not, Git will remove the directory as well, but in some cases it won't, and you will need to use git clean -d to make Git really remove it (if that's possible, i.e., if it's empty).

Having saved that particular file, if Git is being instructed to ignore the subdirectory .config/git, Git may not save the file .config/git/ignore. This is where things get complicated. You need to understand how Git commits work, what the index is and how (to some extent) it works, and what Git does to work with, and maintain, a work-tree.

¹Git does store tree entries, which could work as a flag by which to save empty directories, but other parts of Git combine in strange ways to make this whole concept fail.

Git is built around the concept of commits

As we noted above, what Git stores, fundamentally, is the commit. A commit is a complete, mostly-standalone snapshot of some set of files, which Git calls blobs. (This deliberately ignores submodules and symbolic links, but they're stored as blobs as well, using tree entries of a type that distinguishes them from plain files.) I say "mostly-standalone" because each commit records some number of parent commit hash IDs, though most commonly, just one. A commit that stores three parent hash IDs depends on those three parent commits' existence: a repository that's missing the three parents is somehow incomplete.² The parent linkage is not important for this particular application, but it's good to know how this works.

There is, though, one particularly difficult event in the life a commit: creating it. Once a commit is created, it is read-only. It has a unique hash ID, determined solely by the commit's content (including all its parent hash IDs). But what files go into a commit? This is the key question and is where .gitignore eventually comes into the picture.

²This is the essence of a shallow clone. A clone that is not shallow (and hence is complete) starts with the tip commits of each branch (and any tagged commits or annotated tag objects). These commits (or annotated tag objects) point back to earlier, ancestor, commits through their parent hash IDs. Since the repository is complete, those objects exist as well; they contain their parent hash IDs, and those commit objects exist, and so on. The whole process stops only when we reach some commit(s) that have no parent. Usually this is the first commit ever made, which obviously can't have a parent. Such a commit is called a root commit, and in any non-empty but complete repository, there is always at least one root commit.

The files in a new commit are set up in the index

Besides the repository itself—the repository being a database of Git objects, i.e., commits and blobs and the intermediate thing Git calls a tree (these store the files' names, among other data)—Git has this key data structure with three different names. It's variously called the index, the staging area, and the cache.

The index is normally pretty much invisible. There is one Git command, git ls-files, that can show you the contents of the index directly (git ls-files --stage, or even more verbosely, git ls-files --debug), but it's not really useful to end users. A good top-level description of the index, though, is that it's where you build your next commit.

When you run git commit, Git takes every file that is currently in the index, in whatever form it currently has in the index, and makes a new commit out of that. Those are the files stored in the new commit. The new commit's author and committer are you; the time stamp is "now"; and the parent of the new commit is whatever commit you had checked out before; but the files—the blobs and their associated names—are entirely set by whatever is in the index.³ Likewise, when you use git checkout to extract some particular commit, what Git does first is to copy that commit's files into the index.

Note that when you do make a new commit, that new commit becomes the current commit. When that happens, Git updates the current branch name—the branch you have checked out, such as master—so that it records the new commit. In fact, each branch name records just one hash ID. Git calls this the tip of the branch. As we saw in footnote 2 above, Git works backwards, starting from branch tips, to find all the commits contained within a branch. So making a new commit shoves the new commit's hash ID into the branch name table.

³Even if you use git commit -a or git commit <file>, Git really just copies files into the index—or sometimes, an (auxiliary) index—and builds the commit from that index.

The work-tree

All the files stored inside Git, both in the repository and in the index, are in a special, Git-only format. Few if any other programs on the computer can work with these files, so Git extracts each file into a usable version, where you can do work. This is your work-tree.

In general, every file that's in the current commit also appears in the work-tree. The current commit is, of course, the one you ran git checkout on. If you just ran git checkout master to check out the master branch, what you did in terms of current commit was to check out whatever commit the name master identifies: the tip commit of that branch.

As we mentioned above, all the files (blob objects) got copied into the index, at that point. Git was also able to use whatever was in the index to know what was in your work-tree before that point: for any file that was in the index (and hence in the work-tree) and now isn't in the index because of this checkout, Git should remove that file from the work-tree. And it does! For any file that Git has to replace in the index, or add to the index, Git should copy the index version to the work-tree—and it does.

What's in the index after the git checkout is exactly whatever blobs are (via any intermediate tree objects) in the commit you checked out. The work-tree versions of those files will match the index versions of those files, except that the work-tree versions are actually usable. The index versions of those files will match the commit's versions of those files—and in fact, they share the underlying storage, as the index stores just the path names and blob hash IDs.

Now, there may be files in the work-tree that Git doesn't know about. These files are, by definition, not in the index. These are untracked files. That is what an untracked file is, in Git: it's a file that's not in the index. There is nothing more to it.

(Well, you can remove a file from the index. Then it's not in the index, and hence untracked. That's not really anything more, but it's worth remembering.)

Ignoring untracked files

The problem with untracked files is that Git whines about them. :-) It's constantly griping at you, telling you that files A, B, and C are untracked. So this is where .gitignore comes in—but .gitignore is about the work-tree, and unlike commits, the work-tree does have directories.

You can list specific files in .gitignore. If those files are not in the index (are untracked), but are in the work-tree, Git would complain about them ... but then it sees that they're listed in .gitignore and shuts up.

You can also git add files en-masse, using git add . or git add --all. This has Git scan the work-tree for files, and upon finding them, git add each one to the index, to copy the work-tree version into the builds-the-next-commit index version. Clearly, if files A, B, and C are currently both untracked and ignored, though, Git shouldn't add them. So .gitignore also tells Git not to add existing untracked-and-ignored files to the index.

Existing files that are in the index are automatically tracked, so any en-masse git add that might potentially add those files, will add them, regardless of what's listed in .gitignore. In other words, adding a tracked file to .gitignore has no effect on it. Being in .gitignore only affects untracked files.

But that's files, not directories. This is where everything gets squirrelly. Files exist inside directories, in the normal file system (i.e., not in Git, but in the work-tree).

One of the big reasons Git has the index (and calls it the cache) is that looking at every file in a big file-tree tends to be extremely slow. Git can use the index to record information about all the tracked files, including information that speeds up en-masse git add --all style operations. That's fine for files that are in the index, but what about for whole subdirectories that (a) aren't in the index, so by definition they're untracked and (b) will be ignored, so they won't go into the index and will remain untracked?

Git can avoid scanning those subdirectories entirely. If .config/dir/ is going to be ignored, and Git has just come across the name .config/dir and it's a directory, why then, Git can just skip reading inside it. That's a lot faster than reading it and checking every file to see if it should be ignored.

When Git is scanning the work-tree, it starts at the top and reads the whole contents of the tree: all file names and all sub-directory names. It knows which are files and which are sub-directories, but it has not yet looked inside any of the subdirectories.

Now, Git checks all the files: are they in the index? If so, they're tracked: see if they should be updated. If not, they're untracked: see if Git should whine about them.

Next, Git checks all the sub-directories. For each sub-directory: are there any files for it that are in the index? If so, the sub-directory must be examined. But if not, is the sub-directory ignored? If so, don't even look inside it. Otherwise, look inside it, just as we would if there were files in the index.

Now, for each file or sub-directory, there can be one or more .gitignore entries. An entry ending with * matches files and directories. An entry ending with */ matches directories. An entry starting with ! means: explictly not ignored.

So, suppose Git is scanning the top level and comes across the name .a, and it's a file. Git will look for any ignore entry matching .a. If there's an entry */, well, that doesn't match .a; so .a is added, unless there's a later entry overriding it. There isn't, so we add the file .a.

Next, Git encounters .adir, which is a directory. There are no .adir files in the index, so a scan isn't forced, so Git will check for an ignore entry matching .adir. Since */ is the only match, Git gets to ignore the directory. It will now not look inside .adir at all (unless and until you somehow add .adir/file to the index, which forces Git to read .adir to check whether .adir/file still exists).

When Git comes across .config (which is a directory), there's a */ that says to ignore it, but it's overridden by !.config which says not to ignore it. There's a .config/* but this is just .config-the-directory, not .config/something. So !.config is the last applicable entry, and Git must scan .config.

Sooner or later,⁴ Git will look inside .config. It may find .config/afile; this matches !.config/*. The last entry that it matches tells Git that the file isn't ignored, so it will be added to the index. Then Git comes across .config/git, which is a directory. It matches !.config/*, then .config/*/; so it gets ignored. Git never looks inside .config/git at all.

This repeats for the rest of .config. There may be more .-files, which Git will process as usual, until Git comes across .local, which works just like .config here.

As always, remember that this cannot affect any existing commits. Checking out any existing commit that has some file that violates the .gitignore rules here will cause Git to extract that file, creating its parent directory or directories if needed. Moving from that commit to one that lacks that same file, Git will remove the file, and if the directory containing it goes empty, usually⁵ remove the directory as well.

⁴This is where depth-first vs breadth-first scan comes in. Git currently does ASCII-sorted, depth-first directory traversal (so it's actually "right now") because of the way Git organizes the index. It doesn't matter from our "what gets ignored and what doesn't" perspective, though.

⁵Every once in a while I see weird behavior here that convinces me that there must be some bugs in this. The occasional git clean -ndf to see what would be cleaned, perhaps followed by git clean -df to actually do the cleaning, is useful. But I can never reproduce it, and it's never important enough to try... :-)

Wow, that was very informative and enlightening. Can't believe you wrote all this just to answer my question(even if you didn't and just copied it over, I'm still very grateful to you). Thank you so much! — Gray K, Jan 04 '18 at 02:29