I am currently on macOS.

In Git 2.35.1, when I cloned my repository, it took 7 seconds to enumerate the untracked files, and running time git status took approximately 2 seconds. When I checked out another branch, it took approximately 15 seconds, and when I checked out my main branch again, git status took 15 seconds (it should not take this long).

The workaround for this in 2.35.1 was to set core.untrackedCache=true and GIT_FORCE_UNTRACKED_CACHE=1, as suggested in most Stack Overflow answers (see the linked Stack Overflow question). This updated the untracked cache and brought git status down to approximately 4 seconds.
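
For reference, the workaround described above can be sketched as a small shell helper. The function name enable_untracked_cache and its path argument are hypothetical; only core.untrackedCache and GIT_FORCE_UNTRACKED_CACHE come from the text.

```shell
# Hypothetical helper: apply the 2.35.1 workaround to the repository at "$1"
# (defaults to the current directory).
enable_untracked_cache() {
    repo=${1:-.}
    # Ask Git to maintain the untracked cache for this repository.
    git -C "$repo" config core.untrackedCache true
    # With the variable set, git status flags the index as changed whenever
    # the untracked cache was updated, so the cache gets written out.
    GIT_FORCE_UNTRACKED_CACHE=1 git -C "$repo" status >/dev/null
}
```

After calling it once, a plain time git status in the same repository shows whether the cache is being used.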

But now in Git 2.36.1, this workaround doesn't seem to work. It takes approximately 20 seconds on all branches.

Possible changes in the code:

In Git 2.35.1, code in dir.c:

if (dir->untracked) {
    static int force_untracked_cache = -1;

    if (force_untracked_cache < 0)
        force_untracked_cache =
            git_env_bool("GIT_FORCE_UNTRACKED_CACHE", 0);
    if (force_untracked_cache &&
        dir->untracked == istate->untracked &&
        (dir->untracked->dir_opened ||
         dir->untracked->gitignore_invalidated ||
         dir->untracked->dir_invalidated))
        istate->cache_changed |= UNTRACKED_CHANGED;
    if (dir->untracked != istate->untracked) {
        FREE_AND_NULL(dir->untracked);
    }
}

and the corresponding code in Git 2.36.1, also in dir.c:

if (dir->untracked) {
    static int force_untracked_cache = -1;

    if (force_untracked_cache < 0)
        force_untracked_cache =
            git_env_bool("GIT_FORCE_UNTRACKED_CACHE", -1);
    if (force_untracked_cache < 0)
        force_untracked_cache = (istate->repo->settings.core_untracked_cache == UNTRACKED_CACHE_WRITE);
    if (force_untracked_cache &&
        dir->untracked == istate->untracked &&
        (dir->untracked->dir_opened ||
         dir->untracked->gitignore_invalidated ||
         dir->untracked->dir_invalidated))
        istate->cache_changed |= UNTRACKED_CHANGED;
    if (dir->untracked != istate->untracked) {
        FREE_AND_NULL(dir->untracked);
    }
}

Edit 1:

GIT_TRACE_PERFORMANCE=1 git status

12:44:54.433726 read-cache.c:2437       performance: 0.092473000 s: read cache .git/index
12:44:54.915510 read-cache.c:2480       performance: 0.481510000 s: read cache .git/sharedindex.f6119c27ffbee28b22e1baa47e66f355491292e
12:45:05.369546 preload-index.c:154     performance: 10.374954000 s: preload index
Refresh index: 100% (1164397/1164397), done.
12:45:05.421952 read-cache.c:1721       performance: 10.427363000 s: refresh index
12:45:05.464869 diff-lib.c:266          performance: 0.040042000 s:  diff-files
12:45:05.478549 unpack-trees.c:1884     performance: 0.000028000 s: traverse_trees
12:45:05.493406 unpack-trees.c:424      performance: 0.000008000 s:check_updates
12:45:05.493444 unpack-trees.c:1974     performance: 0.028052000 s: unpack_trees
12:45:05.493454 diff-lib.c:629          performance: 0.028099000 s:  diff-index
On branch default

Your branch is up to date with 'origin/default'.

When I switch to another branch and come back to the default branch, the performance is as shown below. I am not sure why reading the shared index (read-cache.c) takes this much time!

GIT_TRACE_PERFORMANCE=1 git status
12:22:24.343325 read-cache.c:2437       performance: 0.112630000 s: read cache .git/index
12:22:42.618493 read-cache.c:2480       performance: 18.274836000 s:read cache .git/sharedindex.5ad8766e997830f32884b42ca5b17c2be6a19f1
12:22:53.559907 preload-index.c:154     performance: 10.840555000 s: preload index
Refresh index: 100% (1164397/1164397), done.
12:22:53.646110 read-cache.c:1721       performance: 10.926760000 s: refresh index
12:22:53.685650 diff-lib.c:266          performance: 0.038002000 s:  diff-files
12:22:53.713422 unpack-trees.c:1884     performance: 0.000042000 s: traverse_trees
12:22:53.726052 unpack-trees.c:424      performance: 0.000008000 s: check_updates
12:22:53.726085 unpack-trees.c:1974     performance: 0.028672000 s:unpack_trees
12:22:53.726094 diff-lib.c:629          performance: 0.039895000 s:  diff-index
12:23:03.568051 read-cache.c:3121       performance: 0.161937000 s: write index, changed mask = c
On branch default

Your branch is up to date with 'origin/default'.

You are in a sparse checkout with  tracked files present.
Changes not staged for commit:
 Modified:
 Modified:
….
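
To compare the two runs above more easily, the trace lines can be sorted by duration. The following is a minimal sketch: the function name slowest_phases is hypothetical, and it assumes the "performance: <seconds> s: <label>" line layout shown in the output above.

```shell
# Hypothetical helper: read GIT_TRACE_PERFORMANCE output on stdin and print
# the N slowest phases (default 3) as "<seconds><TAB><label>".
slowest_phases() {
    awk '/performance: / {
        # Everything after the marker looks like "<seconds> s: <label>".
        split($0, parts, "performance: ")
        secs = parts[2];  sub(/ s:.*/, "", secs)
        label = parts[2]; sub(/^[^:]*: */, "", label)
        printf "%s\t%s\n", secs, label
    }' | LC_ALL=C sort -rn | head -n "${1:-3}"
}
```

With the value 1 the trace is written to stderr, so a typical invocation would be GIT_TRACE_PERFORMANCE=1 git status 2>&1 >/dev/null | slowest_phases 3.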

Edit 2:

I did some research and found that the .git/sharedindex.* file is created when I set core.splitIndex=true, and reading this shared index is what takes the time. Does this have anything to do with the performance problem?

How can I solve this untracked files cache performance issue? Is there any workaround?

tom
checked
  • Are you able to reproduce this reliably? I tried but didn't have any luck. It would be ideal to have exact steps (starting with a minimal ~/.gitconfig and cloning some public repo, e.g. [Git](https://github.com/git/git.git) itself, or the [Linux kernel](https://github.com/torvalds/linux.git) if a huge repo is needed). Good job identifying "read cache .git/sharedindex", maybe a Git expert could work it out from there. (Note: GIT_FORCE_UNTRACKEDCACHE should be GIT_FORCE_UNTRACKED_CACHE, I fixed it in your question.) – tom Jun 17 '22 at 19:05

3 Answers

That change comes from commit 26b8946, which I presented in "How do I get rid of the warning "Untracked cache is disabled on this system."".
It fixed the setting core.untrackedCache which, when set to true, failed to add the untracked cache extension to the index.

In your case, automatically adding the untracked cache extension to the index may be the issue.

See commit 26b8946 (17 Feb 2022) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 80f7f61, 25 Feb 2022)

dir: force untracked cache with core.untrackedCache

Signed-off-by: Derrick Stolee

The GIT_FORCE_UNTRACKED_CACHE environment variable writes the untracked cache more frequently than the core.untrackedCache config variable.
This is due to how read_directory() handles the creation of an untracked cache.

Before this change, Git would not create the untracked cache extension for an index that did not already have one.
Users would need to run a command such as 'git update-index --untracked-cache'(man) before the index would actually contain an untracked cache.

In particular, users noticed that the untracked cache would not appear even with core.untrackedCache=true.
Some users reported setting GIT_FORCE_UNTRACKED_CACHE=1 in their engineering system environment to ensure the untracked cache would be created.

The decision to not write the untracked cache without an environment variable tracks back to fc9ecbe ("dir.c: don't flag the index as dirty for changes to the untracked cache", 2018-02-05, Git v2.17.0-rc0 -- merge listed in batch #8).
The motivation of that change is that writing the index is expensive, and if the untracked cache is the only thing that needs to be written, then it is more expensive than the benefit of the cache.
However, this also means that the untracked cache never gets populated, so the user who enabled it via config does not actually get the extension until running 'git update-index --untracked-cache' manually or using the environment variable.

We have had a version of this change in the microsoft/git fork for a few major releases now.
It has been working well to get users into a good state.
Yes, that first index write is slow, but the remaining index writes are much faster than they would be without this change.

So instead of setting GIT_FORCE_UNTRACKED_CACHE to 1, unset it (keep core.untrackedCache set to true) and try running git update-index --untracked-cache manually just before a git status or git switch (which replaces git checkout for switching branches).
Then test whether the performance is acceptable (again, this is just a test, not a definitive workaround).
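
A minimal sketch of that test sequence, assuming it is run against the affected repository. The function name try_manual_untracked_cache and its path argument are hypothetical.

```shell
# Hypothetical helper: run the suggested test in the repository at "$1"
# (defaults to the current directory).
try_manual_untracked_cache() {
    repo=${1:-.}
    # Stop forcing the cache through the environment.
    unset GIT_FORCE_UNTRACKED_CACHE
    # Keep the config setting enabled.
    git -C "$repo" config core.untrackedCache true
    # Add the untracked cache extension to the index explicitly.
    git -C "$repo" update-index --untracked-cache
    # Run the command whose timing is under test.
    git -C "$repo" status
}
```

Timing the steps separately (for example with time git status afterwards) shows whether the first index write or the later runs dominate.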

tom
VonC
  • I tried this but it didn't help. The regression is the same. At the start, when I clone the repo, it takes 7 seconds to enumerate the untracked files and 2 seconds for git status. But when I go back and forth between branches, it takes 10 seconds to enumerate the untracked files and 20 seconds for git status. – checked Jun 10 '22 at 19:33
  • Note: there is a large diff between the files of the two branches, and there are also files with case-sensitive names. – checked Jun 10 '22 at 20:08
  • @checked In both Git versions, you can try setting the [`TRACE2` env vars](https://stackoverflow.com/a/38285866/6309), like `export GIT_TRACE2_PERFORMANCE=1`: that will help pinpoint what takes time. – VonC Jun 10 '22 at 20:56
  • Typo: should be `GIT_TRACE2_PERF` – tom Jun 17 '22 at 19:14

(This isn't a solution, just some debugging suggestions.)

  • You can use GIT_TRACE2_PERF=1 in addition to GIT_TRACE_PERFORMANCE=1 to get more info.

  • On Linux, strace -c <command> outputs system call statistics including the total number of syscalls, which is a useful metric (wall clock time can be problematic because it is influenced by disk caching etc.). And strace <command> displays every syscall individually, which allows you to compare execution traces across runs (I like to filter out memory addresses from the traces using sed 's/0x[0-9a-fA-F]*/0x?/g' because the addresses are different each run and create a lot of noise). On macOS, dtruss provides a similar interface.

  • Git's split index may cause the behaviour to vary unpredictably, because Git conditionally pushes the changes from the split index into the shared index depending on the number of entries in the split index. You can control for this by copying your entire repository (using cp -rp to preserve timestamps) and running the same sequence of commands in each copy (Git 2.35.1 in one copy and Git 2.36.1 in the other copy).

  • "Racy timestamps" can cause commands such as git status to behave differently depending on how quickly you run them after files were changed. You may need to wait a second or two before running git status to get stable behaviour. (Also note that git status modifies the index in some cases, so it may behave differently if you run it a second time.)

  • If you are able to build Git from source (and can reproduce the issue reliably), you can use git bisect to find the bad commit (should only take ~9 build-and-test steps to bisect v2.35.1 to v2.36.1).
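
The strace suggestion above can be sketched as follows, assuming Linux with strace installed. The helper names normalize_trace and trace_status are hypothetical; the sed expression is the one quoted in the list.

```shell
# Replace hexadecimal addresses with a placeholder so two traces can be diffed.
normalize_trace() {
    sed 's/0x[0-9a-fA-F]*/0x?/g'
}

# Hypothetical helper: record the syscalls of `git status` in the repository
# copy "$1" and write the normalized trace to the file "$2".
trace_status() {
    strace -o "$2.raw" git -C "$1" status >/dev/null &&
        normalize_trace <"$2.raw" >"$2"
}
```

Running trace_status once per repository copy, with the respective Git version first on PATH, gives two files that can be compared with diff.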

tom
  • One thing I noticed is that setting core.splitIndex back to its default (false) brings the timing back to normal, about 3 to 4 seconds. Is there any correlation between core.splitIndex and the untracked cache? How is the untracked cache updated when there are two split index files (index and sharedindex.)? – checked Jun 23 '22 at 00:15
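
The setting change described in the comment above could look like this. It is only a sketch: the function name disable_split_index and its path argument are hypothetical, and it assumes split index mode was enabled through core.splitIndex.

```shell
# Hypothetical helper: turn off split index mode in the repository at "$1"
# (defaults to the current directory).
disable_split_index() {
    repo=${1:-.}
    # Restore the default so later index reads do not re-enable split mode.
    git -C "$repo" config core.splitIndex false
    # Rewrite the index as a single file that no longer uses a shared index.
    git -C "$repo" update-index --no-split-index
}
```

The config is changed first because update-index warns when the requested mode contradicts the configured value.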

I have not found a way to manage the untracked cache efficiently in Git 2.36.1 so far.

So, to work around the current regression, I cloned the 2 branches (default and patch branch) into 2 separate clones. This gives better performance for now, because I don't need to switch branches.

git clone --branch <branchname> <remote-repo-url>

I wanted better git status performance, which I currently get with this solution. But I still want both branches in a single clone with better performance, and I will post a solution here if I find one.

Thanks.

checked
  • If you need to switch branches while cloning only once, don't forget [`git worktree`](https://stackoverflow.com/a/30185564/6309): two different folders, one for each branch, but from the same cloned repository. – VonC Jun 17 '22 at 05:24
  • Oh, I didn't know about this. Thanks a lot, I will read about it. – checked Jun 17 '22 at 05:29
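
A minimal sketch of the worktree approach from the comment above. The function name add_branch_worktree and its arguments are hypothetical; git worktree add <path> <branch> checks out an existing branch into a second directory that shares the same repository.

```shell
# Hypothetical helper: check out branch "$2" of the clone at "$1" into the
# separate directory "$3". A relative "$3" is resolved relative to "$1".
add_branch_worktree() {
    repo=$1 branch=$2 dir=$3
    git -C "$repo" worktree add "$dir" "$branch"
}
```

Each directory then has its own index, so git status in one does not depend on the branch checked out in the other.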