
We are using Git for our project. The repository is rather huge (the .git folder is about 8 GB).

We are using git checkout -f in a post-receive hook to update the working tree.
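
The hook boils down to the usual pattern (a minimal sketch; the work-tree path is just an example):

#!/bin/sh
# post-receive in the bare repository: force-update the deployed work tree
GIT_WORK_TREE=/var/www/site git checkout -f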

The problem is that checking out even a couple of slightly changed files takes too long, approximately 20 seconds. I have no idea why it takes so long.

Could the repository size be the problem?

What steps or tools should I try to locate and investigate the problem further?

Thank you for any help.

Regards, Alex

Yuri
  • An 8 GB repository sounds like you are using Git wrong. Is the checked-out tree also of a similar size, or did you e.g. just put all your binary-file revisions in there? I think I remember that some KDE test repos were around 2GB, and the whole Linux kernel history is well below 1GB. – Benjamin Bannier Nov 06 '12 at 13:12
  • Yes, that's because of binary files, and I wonder if that can be the reason for the slow checkout, when all it has to do is update two slightly changed files? – user1788078 Nov 06 '12 at 14:22
  • `git checkout` should be faster on big repo with git 2.8 (March 2016). See [my edited answer below](http://stackoverflow.com/a/13253896/6309) – VonC Feb 05 '16 at 09:27

2 Answers


Original answer (Nov 2012)

I can confirm that Git slows down considerably if you keep a git directory (.git) that large.

You can see an illustration in this thread (not because of large files, but because of a large number of files and a long commit history):

The test repo has 4 million commits, linear history and about 1.3 million files.
The size of the .git directory is about 15GB, and it has been repacked with:

git repack -a -d -f --max-pack-size=10g --depth=100 --window=250

This repack took about 2 days on a beefy machine (i.e., lots of RAM and flash storage).
The size of the index file is 191 MB.
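
(For reference: -a -d repacks everything into fresh packs and deletes the now-redundant old ones, -f forces deltas to be recomputed rather than reused, and --window/--depth trade repack time for better delta compression.)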

At the very least, you could consider splitting the repo, isolating the binaries in their own Git repository and using submodules to keep the source and binary repositories linked, as sketched below.
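
A minimal sketch of that split (the URL and path are just examples):

# in the source repository, reference the extracted binary repo as a submodule
git submodule add https://example.com/binaries.git assets
git commit -m "Track binary assets as a submodule"

The source repository then records only a commit pointer for the submodule, so a checkout there no longer has to rewrite the binaries themselves.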

It is best to store large binary files (especially if they are generated) outside of a source repository.
An "artifact" repository, like Nexus, is recommended.

All-Git solutions for keeping those binaries under Git are git-annex or git-media, as presented in "How to handle a large git repository?".
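
For illustration, a minimal git-annex workflow (the file name is an example):

git annex init
git annex add videos/big-demo.mp4   # replaces the file with a symlink into .git/annex
git commit -m "Add big-demo.mp4 via git-annex"

The actual content lives under .git/annex and can be fetched from other remotes on demand, keeping regular checkouts small.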


Update February 2016: git 2.8 (March 2016) should improve the git checkout performance significantly.

See commit a672095 (22 Jan 2016), and commit d9c2bd5 (21 Dec 2015) by David Turner (dturner-tw).
(Merged by Junio C Hamano -- gitster -- in commit 201155c, 03 Feb 2016)

unpack-trees: fix accidentally quadratic behavior

While unpacking trees (e.g. during git checkout), when we hit a cache entry that's past and outside our path, we cut off iteration.

This provides about a 45% speedup on git checkout between master and master^20000 on Twitter's monorepo.
Speedup in general will depend on repository structure, number of changes, and packfile packing decisions.

do_compare_entry: use already-computed path

In traverse_trees, we generate the complete traverse path for a traverse_info.
Later, in do_compare_entry, we used to go do a bunch of work to compare the traverse_info to a cache_entry's name without computing that path.
But since we already have that path, we don't need to do all that work.
Instead, we can just put the generated path into the traverse_info, and do the comparison more directly.

This makes git checkout much faster -- about 25% on Twitter's monorepo.
Deeper directory trees are likely to benefit more than shallower ones.


Using sparse-checkout, a checkout of a huge repository can be sped up considerably.

And that improved even more with Git 2.33 (Q3 2021), where git checkout and git commit learned to work without unnecessarily expanding sparse indexes.

See commit e05cdb1, commit 70569fa (20 Jul 2021), and commit 1ba5f45, commit f934f1b, commit daa1ace, commit 11042ab, commit 0d53d19 (29 Jun 2021) by Derrick Stolee (derrickstolee).
(Merged by Junio C Hamano -- gitster -- in commit 506d2a3, 04 Aug 2021)

checkout: stop expanding sparse indexes

Signed-off-by: Derrick Stolee

Previous changes did the necessary improvements to unpack-trees.c and diff-lib.c in order to modify a sparse index based on its comparison with a tree.
The only remaining work is to remove some ensure_full_index() calls and add tests that verify that the index is not expanded in our interesting cases.
Include 'switch' and 'restore' in these tests, as they share a base implementation with 'checkout'.

Here are the relevant performance results from p2000-sparse-operations.sh:

Test                                     HEAD~1           HEAD 
--------------------------------------------------------------------------------
2000.18: git checkout -f - (full-v3)     0.49(0.43+0.03)  0.47(0.39+0.05) -4.1% 
2000.19: git checkout -f - (full-v4)     0.45(0.37+0.06)  0.42(0.37+0.05) -6.7% 
2000.20: git checkout -f - (sparse-v3)   0.76(0.71+0.07)  0.04(0.03+0.04) -94.7% 
2000.21: git checkout -f - (sparse-v4)   0.75(0.72+0.04)  0.05(0.06+0.04) -93.3%  

It is important to compare the full index case to the sparse index case, as the previous results for the sparse index were inflated by the index expansion.
For index v4, this is an 88% improvement.

On an internal repository with over two million paths at HEAD and a sparse-checkout definition containing ~60,000 of those paths, git checkout went from 3.5s to 297ms with this change.
The theoretical optimum where only those ~60,000 paths exist was 275ms, so the extra sparse directory entries contribute a 22ms overhead.
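
In practice, enabling this looks roughly as follows (the directory names are examples):

git sparse-checkout init --cone
git config index.sparse true
git sparse-checkout set src docs

With index.sparse set, commands like checkout, switch, restore and commit can work on the sparse index directly instead of expanding it first.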

VonC

Another approach to this issue is to parallelize checkout (starting with Git 2.32, Q2 2021).
As explained in this patch series (still in progress at the time):

This series adds parallel workers to the checkout machinery.

The cache entries are distributed among helper processes which are responsible for reading, filtering and writing the blobs to the working tree.
This should benefit all commands that call unpack_trees() or check_updates(), such as: checkout, clone, sparse-checkout, checkout-index, etc.

Local:

           Clone                  Checkout I             Checkout II
Sequential  8.180 s ± 0.021 s      6.936 s ± 0.030 s      2.585 s ± 0.005 s
10 workers  3.406 s ± 0.187 s      2.164 s ± 0.033 s      1.050 s ± 0.021 s
Speedup     2.40 ± 0.13            3.21 ± 0.05            2.46 ± 0.05

For example, with Git 2.32 (Q2 2021), there are preparatory API changes for parallel checkout.

See commit ae22751, commit 30419e7, commit 584a0d1, commit 49cfd90, commit d052cc0 (23 Mar 2021) by Matheus Tavares (matheustavares).
See commit f59d15b, commit 3e9e82c, commit 55b4ad0, commit 38e9584 (16 Dec 2020) by Jeff Hostetler (Jeff-Hostetler).
(Merged by Junio C Hamano -- gitster -- in commit c47679d, 02 Apr 2021)

convert: add [async_]convert_to_working_tree_ca() variants

Signed-off-by: Jeff Hostetler
Signed-off-by: Matheus Tavares

Separate the attribute gathering from the actual conversion by adding _ca() variants of the conversion functions.
These variants receive a precomputed 'struct conv_attrs' and thus do not rely on an index state.
They will be used in a future patch adding parallel checkout support, for two reasons:

  • We will already load the conversion attributes in checkout_entry(), before conversion, to decide whether a path is eligible for parallel checkout.
    Therefore, it would be wasteful to load them again later, for the actual conversion.
  • The parallel workers will be responsible for reading, converting and writing blobs to the working tree.
    They won't have access to the main process' index state, so they cannot load the attributes.
    Instead, they will receive the preloaded ones and call the _ca() variant of the conversion functions.
    Furthermore, the attributes machinery is optimized to handle paths in sequential order, so it's better to leave it for the main process, anyway.

And:

With Git 2.32 (Q2 2021), the checkout machinery has been taught to perform the actual write-out of the files in parallel when able.

See commit 68e66f2 (19 Apr 2021), and commit 1c4d6f4, commit 7531e4b, commit e9e8adf, commit 04155bd (18 Apr 2021) by Matheus Tavares (matheustavares).
(Merged by Junio C Hamano -- gitster -- in commit a1cac26, 30 Apr 2021)

parallel-checkout: add configuration options

Co-authored-by: Jeff Hostetler
Signed-off-by: Matheus Tavares

Make parallel checkout configurable by introducing two new settings:

  • checkout.workers and
  • checkout.thresholdForParallelism.

The first defines the number of workers (where one means sequential checkout), and the second defines the minimum number of entries to attempt parallel checkout.

To decide the default value for checkout.workers, the parallel version was benchmarked during three operations in the linux repo, with cold cache: cloning v5.8, checking out v5.8 from v2.6.15 (checkout I) and checking out v5.8 from v5.7 (checkout II).
The four tables below show the mean run times and standard deviations for 5 runs in: a local file system on SSD, a local file system on HDD, a Linux NFS server, and Amazon EFS (all on Linux).
Each parallel checkout test was executed with the number of workers that brings the best overall results in that environment.

Local SSD:

             Sequential             10 workers            Speedup
Clone        8.805 s ± 0.043 s      3.564 s ± 0.041 s     2.47 ± 0.03 
Checkout I   9.678 s ± 0.057 s      4.486 s ± 0.050 s     2.16 ± 0.03 
Checkout II  5.034 s ± 0.072 s      3.021 s ± 0.038 s     1.67 ± 0.03  

Local HDD:

             Sequential             10 workers             Speedup
Clone        32.288 s ± 0.580 s     30.724 s ± 0.522 s    1.05 ± 0.03 
Checkout I   54.172 s ±  7.119 s    54.429 s ± 6.738 s    1.00 ± 0.18 
Checkout II  40.465 s ± 2.402 s     38.682 s ± 1.365 s    1.05 ± 0.07  

Linux NFS server (v4.1, on EBS, single availability zone):

             Sequential             32 workers            Speedup
Clone        240.368 s ± 6.347 s    57.349 s ± 0.870 s    4.19 ± 0.13 
Checkout I   242.862 s ± 2.215 s    58.700 s ± 0.904 s    4.14 ± 0.07 
Checkout II  65.751 s ± 1.577 s     23.820 s ± 0.407 s    2.76 ± 0.08  

EFS (v4.1, replicated over multiple availability zones):

             Sequential             32 workers            Speedup
Clone        922.321 s ± 2.274 s    210.453 s ± 3.412 s   4.38 ± 0.07 
Checkout I   1011.300 s ± 7.346 s   297.828 s ± 0.964 s   3.40 ± 0.03 
Checkout II  294.104 s ± 1.836 s    126.017 s ± 1.190 s   2.33 ± 0.03  

The above benchmarks show that parallel checkout is most effective on repositories located on an SSD or over a distributed file system.
For local file systems on spinning disks, and/or older machines, parallelism does not always bring good performance.
For this reason, the default value for checkout.workers is one, i.e. sequential checkout.

To decide the default value for checkout.thresholdForParallelism, another benchmark was executed in the "Local SSD" setup, where parallel checkout showed to be beneficial.
This time, we compared the runtime of git checkout -f, with and without parallelism, after randomly removing an increasing number of files from the Linux working tree.
The "sequential fallback" column below corresponds to the executions where checkout.workers was 10 but checkout.thresholdForParallelism was equal to the number of to-be-updated files plus one (so that we end up writing sequentially).
Each test case was sampled 15 times, and each sample had a randomly different set of files removed.
Here are the results:

             sequential fallback   10 workers           speedup
10   files    772.3 ms ± 12.6 ms   769.0 ms ± 13.6 ms   1.00 ± 0.02 
20   files    780.5 ms ± 15.8 ms   775.2 ms ±  9.2 ms   1.01 ± 0.02 
50   files    806.2 ms ± 13.8 ms   767.4 ms ±  8.5 ms   1.05 ± 0.02 
100  files    833.7 ms ± 21.4 ms   750.5 ms ± 16.8 ms   1.11 ± 0.04 
200  files    897.6 ms ± 30.9 ms   730.5 ms ± 14.7 ms   1.23 ± 0.05 
500  files   1035.4 ms ± 48.0 ms   677.1 ms ± 22.3 ms   1.53 ± 0.09 
1000 files   1244.6 ms ± 35.6 ms   654.0 ms ± 38.3 ms   1.90 ± 0.12 
2000 files   1488.8 ms ± 53.4 ms   658.8 ms ± 23.8 ms   2.26 ± 0.12  

From the above numbers, 100 files seems to be a reasonable default value for the threshold setting.

Note: Up to 1000 files, we observe a drop in the execution time of the parallel code with an increase in the number of files.
This is a rather odd behavior, but it was observed in multiple repetitions.
Above 1000 files, the execution time increases according to the number of files, as one would expect.

About the test environments: Local SSD tests were executed on an i7-7700HQ (4 cores with hyper-threading) running Manjaro Linux.
Local HDD tests were executed on an Intel(R) Xeon(R) E3-1230 (also 4 cores with hyper-threading), HDD Seagate Barracuda 7200.14 SATA 3.1, running Debian.
NFS and EFS tests were executed on an Amazon EC2 c5n.xlarge instance, with 4 vCPUs.
The Linux NFS server was running on an m6g.large instance with 2 vCPUs and a 1 TB EBS GP2 volume.
Before each timing, the linux repository was removed (or checked out back to its previous state), and sync && sysctl vm.drop_caches=3 was executed.

git config now includes in its man page:

checkout.workers

The number of parallel workers to use when updating the working tree. The default is one, i.e. sequential execution. If set to a value less than one, Git will use as many workers as the number of logical cores available. This setting and checkout.thresholdForParallelism affect all commands that perform checkout. E.g. checkout, clone, reset, sparse-checkout, etc.

Note: parallel checkout usually delivers better performance for repositories located on SSDs or over NFS. For repositories on spinning disks and/or machines with a small number of cores, the default sequential checkout often performs better. The size and compression level of a repository might also influence how well the parallel version performs.

checkout.thresholdForParallelism

When running parallel checkout with a small number of files, the cost of subprocess spawning and inter-process communication might outweigh the parallelization gains.

This setting allows you to define the minimum number of files for which parallel checkout should be attempted.

The default is 100.
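
For example, to opt in on a repository that lives on an SSD or NFS (the worker count is a starting point to tune, not a universal best value):

git config checkout.workers 10
git config checkout.thresholdForParallelism 100

A checkout.workers value below one makes Git use one worker per logical core, as described above.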


And, still with Git 2.32 (Q2 2021), the final part of "parallel checkout":

See commit 87094fc, commit d590422, commit 2fa3cba, commit 6a7bc9d, commit d0e5d35, commit 70b052b, commit 6053950, commit 9616882 (04 May 2021) by Matheus Tavares (matheustavares).
(Merged by Junio C Hamano -- gitster -- in commit a737e1f, 16 May 2021)

checkout-index: add parallel checkout support

Signed-off-by: Matheus Tavares

Allow checkout-index to use the parallel checkout framework, honoring the checkout.workers configuration.

There are two code paths in checkout-index which call checkout_entry(), and thus, can make use of parallel checkout:

  • checkout_file(), which is used to write paths explicitly given at the command line; and
  • checkout_all(), which is used to write all paths in the index, when the --all option is given.

In both operation modes, checkout-index doesn't abort immediately on a checkout_entry() failure.
Instead, it tries to check out all remaining paths before exiting with a non-zero exit code.
To keep this behavior when parallel checkout is being used, we must allow run_parallel_checkout() to try writing the queued entries before we exit, even if we already got an error code from a previous checkout_entry() call.

However, checkout_all() doesn't return on errors; it calls exit() with code 128. We could make it call run_parallel_checkout() before exiting, but the code is easier to follow if we unify the exit path for both checkout-index modes in cmd_checkout_index() and let that function take care of the interactions with the parallel checkout API.
So let's do that.
So let's do that.

With this change, we also have to consider whether we want to keep using 128 as the error code for git checkout-index --all, while we use 1 for git checkout-index <path> (even when the actual error is the same).
Since there is not much value in having code 128 only for --all, and there is no mention about it in the docs (so it's unlikely that changing it will break any existing script), let's make both modes exit with code 1 on checkout_entry() errors.
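
For reference, the two invocation modes discussed above (the path is an example):

git checkout-index -f -- path/to/file
git checkout-index -f --all

After this change, both exit with code 1 when a checkout_entry() call fails.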


Before Git 2.33 (Q3 2021), the parallel checkout codepath did not initialize the object ID field used to talk to the worker processes in a futureproof way.

See commit 3d20ed2 (17 May 2021) by Matheus Tavares (matheustavares).
(Merged by Junio C Hamano -- gitster -- in commit bb6a63a, 10 Jun 2021)

parallel-checkout: send the new object_id algo field to the workers

Signed-off-by: Matheus Tavares

An object_id storing a SHA-1 name has some unused bytes at the end of the hash array.
Since these bytes are not used, they are usually not initialized to any value either.
However, at parallel_checkout.c:send_one_item() the object_id of a cache entry is copied into a buffer which is later sent to a checkout worker through a pipe write().
This makes Valgrind complain about passing uninitialized bytes to a syscall.

However, since cf09832 ("hash: add an algo member to struct object_id", 2021-04-26, Git v2.32.0-rc0 -- merge listed in batch #15), using hashcpy() is no longer sufficient here, as it won't copy the new algo field from the object_id.
Let's add and use a new function which meets both our requirements of copying all the important object_id data while still avoiding the uninitialized bytes, by padding the end of the hash array in the destination object_id.
With this change, we also no longer need the destination buffer from send_one_item() to be initialized with zeros, so let's switch from xcalloc() to xmalloc() to make this clear.

VonC