
In git, is it possible to fetch multiple remotes in parallel?

Would the commands below work without clashing with Git's file locking in the repository?

git config gc.auto 0
git remote | xargs --max-procs=4 -n 1 git fetch
git gc

I ran a small test with several repositories and it seems to work when the repositories are unrelated to each other.

It would be nice to get feedback on whether there is a clear technical reason why the parallel fetch command above wouldn't work.

Submodules already support parallel fetching, but fetching remotes in parallel would also be useful when using the git-subtree approach.
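For reference, the full sequence I'm experimenting with looks roughly like this (a sketch; --max-procs=4 is an arbitrary degree of parallelism and the last line simply restores the default auto-gc behaviour):

# Disable auto-gc, fetch all remotes with up to 4 concurrent git-fetch
# processes, repack once at the end, then restore the default auto-gc setting.
git config gc.auto 0
git remote | xargs --max-procs=4 -n 1 git fetch
git gc
git config --unset gc.auto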

Similar question: git pull multiple remotes in parallel

JohannPetr

3 Answers


The answer is actually maybe. In particular:

git remote | xargs --max-procs=4 -n 1 git fetch

As you've seen, this actually works when tested, up to a point. I once wrote a fancier version of the same kind of thing, with display control of the fetching process, all written in Python. (It turns out there's a bug in git fetch --progress, though, so that it does not work right with pipes; you must use ptys.)

without clashing with the git file locking ... it seems to work when all repositories are unrelated to each other.

That's the rub: each fetch assumes it can get its locks. The fetches need to lock each remote-tracking name, and usually that works just fine since the names are separate: remote A does not interfere with remote B because refs/remotes/A/master and refs/remotes/B/master use different locks. The final repacking may fail, though, unless you do what you did: disable auto-gc and then run GC yourself (you should also re-enable it afterward).

You may also end up fetching more data than necessary (as I noted in the other answer). There is not much you can do about this without external information, e.g., maybe there's one remote you should always fetch first.
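For example (a sketch; origin is just a placeholder for whichever remote you want to prioritize), you could fetch that one remote first and only parallelize the rest:

# Fetch the preferred remote first so its objects are already in place,
# then fetch the remaining remotes with up to 4 concurrent processes.
git fetch origin
git remote | grep -v '^origin$' | xargs --max-procs=4 -n 1 git fetch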

torek

but the final repacking may fail unless you do what you did, disable auto-gc and then run GC yourself

Actually, with Git 2.23 (Q3 2019), that might not be necessary anymore.

"git fetch" that grabs from a group of remotes learned to run the auto-gc only once at the very end.

See commit c3d6b70 (19 Jun 2019) by Nguyễn Thái Ngọc Duy (pclouds).
(Merged by Junio C Hamano -- gitster -- in commit 892d3fb, 09 Jul 2019)

fetch: only run 'gc' once when fetching multiple remotes

In multiple remotes mode, git-fetch is launched for n-1 remotes and the last remote is handled by the current process. Each of these processes will in turn run 'gc' at the end.

This is not really a problem because even if multiple 'gc --auto' is run at the same time we still handle it correctly.
It does show multiple "auto packing in the background" messages though.
And we may waste some resources when gc actually runs because we still do some stuff before checking the lock and moving it to background.

So let's try to avoid that.

We should only need one 'gc' run after all objects and references are added anyway.

Add a new option --no-auto-gc that will be used by those n-1 processes.
'gc --auto' will always run on the main fetch process (*).

(*) even if we fetch remotes in parallel at some point in future, this should still be fine because we should "join" all those processes before this step.
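In other words, with Git 2.23+ you can let a single git fetch invocation walk over several remotes, and only the main process will trigger 'gc --auto'. A sketch (origin, foo, bar and mygroup are placeholder names):

# Fetch several remotes through one git-fetch invocation:
git fetch --multiple origin foo bar

# Or define a remote group once and fetch it by name:
git config remotes.mygroup "origin foo bar"
git fetch mygroup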

VonC

It seems to work for me out of the box with

git fetch -j 8   

using Git 2.33.1. The -j switch is shorthand for --jobs. I remember looking for this earlier but only finding out about it today; the switch might be pretty new.

Some timings for a repo with four GitHub remotes:

$ \time git fetch --all
Fetching origin
Fetching foo
Fetching bar
Fetching baz
        6.40 real         1.28 user         0.21 sys
$ \time git fetch --all -j 8
Fetching origin
Fetching foo
Fetching bar
Fetching baz
        2.06 real         1.30 user         0.16 sys
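If you don't want to pass the flag every time, there is also (if I remember correctly, since Git 2.24) a fetch.parallel configuration that sets the default number of jobs for --all/--multiple fetches; a sketch with an arbitrary value of 8:

# Make parallel fetching the default for --all / --multiple fetches
# (0 would pick a reasonable default instead of a fixed number).
git config fetch.parallel 8
git fetch --all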
krlmlr
  • Please let others know why the switch is `-j`. It's because of this in the manual: `-j` / `--jobs=` *Number of parallel children to be used for all forms of fetching.* – MS Berends Jan 20 '22 at 07:05
  • What you did is not parallel. Per the docs: *If the `--multiple` option was specified, the different remotes will be fetched in parallel. (…) Typically, parallel recursive and multi-remote fetches will be faster. By default fetches are performed sequentially, not in parallel.* https://git-scm.com/docs/git-fetch – MS Berends Jan 20 '22 at 07:08
  • Added the verbose option and timings. – krlmlr Jan 20 '22 at 08:14