
I have a repo with thousands of remotes, and I'd like to pull from all of them at the same time, ideally with a way to specify a maximum number to run concurrently.

I wasn't able to find anything related to this in the manpages, on Google, or on the git-scm site.

To be perfectly clear: I do not want to run one command over multiple repos; I have one repo with thousands of remotes.

This has nothing to do with submodules, so please don't bring them up; submodules are unrelated to git remotes.

Incognito
  • Possible duplicate of [How to speed up / parallelize downloads of git submodules using git clone --recursive?](http://stackoverflow.com/questions/26023395/how-to-speed-up-parallelize-downloads-of-git-submodules-using-git-clone-recu) – phuclv May 18 '17 at 04:11
  • No, this has nothing to do with submodules. This is about an entirely different feature of git. – Incognito Jun 01 '17 at 22:43

3 Answers


I'm pretty sure you have to write your own code to do this.

As CodeWizard says in a comment, Git needs to lock parts of the repository. Some of these locks are bound to collide at times if you simply run multiple `git fetch` processes in parallel within a single repository.

You might also want some kind of remote-ordering strategy, since fetching from remoteA, remoteB, and remoteC in parallel can transfer many objects redundantly if, say, remoteB is generally (but not always) a superset of remoteA and remoteC.¹ The same issue exists for sequential `git fetch` operations, but there it matters considerably less. Suppose, for example, that there are 5000 objects (some commits, some trees, and some blobs) on A that you do not yet have, 5000 others on C, and all 10000 on B. If you fetch sequentially, in any order, you pick up either 5k, then 5k, then 0; or 10k, then 0, then 0; because by the time you move on to the next remote, you have already collected and stored the 5k or 10k incoming objects. But if you fetch from all three in parallel, you bring in 5k, 5k, and 10k objects, and only then discover that you have doubled your workload.


¹ If B is always a superset, simply go to B first (sequentially), then go to A and C in parallel solely for their references, which will point to objects you now have.
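
A minimal sketch of the roll-your-own approach, assuming a POSIX shell and an xargs that supports -P (GNU and BSD xargs both do); the -P value is an arbitrary concurrency cap, and, per the locking caveat above, any fetches that lose a lock race may need a retry pass:

    # One `git fetch <remote>` per configured remote, at most 8 at a time.
    git remote | xargs -P 8 -n 1 git fetch

Each input line (one remote name) becomes one fetch invocation; this is just a sketch, not a supported interface, so check exit statuses and rerun for remotes that failed.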

torek
  • What's nice about my specific issue is that I know for a fact there are no common objects. However, one thing about this seems strange to me: I thought git was event-sourcing commit objects, so in theory it should be able to collect n-many commit refs and do the integrity map on them later. I think you're right that I may have to author my own solution here. – Incognito Mar 26 '17 at 14:11
  • Even without common commits, there might still be common trees or blobs (although I would suspect it would be less, er, "common" :-) ). But with the no-common-commits guarantee, you might want to look into using `git bundle` behind the scenes, and then sequentially scanning (as in run `git fetch` on) completed bundles and updating metadata for the given remote. – torek Mar 26 '17 at 16:21
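
To make that last comment concrete, here is one possible (entirely hypothetical) shape for the bundle approach: mirror each remote into its own scratch repository in parallel (each clone is a separate repository, so the parallel phase shares no locks), bundle each mirror, then fetch the completed bundles into the real repository sequentially. The /tmp paths and the concurrency level are illustrative only:

    # Phase 1: mirror and bundle every remote, up to 8 at a time.
    git remote | xargs -P 8 -I {} sh -c '
        git clone --quiet --mirror "$(git remote get-url "$1")" "/tmp/scratch-$1" &&
        git -C "/tmp/scratch-$1" bundle create "/tmp/$1.bundle" --all
    ' sh {}

    # Phase 2: fetch each completed bundle into this repo, one at a time.
    for r in $(git remote); do
        git fetch "/tmp/$r.bundle" "refs/heads/*:refs/remotes/$r/*"
    done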

Starting from Git 2.24, it is now possible with the `--jobs` option.

Some examples:

Fetching 3 remotes, with 2 fetched in parallel:

git fetch -j2 --multiple remote1 remote2 remote3

Fetching all remotes, with 5 fetched in parallel:

git fetch --jobs=5 --all

If you have thousands of remotes, you may not want to fetch all of them, and they may fall into logical groups. Instead of specifying them on the command line (with `--multiple`), you can also define remote groups in `.git/config` like this:

[remotes]
    group1 = remote1 remote2 origin
    group2 = remote55 remote66

And then use these groups in the fetch command.

This command: `git fetch --multiple -j4 group1 group2 remote10` fetches the remotes remote1, remote2, origin, remote55, remote66, and remote10, with 4 fetches running in parallel.
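
As a side note, the same groups can be written from the command line instead of editing `.git/config` by hand; a small sketch, reusing the illustrative group and remote names from above:

    # Each remotes.<group> entry is a space-separated list of remote
    # names; these write the same [remotes] section shown above.
    git config remotes.group1 "remote1 remote2 origin"
    git config remotes.group2 "remote55 remote66"

    # Fetch both groups plus one extra remote, 4 jobs at a time.
    git fetch --multiple -j4 group1 group2 remote10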

Mariusz Pawelski

To `git pull` multiple remotes in parallel: starting from Git 2.24, it is now possible with the `--jobs` option.

Then make sure to use Git 2.40 (Q1 2023): `git fetch --jobs=0` used to hit a `BUG()`, which has been corrected to use the available CPUs.

See commit c39952b (20 Feb 2023) by Matthias Aßhauer (rimrul).
(Merged by Junio C Hamano -- gitster -- in commit d180cc2, 24 Feb 2023)

fetch: choose a sensible default with --jobs=0 again

Reported-by: Drew Noakes
Signed-off-by: Matthias Aßhauer

Prior to 51243f9 ("run-command API: don't fall back on online_cpus()", 2022-10-12, Git v2.39.0-rc0 -- merge listed in batch #7), `git fetch --multiple --jobs=0` would choose some default number of jobs, similar to `git -c fetch.parallel=0 fetch --multiple`.
While our documentation only ever promised that `fetch.parallel` would fall back to a "sensible default", it makes sense to do the same for `--jobs`.
So fall back to `online_cpus()` and not `BUG()` out.

This fixes "`--jobs=0` no longer does any work" (git-for-windows/git issue 4302).
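
So, with Git 2.40 or later, letting Git pick the parallelism itself should look like this (the two forms below ought to be equivalent; a value of 0 means "one job per available CPU" rather than a `BUG()` crash):

    git fetch --jobs=0 --all
    git -c fetch.parallel=0 fetch --all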

VonC