If you don't need to stay within R for this, then the xargs command-line utility allows parallel execution. (I'm using the Linux version from the findutils set of utilities. I believe this option is also supported by the version of xargs that ships with git-bash. I don't know whether the macOS binary is installed by default or whether it includes this option, so ymmv.)
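(A quick way to check what your platform provides, from the R console; Sys.which returns the full path of each tool, or an empty string if it is not found on the PATH. This confirms the binaries exist, though not whether your xargs supports the -P option used below.)

# "" for any tool not found on the PATH
Sys.which(c("xargs", "wget"))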
For proof, I'll create a mywget script (made executable with chmod +x mywget) that prints the start time (and its arguments) and then passes all arguments to wget.
(mywget)
#!/bin/sh
# Print the start time and the arguments, then pass everything through to wget.
echo "$(date) :: ${@}"
wget "${@}"
I also have a text file urllist with one URL per line (crafted so that I don't have to URL-encode anything or worry about spaces, etc.). (Because I'm benchmarking this against a personal remote server and don't want to invite the slashdot effect, I'll obfuscate the URLs here.)

(urllist)
https://somedomain.com/quux0
https://somedomain.com/quux1
https://somedomain.com/quux2
First, no parallelization, simply consecutive downloads (the default). (The -a urllist tells xargs to read items from the file urllist instead of stdin. The -q tells wget to be quiet; it isn't required, but it is certainly very helpful when doing things in parallel, since the usual verbose output draws progress bars that overlap each other.)
$ time xargs -n1 -a urllist ./mywget -q
Tue Feb 1 17:27:01 EST 2022 :: -q https://somedomain.com/quux0
Tue Feb 1 17:27:10 EST 2022 :: -q https://somedomain.com/quux1
Tue Feb 1 17:27:12 EST 2022 :: -q https://somedomain.com/quux2
real 0m13.375s
user 0m0.210s
sys 0m0.958s
Second, adding -P3 so that up to 3 processes run simultaneously. (The -n1 ensures that each call to ./mywget gets only one URL; you can adjust it, e.g. -n2, if you want a single call to download multiple files consecutively.)
$ time xargs -n1 -P3 -a urllist ./mywget -q
Tue Feb 1 17:27:46 EST 2022 :: -q https://somedomain.com/quux0
Tue Feb 1 17:27:46 EST 2022 :: -q https://somedomain.com/quux1
Tue Feb 1 17:27:46 EST 2022 :: -q https://somedomain.com/quux2
real 0m13.088s
user 0m0.272s
sys 0m1.664s
In this case, as BenBolker suggested in a comment, parallel downloading saved me nothing: it still took 13 seconds, likely because bandwidth rather than per-file startup was the bottleneck. However, you can see that in the first block the downloads started sequentially, with gaps of 9 and 2 seconds between the three start times. (We can infer that the first file is much larger, taking about 9 seconds to download, and the second about 2 seconds.) In the second block, all three started at the same time.
(Side note: this doesn't require a shell script at all; you can use R's system or the processx::run functions to call xargs -n1 -P3 wget -q with a text file of URLs that you create in R. So you can still do this comfortably from the warmth of your R console; a sketch follows.)
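For example, a minimal sketch of that idea using processx::run, assuming GNU xargs and wget are on your PATH (the URLs are the obfuscated placeholders from above):

# Write the URLs to a temporary file, one per line.
urls <- c("https://somedomain.com/quux0",
          "https://somedomain.com/quux1",
          "https://somedomain.com/quux2")
urlfile <- tempfile("urllist")
writeLines(urls, urlfile)
# Up to 3 parallel wget processes, one URL per call, reading from the file.
processx::run("xargs", c("-n1", "-P3", "-a", urlfile, "wget", "-q"))
unlink(urlfile)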