
Could you advise on an effective method for downloading a large number of files from EBI: https://github.com/eQTL-Catalogue/eQTL-Catalogue-resources/tree/master/tabix

We could run wget sequentially on each file, and I have seen some information about using a Python script for this: How to parallelize file downloads?

Are there also complementary ways of doing this with a bash script or with R?
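
For reference, the sequential baseline I have in mind is just a loop over a plain-text list of URLs (a minimal sketch; files.txt is a hypothetical file with one URL per line):

#!/bin/bash
# sequential baseline: download the files one after another
while read -r url; do
    wget "$url"
done < files.txt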

RavinderSingh13
Bogdan
  • It's not obvious to me how parallelization is going to help, since the bottleneck is probably in the network connection ... ?? – Ben Bolker Feb 01 '22 at 21:41
  • @BenBolker, not guaranteed, but I have experienced remote servers that throttled *each connection*, so parallel connections were limited by policy and not network capacity. Further, since it is a limit of the network capacity between local and *each* remote site, if the URLs resolve to different servers *and* the local pipe is big enough, I can see benefits to parallel downloads. – r2evans Feb 01 '22 at 21:51
  • Search on StackOverflow for **GNU Parallel** tagged answers using `[gnu-parallel]` (see the sketch after these comments). – Mark Setchell Feb 01 '22 at 22:18
  • That's a lot of files to download; is that the entire eQTL catalogue? Would `wget -r ftp://ftp.ebi.ac.uk/pub/databases/spot/eQTL/csv/` suit your needs? – jared_mamrot Feb 01 '22 at 23:02
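
Following up on the GNU Parallel suggestion above: a minimal sketch, assuming GNU Parallel is installed and urls.txt is a hypothetical plain-text file with one URL per line:

# run at most 4 downloads at a time; {} is replaced by each line of urls.txt
parallel -j 4 -a urls.txt wget -q {}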

3 Answers


If you don't require R here, then the xargs command-line utility allows parallel execution. (I'm using the Linux version from the findutils set of utilities; I believe this is also supported by the xargs that ships with git-bash. I don't know whether the macOS binary is installed by default or whether it includes this option, ymmv.)

For proof, I'll create a mywget script that prints the start time (and args) and then passes all arguments to wget.

(mywget)

#!/bin/sh
# print the start time and arguments, then pass everything through to wget
echo "$(date) :: ${@}"
wget "${@}"
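
(One detail not shown above: the wrapper needs the execute bit before xargs can invoke it as ./mywget.)

chmod +x ./mywget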

I also have a text file urllist with one URL per line (it's crafted so that I don't have to encode anything or worry about spaces, etc.). (Because I'm using a personal remote server to benchmark this and I don't want the slashdot effect, I'll obfuscate the URLs here ...)

(urllist)

https://somedomain.com/quux0
https://somedomain.com/quux1
https://somedomain.com/quux2

First, no parallelization, simply consecutive (the default). (The -a urllist tells xargs to read items from the file urllist instead of stdin. The -q makes wget quiet; it's not required, but it is certainly very helpful when doing things in parallel, since the typical verbose output has progress bars that would overlap each other.)

$ time xargs -a urllist ./mywget -q
Tue Feb  1 17:27:01 EST 2022 :: -q https://somedomain.com/quux0
Tue Feb  1 17:27:10 EST 2022 :: -q https://somedomain.com/quux1
Tue Feb  1 17:27:12 EST 2022 :: -q https://somedomain.com/quux2

real    0m13.375s
user    0m0.210s
sys     0m0.958s

Second, adding -P 3 so that I run up to 3 simultaneous processes. The -n1 is required so that each call to ./mywget gets only one URL. You can adjust this if you want a single call to download multiple files consecutively.

$ time xargs -n1 -P3 -a urllist ./mywget -q
Tue Feb  1 17:27:46 EST 2022 :: -q https://somedomain.com/quux0
Tue Feb  1 17:27:46 EST 2022 :: -q https://somedomain.com/quux1
Tue Feb  1 17:27:46 EST 2022 :: -q https://somedomain.com/quux2

real    0m13.088s
user    0m0.272s
sys     0m1.664s

In this case, as BenBolker suggested in a comment, parallel download saved me nothing: it still took about 13 seconds. However, you can see that in the first block the downloads started sequentially, with 9 seconds and then 2 seconds between the three start times. (We can infer that the first file is much larger, taking about 9 seconds, and that the second file took about 2 seconds.) In the second block, all three started at the same time.

(Side note: this doesn't require a shell script at all; you can use R's system() or processx::run() to call xargs -n1 -P3 wget -q with a text file of URLs that you create in R. So you can still do this comfortably from the warmth of your R console.)

r2evans
  • Thank you. I have tested on a Mac "time xargs -n1 -P3 -a list2.txt ./myget.sh -q" using a set of URLs and the message was : xargs: illegal option -- a – Bogdan Feb 03 '22 at 05:14
  • From [`man xargs`](https://man7.org/linux/man-pages/man1/xargs.1.html) (on non-macos systems), `-a file` is to *"Read items from file instead of standard input"*. Since macos' version does not support that, you should be able to do `time xargs -n1 -P3 ./mywget -q < urllist` or `cat urllist | xargs -n1 -P3 ./mywget -q`. – r2evans Feb 03 '22 at 05:28

I had a similar task, and my approach was the following: I used Python, Redis, and supervisord.

  1. I pushed all the paths/URLs of the files I needed onto a Redis list (I just created a small Python script to read my CSV and push the entries to a Redis queue/list).
  2. Then I created another Python script to read (pull) one item from the Redis list and download it.
  3. Using supervisord, I launched 10 parallel copies of the download script, each pulling file paths from Redis and downloading the files.

It might be too complicated for your case, but this solution is very scalable and can use multiple servers, etc.
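
The scripts described above are in Python; purely as an illustration of the same queue pattern, here is a rough shell sketch using redis-cli (the queue name download-queue and the files.txt list are hypothetical, and a local Redis server is assumed):

# producer: push every URL onto a Redis list
while read -r url; do
    redis-cli rpush download-queue "$url" > /dev/null
done < files.txt

# worker: pop URLs until the queue is drained; supervisord would run
# several copies of this loop to get parallel downloads
while true; do
    url=$(redis-cli lpop download-queue)
    [ -n "$url" ] || break   # empty reply means the queue is empty
    wget -q "$url"
done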

Mike

Thank you all. I have investigated a few other ways to do it:

#!/bin/bash

############################
# 1) launch every download as a background job (no limit on concurrency)
while read -r file; do
    wget "${file}" &
done < files.txt
wait

###########################
# 2) let wget put itself in the background (-b); output goes to wget-log files
while read -r file; do
    wget -b "${file}"
done < files.txt

##########################
# 3) cap the parallelism at 10 with xargs, as in the accepted answer
cat files.txt | xargs -n 1 -P 10 wget -q
Bogdan