
I'd like to download a bunch of password-protected files hosted at a URL into a directory, from within a Python script. The vision is that I'd one day be able to use joblib or something to download each file in parallel, but for now I'm just focusing on the wget command.

Right now, I can download a single file using:

import os

# Shell out to wget; os.system returns the exit status, but this call doesn't capture it
os.system("wget --user myUser --password myPassword --no-parent -nH --recursive -A gz,pdf,bam,vcf,csv,txt,zip,html https://url/to/file")

However, there are a few issues with this. For example, there isn't a record of how the download is proceeding; I only know it is working because I can see the file appear in my directory.

Does anyone have suggestions for how I can improve this, especially in light of the fact that I'd one day like to download many files in parallel, and then go back to see which ones failed?

Thanks for your help!

1 Answer


There are some good libraries for downloading files via HTTP natively in Python, rather than launching external programs. A very popular one that is powerful yet easy to use is Requests: https://requests.readthedocs.io/en/master/
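For instance, a single authenticated download might look roughly like this. It's a minimal sketch assuming the server accepts HTTP Basic auth; the URL, credentials, and output filename are placeholders taken from your wget example:

import requests

url = "https://url/to/file"  # placeholder URL from the question
resp = requests.get(url, auth=("myUser", "myPassword"), timeout=60)  # assumes HTTP Basic auth
resp.raise_for_status()  # raises an HTTPError on 4xx/5xx, so failed downloads are easy to record

with open("file", "wb") as f:  # placeholder output filename
    f.write(resp.content)

Because raise_for_status() turns failures into exceptions, you can later wrap each download in a try/except and collect the URLs that failed, which fits your plan of running many downloads in parallel and checking which ones didn't finish.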

You'll have to implement certain features like --recursive yourself if you need them (though your example is confusing, because you use --recursive but say you're downloading one file). See, for example, recursive image download with requests.
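For illustration, a rough sketch of the "download everything in a folder" case could scrape a directory index page for links and filter by extension. This assumes the server exposes a plain HTML listing and pulls in BeautifulSoup (a separate third-party library), so treat it as a starting point rather than a drop-in solution:

import os
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

BASE_URL = "https://url/to/folder/"  # placeholder directory listing URL
AUTH = ("myUser", "myPassword")      # placeholder credentials
EXTENSIONS = (".gz", ".pdf", ".bam", ".vcf", ".csv", ".txt", ".zip", ".html")

index = requests.get(BASE_URL, auth=AUTH, timeout=60)
index.raise_for_status()

soup = BeautifulSoup(index.text, "html.parser")
for link in soup.find_all("a", href=True):
    href = link["href"]
    if href.endswith(EXTENSIONS):
        file_url = urljoin(BASE_URL, href)
        data = requests.get(file_url, auth=AUTH, timeout=60)
        data.raise_for_status()
        with open(os.path.basename(href), "wb") as f:
            f.write(data.content)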

If you need a progress bar, you can use another library called tqdm in conjunction with Requests. See Python progress bar and downloads.
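A minimal sketch of that combination, again with placeholder URL and credentials, and assuming the server reports a Content-Length header (otherwise the bar simply has no total):

import requests
from tqdm import tqdm  # third-party: pip install tqdm

url = "https://url/to/file"  # placeholder
resp = requests.get(url, auth=("myUser", "myPassword"), stream=True, timeout=60)
resp.raise_for_status()

total = int(resp.headers.get("Content-Length", 0))  # 0 if the server doesn't report a size
with open("file", "wb") as f, tqdm(total=total, unit="B", unit_scale=True) as bar:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)
        bar.update(len(chunk))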

If the files you're downloading are large, here is an answer I wrote showing how to get the best performance (as fast as wget): https://stackoverflow.com/a/39217788/4323.
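The short version is to stream the response body straight to disk instead of holding it in memory. A hedged sketch, with the same placeholder URL and credentials:

import shutil

import requests

url = "https://url/to/file"  # placeholder
with requests.get(url, auth=("myUser", "myPassword"), stream=True, timeout=60) as resp:
    resp.raise_for_status()
    resp.raw.decode_content = True  # transparently handle gzip/deflate content encoding
    with open("file", "wb") as f:
        shutil.copyfileobj(resp.raw, f)  # copy in chunks, so memory use stays flat even for huge files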

John Zwinck
  • Oh excellent! Thank you so much for these helpful links! And good catch on --recursive - that was a holdover from the "download everything in folder" version of the wget command but I suppose in the scheme I propose I won't need that anymore. – Kristin M Oct 12 '20 at 03:30