
I am writing a Python script to download all files in a directory.

Example indir:

https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02

This directory path is generated programmatically inside my loop, for project-specific reasons:

import os
from datetime import timedelta

tmptime = stime
while tmptime < etime:
    tmptime = tmptime + timedelta(days=1)  # advance the timestamp one day at a time
    tmppath = os.path.join(str(tmptime.year), tmptime.strftime("%m"), tmptime.strftime("%d"))
    indirtmp = os.path.join(indir, tmppath)     # per-day source URL
    outdirtmp = os.path.join(outdir, tmppath)   # per-day output path; keep `outdir` as the fixed base

Now, how can I download all the files at that link and move them into the output directory I create in my script? I am okay with using a library or offloading the work to a Linux process.

I will basically be doing this for every day across 20 years of data.

  • Does this answer your question? [Using wget to recursively fetch a directory with arbitrary files in it](https://stackoverflow.com/questions/273743/using-wget-to-recursively-fetch-a-directory-with-arbitrary-files-in-it) – jthulhu Mar 15 '22 at 18:03
  • "I will basically be doing this for 20 years every day.": depending on the amount of data transferred, it sounds like this could be something to consult with the sys-admins on the ucsb side. – 9769953 Mar 15 '22 at 18:05
  • @BlackBeans No. I need to generate these directories as I go, due to the specific requirements of my project. I am okay with wget-ing inside the Python script. – Obsidian Order Mar 15 '22 at 18:37

2 Answers


I suggest the following wget command to download a superset of the files you need:

wget --force-html --base=https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02/ -i https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02/

Explanation: I used the -i option with an external file; --force-html prompts GNU Wget to look for links inside the file it points at; --base=https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02/ is required because the index file uses relative links. Note that this will download all referenced files, so you might need to remove non-TIFF files after the download finishes. Files are saved in the current working directory.
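
If you want to drive that command from the question's loop, here is a minimal sketch (assuming indirtmp is the per-day URL and outdirtmp the matching local directory from the question's loop, and adding wget's -P option so the files land there instead of the current working directory):

import subprocess

url = indirtmp.rstrip("/") + "/"  # --base needs a trailing slash so relative links resolve correctly
subprocess.run(
    ["wget", "--force-html", f"--base={url}", "-i", url, "-P", outdirtmp],
    check=True,  # raise CalledProcessError if wget exits with a non-zero status
)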

Daweo

Since you say you're okay with shelling out to a program, you can spare yourself the trouble of parsing that index HTML by using wget's mirror mode:

import os
import shlex
from datetime import timedelta

tmptime = stime
while tmptime < etime:
    tmptime = tmptime + timedelta(days=1)  # advance the timestamp one day at a time
    tmppath = os.path.join(str(tmptime.year), tmptime.strftime("%m"), tmptime.strftime("%d"))
    indirtmp = os.path.join(indir, tmppath)     # assumes `indir` is the internet URL for the data
    outdirtmp = os.path.join(outdir, tmppath)   # keep `outdir` itself as the fixed base directory

    # -m mirror, -np don't ascend to the parent directory, -nd don't recreate the remote
    # directory structure locally, -P save everything under the per-day output directory
    os.system(shlex.join(["wget", "-m", "-np", "-nd", "-P", outdirtmp, indirtmp]))
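
If you would rather have a failed download raise an exception instead of passing silently, subprocess.run with check=True is a drop-in alternative to os.system (a sketch, not part of the original answer):

import subprocess

# Same wget invocation, but check=True raises CalledProcessError on a non-zero exit status.
subprocess.run(["wget", "-m", "-np", "-nd", "-P", outdirtmp, indirtmp], check=True)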

AKX
  • This doesn't seem to work. I get 2 files named 31 and robots.txt – Obsidian Order Mar 15 '22 at 18:36
  • Well, if you peek in that robots.txt, it's probably telling you that you shouldn't be automatically crawling the site, and wget respects that by default. – AKX Mar 15 '22 at 18:51
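
If the data provider allows automated mirroring (worth checking, as an earlier comment suggests), wget's robots.txt handling can be switched off per invocation with -e robots=off; a sketch of the adjusted call from the answer above:

# Same mirror command, but tell wget to ignore robots.txt for this run.
os.system(shlex.join(["wget", "-m", "-np", "-nd", "-e", "robots=off", "-P", outdirtmp, indirtmp]))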