
I am writing a Python script to download all files in a directory.

Example indir:

https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02

This directory path is generated programmatically inside my loop, for project-specific reasons:

import os
from datetime import timedelta

tmptime = stime
while tmptime < etime:
    tmptime = tmptime + timedelta(days=1)  # advance the timestamp one day at a time
    tmppath = os.path.join(str(tmptime.year), tmptime.strftime("%m"), tmptime.strftime("%d"))
    indirtmp = os.path.join(indir, tmppath)     # per-day source URL
    outdirtmp = os.path.join(outdir, tmppath)   # per-day output path; keep `outdir` as the fixed base

Now, how can I download all the files at that link and move them into the output directory I create in my script? I am okay with using a library or offloading the work to a Linux process.

I will basically be doing this for every day across 20 years of data.

  • Does this answer your question? [Using wget to recursively fetch a directory with arbitrary files in it](https://stackoverflow.com/questions/273743/using-wget-to-recursively-fetch-a-directory-with-arbitrary-files-in-it) – jthulhu Mar 15 '22 at 18:03
  • "I will basically be doing this for 20 years every day.": depending on the amount of data transferred, it sounds like this could be something to consult with the sys-admins on the ucsb side. – 9769953 Mar 15 '22 at 18:05
  • @BlackBeans No. I need to generate these directories as I go, due to the specific requirements of my project. I am okay with wget-ing inside the Python script. – Obsidian Order Mar 15 '22 at 18:37

2 Answers


I suggest the following wget command to download a superset of the files you need:

wget --force-html --base=https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02/ -i https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02/

Explanation: I used the -i option with an external file; --force-html prompts GNU Wget to look for links inside the file it points at; --base=https://data.chc.ucsb.edu/products/EWX/data/forecasts/CHIRPS-GEFS_precip_v12/daily_16day/2016/01/02/ is required because the index file uses relative links. Note that this will download all referenced files, so you might need to remove non-TIFF files after the download finishes. Files are saved in the current working directory.
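
If you want to drive that command from the question's loop, here is a minimal sketch (assuming indirtmp is the per-day URL and outdirtmp the matching local directory from the question's loop, and adding wget's -P option so the files land there instead of the current working directory):

import subprocess

url = indirtmp.rstrip("/") + "/"  # --base needs a trailing slash so relative links resolve correctly
subprocess.run(
    ["wget", "--force-html", f"--base={url}", "-i", url, "-P", outdirtmp],
    check=True,  # raise CalledProcessError if wget exits with a non-zero status
)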

Daweo

Since you say you're okay with shelling out to a program, you can spare yourself the trouble of parsing that index HTML by using wget's mirror mode:

import os
import shlex
from datetime import timedelta

tmptime = stime
while tmptime < etime:
    tmptime = tmptime + timedelta(days=1)  # advance the timestamp one day at a time
    tmppath = os.path.join(str(tmptime.year), tmptime.strftime("%m"), tmptime.strftime("%d"))
    indirtmp = os.path.join(indir, tmppath)     # assumes `indir` is the internet URL for the data
    outdirtmp = os.path.join(outdir, tmppath)   # keep `outdir` itself as the fixed base directory

    # -m mirror, -np don't ascend to the parent directory, -nd don't recreate the remote
    # directory structure locally, -P save everything under the per-day output directory
    os.system(shlex.join(["wget", "-m", "-np", "-nd", "-P", outdirtmp, indirtmp]))
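
If you would rather have a failed download raise an exception instead of passing silently, subprocess.run with check=True is a drop-in alternative to os.system (a sketch, not part of the original answer):

import subprocess

# Same wget invocation, but check=True raises CalledProcessError on a non-zero exit status.
subprocess.run(["wget", "-m", "-np", "-nd", "-P", outdirtmp, indirtmp], check=True)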

AKX
  • This doesn't seem to work. I get 2 files named 31 and robots.txt – Obsidian Order Mar 15 '22 at 18:36
  • Well, if you peek in that robots.txt, it's probably telling you that you shouldn't be automatically crawling the site, and wget respects that by default. – AKX Mar 15 '22 at 18:51
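
If the data provider allows automated mirroring (worth checking, as an earlier comment suggests), wget's robots.txt handling can be switched off per invocation with -e robots=off; a sketch of the adjusted call from the answer above:

# Same mirror command, but tell wget to ignore robots.txt for this run.
os.system(shlex.join(["wget", "-m", "-np", "-nd", "-e", "robots=off", "-P", outdirtmp, indirtmp]))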