I'm writing an application that needs persistent local access to large files fetched via HTTP. I want the files saved in a local directory (a partial mirror of sorts), so that subsequent runs of the application simply notice that the URLs have already been mirrored locally, and so that other programs can use the downloaded files.
Ideally it would also preserve timestamp or ETag information and be able to make a quick HTTP request with an If-Modified-Since or If-None-Match header, so that it checks for a new version but avoids a full download unless the file has actually been updated. However, since these pages rarely change, I can probably live with bugs from stale copies and just find other ways to delete files from the cache when appropriate.
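For concreteness, this is roughly the conditional-request bookkeeping I'm hoping a library would do for me; it's only a sketch with plain requests, and the cache directory, metadata file, and URL layout are all placeholders of my own invention:

```python
import json
import os

import requests

CACHE_DIR = "mirror"                       # placeholder local mirror directory
META_FILE = os.path.join(CACHE_DIR, "metadata.json")

def fetch(url, local_name):
    """Fetch url into CACHE_DIR/local_name, using a conditional GET
    (If-None-Match / If-Modified-Since) when a copy already exists."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    meta = {}
    if os.path.exists(META_FILE):
        with open(META_FILE) as f:
            meta = json.load(f)

    # send validators from the previous fetch, if we have any
    headers = {}
    entry = meta.get(url, {})
    if "etag" in entry:
        headers["If-None-Match"] = entry["etag"]
    if "last_modified" in entry:
        headers["If-Modified-Since"] = entry["last_modified"]

    resp = requests.get(url, headers=headers, stream=True)
    path = os.path.join(CACHE_DIR, local_name)
    if resp.status_code == 304:
        return path                        # cached copy is still current
    resp.raise_for_status()

    # stream the body to the mirror directory
    with open(path, "wb") as f:
        for chunk in resp.iter_content(chunk_size=65536):
            f.write(chunk)

    # remember the validators for next time, skipping absent headers
    validators = {
        "etag": resp.headers.get("ETag"),
        "last_modified": resp.headers.get("Last-Modified"),
    }
    meta[url] = {k: v for k, v in validators.items() if v}
    with open(META_FILE, "w") as f:
        json.dump(meta, f)
    return path
```

Managing the metadata file, the filename mapping, and the validator headers by hand is exactly the part I'd rather delegate to a library.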
Looking around, I see that urllib.request.urlretrieve can download a URL to a local file, but it looks like it can't handle my If-Modified-Since cache-updating goal.
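In other words, something like the following covers the "save to disk" half but always re-downloads, with no conditional revalidation (the URL and filename are just placeholders):

```python
import os
import urllib.request

os.makedirs("mirror", exist_ok=True)

# downloads the resource unconditionally every time; there is no
# built-in If-Modified-Since / If-None-Match support here
local_path, headers = urllib.request.urlretrieve(
    "https://example.com/big-file.dat", "mirror/big-file.dat"
)
print(local_path, headers.get("Content-Type"))
```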
The requests module seems like the latest and greatest, but on its own it doesn't cover this case. There is a CacheControl add-on module that supports my cache-updating goal, since it does full HTTP caching; however, it doesn't store the fetched files in a form that other (non-Python) programs can use directly, because its FileCache stores the resources as pickled data. The discussion at "can python-requests fetch url directly to file handle on disk like curl?" on Stack Overflow suggests that saving to a local file can be done with a bit of extra code, but that approach doesn't seem to combine well with the CacheControl module.
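To illustrate the mismatch, here is the kind of thing I've tried (the cache directory, URL, and output filename are placeholders). CacheControl handles the conditional requests transparently, but the cached bodies live in its own on-disk format under .web_cache, and writing the body out to a plain file myself happens outside anything the cache manages:

```python
import requests
from cachecontrol import CacheControl
from cachecontrol.caches.file_cache import FileCache

# transparent HTTP caching: revalidation is handled for me, but the
# cached responses are stored in CacheControl's own serialized format
session = CacheControl(requests.Session(), cache=FileCache(".web_cache"))
resp = session.get("https://example.com/big-file.dat")

# to get a file that other (non-Python) programs can read, I still
# have to write it out myself, and the cache knows nothing about it
with open("big-file.dat", "wb") as f:
    f.write(resp.content)
```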
So is there a web-fetching library that does what I want? One that can essentially maintain a mirror of the files it has fetched in the past (and tell me what the local filenames are), without my having to manage all of that explicitly?