21

What is the standard pythonic way to download a new file from a server only if the server copy is newer than the local one?

Either my python-search-fu is very weak today, or one really does need to roll their own date-time parser and comparer like below. Is there really no requests.header.get_datetime_object('last-modified')? Or request.save_to_file(url, outfile, maintain_datetime=True)?

import datetime
import os

import requests


def time_is_older(str_time, time_object):
    ''' Parse str_time and see if it is older than time_object.
        This is a fragile function: what if str_time is in a different locale?
    '''
    parsed_time = datetime.datetime.strptime(str_time,
        # Fri, 27 Mar 2015 08:05:42 GMT
        '%a, %d %b %Y %X %Z')
    return parsed_time < time_object


def download_if_newer(url, dstFile):
    # wrapped in a function so that the early return below is valid
    r = requests.head(url)
    url_time = r.headers['last-modified']
    file_time = datetime.datetime.fromtimestamp(os.path.getmtime(dstFile))
    print(url_time)   # emits 'Sat, 28 Mar 2015 08:05:42 GMT' on my machine
    print(file_time)  # emits '2015-03-27 21:53:28.175072'

    if time_is_older(url_time, file_time):
        print('url modtime is not newer than local file, skipping download')
        return
    else:
        do_download(url)  # placeholder for the actual download
        # intent: maintain server's file timestamp (os.utime actually needs
        # an (atime, mtime) tuple of numbers, not the header string)
        os.utime(dstFile, url_time)
matt wilkie
  • You can compare `time.time()` values (time in seconds since the epoch). – boardrider Mar 29 '15 at 14:14
  • @user1656850 yes. My concern with that, perhaps not expressed well enough, is that I would still be creating my own string-to-time-object parser parameters in order to use time.time(), and that since I would be writing my own I'd be pretty much guaranteed to have bugs or omissions. My Q is trying to find the idiomatic solution. I'm sure there must be one for this common scenario, I just haven't figured out how to find it! – matt wilkie Mar 29 '15 at 21:40
  • This falls short of a full answer, but you will generally want to use the HTTP If-Modified-Since header and have the server only send the data if it's more recent. – Paul Norman Dec 10 '16 at 10:34
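
A minimal sketch of the If-Modified-Since approach mentioned in the last comment, assuming the server honours conditional requests (a server that ignores the header simply answers 200 and the file is downloaded again). The names http_date and download_if_modified are hypothetical, and session is assumed to be a requests.Session or anything with a compatible get():

```python
import os
from email.utils import formatdate


def http_date(timestamp):
    """Format a POSIX timestamp as an HTTP-date string in GMT."""
    return formatdate(timestamp, usegmt=True)


def download_if_modified(session, url, dst_file):
    """Fetch url into dst_file unless the server answers 304 Not Modified.

    `session` is assumed to be a requests.Session. Returns True if the
    file was (re)written, False if the local copy was already current.
    """
    headers = {}
    if os.path.exists(dst_file):
        # Ask the server to send the body only if it changed after our copy.
        headers['If-Modified-Since'] = http_date(os.path.getmtime(dst_file))
    r = session.get(url, headers=headers, stream=True)
    if r.status_code == 304:
        return False  # nothing newer on the server, nothing downloaded
    r.raise_for_status()
    with open(dst_file, 'wb') as fd:
        for chunk in r.iter_content(4096):
            fd.write(chunk)
    return True
```

With requests installed, this would be called as download_if_modified(requests.Session(), url, dstFile). The comparison then happens on the server, so no date parsing is needed on the client side.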

2 Answers

16
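import datetime
import os

import requests
from dateutil.parser import parse as parsedate

r = requests.head(url)
url_time = r.headers['last-modified']
url_date = parsedate(url_time)  # timezone-aware, since the header ends in 'GMT'
# Make the local mtime timezone-aware as well; comparing a naive datetime
# with an aware one raises TypeError.
file_time = datetime.datetime.fromtimestamp(os.path.getmtime(dstFile), tz=datetime.timezone.utc)
if url_date > file_time:
    pass  # server copy is newer: download it here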
Anthony Labarre
Sérgio
  • The weird thing is that I get the Last-Modified header only when I am using a GET request; there is no Last-Modified in the HEAD request. – Mercury Jul 07 '19 at 19:43
  • I'm not seeing `last-modified` (or `Last-Modified`) in the headers via `head` or via `get`. There's a `Date:` but that updates every time I download. How does this work? (My URL in question is a Google Doc "Anyone can view" sharing URL.) – sh37211 Oct 12 '21 at 02:54
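
As the two comments above note, Last-Modified is not guaranteed to be present. A hedged sketch of a lookup that tries HEAD first and falls back to a streamed GET follows. The function name find_last_modified is hypothetical, and session is assumed to be a requests.Session. A return value of None means the server gave no timestamp at all, in which case a date comparison is not possible for that URL:

```python
def find_last_modified(session, url):
    """Return the raw Last-Modified header value for url, or None.

    `session` is assumed to be a requests.Session.
    """
    r = session.head(url, allow_redirects=True)
    value = r.headers.get('Last-Modified')
    if value is None:
        # Some servers only send the header on GET. stream=True defers
        # reading the body, and close() releases the connection.
        r = session.get(url, stream=True)
        value = r.headers.get('Last-Modified')
        r.close()
    return value
```

requests stores headers in a case-insensitive mapping, so 'Last-Modified' and 'last-modified' look up the same entry.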
4

I used the following code, which also takes the timezone into account and makes sure both datetime objects are aware.

import datetime
import os

import requests
from dateutil.parser import parse as parsedate

r = requests.head(url)
url_datetime = parsedate(r.headers['Last-Modified']).astimezone()
file_time = datetime.datetime.fromtimestamp(os.path.getmtime(dst_file)).astimezone()
if url_datetime > file_time:
    user_agent = {"User-agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:46.0) Gecko/20100101 Firefox/46.0"}
    r = requests.get(url, headers=user_agent)
    # write the response body to the destination file in 4 KiB chunks
    with open(dst_file, 'wb') as fd:
        for chunk in r.iter_content(4096):
            fd.write(chunk)
Digihash
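
For the two things the question asks for (a datetime object from the header, and keeping the server's timestamp on the local file), the standard library may already be enough. A sketch, assuming Python 3.3+ and a header value ending in 'GMT' as in the question. The function names remote_is_newer and copy_remote_mtime are hypothetical:

```python
import datetime
import os
from email.utils import parsedate_to_datetime


def remote_is_newer(last_modified, local_path):
    """True if the HTTP-date string is later than local_path's mtime.

    A missing local file counts as older than anything on the server.
    """
    if not os.path.exists(local_path):
        return True
    # For dates ending in 'GMT' this yields a timezone-aware datetime.
    remote = parsedate_to_datetime(last_modified)
    local = datetime.datetime.fromtimestamp(
        os.path.getmtime(local_path), tz=datetime.timezone.utc)
    return remote > local


def copy_remote_mtime(last_modified, local_path):
    """Set local_path's access and modification times to the server's."""
    ts = parsedate_to_datetime(last_modified).timestamp()
    os.utime(local_path, (ts, ts))
```

Here last_modified would be r.headers['Last-Modified'] from a requests response. Calling copy_remote_mtime after each download keeps later comparisons exact, because the local mtime then equals the server's timestamp rather than the time of the download.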