Description of code
My script below works fine. It basically just finds all the data files I'm interested in on a given website, checks whether they are already on my computer (and skips them if they are), and then downloads the rest to my computer using cURL.
The problem
The problem I'm having is that sometimes there are 400+ very large files and I can't download them all in one sitting. When I press Ctrl-C it seems to cancel only the current cURL download, not the script, so I end up needing to cancel all the downloads one by one. Is there a way around this? Maybe some kind of key command that would let me stop at the end of the current download? (One idea I had is sketched after the script.)
#!/usr/bin/python
import os
import urllib2
import re
import timeit
filenames = []
savedir = "/Users/someguy/Documents/Research/VLF_Hissler/Data/"
#connect to a URL
website = urllib2.urlopen("http://somewebsite")
#read html code
html = website.read()
#use re.findall to get all the data files
filenames = re.findall('SP.*?\.mat', html)
#The following chunk of code checks to see if the files are already downloaded and deletes them from the download queue if they are.
count = 0
countpass = 0
for files in os.listdir(savedir):
    if files.endswith(".mat"):
        try:
            filenames.remove(files)
            count += 1
        except ValueError:
            countpass += 1
print "counted number of removes", count
print "counted number of failed removes", countpass
print "number files less removed:", len(filenames)
#saves the file names into an array of html link
links=len(filenames)*[0]
for j in range(len(filenames)):
    links[j] = 'http://somewebsite.edu/public_web_junk/southpole/2014/'+filenames[j]
#download each remaining file with cURL
for i in range(len(links)):
    os.system("curl -o " + filenames[i] + " " + str(links[i]))
print "links downloaded:",len(links)