
Description of code

My script below works fine. It basically just finds all the data files I'm interested in on a given website, checks whether they are already on my computer (and skips them if they are), and finally downloads them onto my computer using cURL.

The problem

The problem I'm having is that sometimes there are 400+ very large files and I can't always sit through downloading all of them. When I press Ctrl-C it seems to cancel only the current cURL download, not the script, so I end up needing to cancel all the downloads one by one. Is there a way around this? Maybe some kind of key command that would let the script stop at the end of the current download? (See the sketch after my script below for the sort of behaviour I mean.)

#!/usr/bin/python
import os
import urllib2
import re
import timeit

filenames = []
savedir = "/Users/someguy/Documents/Research/VLF_Hissler/Data/"

#connect to a URL
website = urllib2.urlopen("http://somewebsite")

#read html code
html = website.read()

#use re.findall to get all the data files
filenames = re.findall('SP.*?\.mat', html)

#The following chunk of code checks to see if the files are already downloaded and deletes them from the download queue if they are.
count = 0
countpass = 0
for files in os.listdir(savedir):
   if files.endswith(".mat"):
      try:
         filenames.remove(files)
         count += 1
      except ValueError:
         countpass += 1

print "counted number of removes", count
print "counted number of failed removes", countpass
print "number files less removed:", len(filenames)

#saves the file names into an array of html link
links=len(filenames)*[0]

for j in range(len(filenames)):
   links[j] = 'http://somewebsite.edu/public_web_junk/southpole/2014/'+filenames[j]

for i in range(len(links)):
   os.system("curl -o "+ filenames[i] + " " + str(links[i]))

print "links downloaded:",len(links)
    Read the contents into a variable and write them to a file. You don't have to use curl in this case. Also, you shouldn't use regex to parse HTML. Another tip: use threading to make the downloads simultaneous. – heinst Mar 15 '15 at 02:20
  • 1
    Oh, that's a good idea, and it's kind of obvious that writing to a file would be good. I'm not sure how I would download the files without cURL, though. And why is regex bad for parsing HTML? I'll look into threading as well. – Howard Mar 15 '15 at 02:35
  • 1
    You would open up the url `links[j] = 'http://somewebsite.edu/public_web_junk/southpole/2014/'+filenames[j]`, read its contents `data = str(urlConn.readlines())` and then open a file for writing `f = open(filenames[j], 'w') f.write(data)` – heinst Mar 15 '15 at 02:41
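
For reference, here is a minimal sketch of what heinst describes in the comments, assuming the filenames list from the script above; the chunked read is an addition so a whole file is never held in memory:

import urllib2

base = 'http://somewebsite.edu/public_web_junk/southpole/2014/'

for name in filenames:
   response = urllib2.urlopen(base + name)
   with open(name, 'wb') as f:
      while True:
         chunk = response.read(64 * 1024)  # read 64 KB at a time
         if not chunk:
            break
         f.write(chunk)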

1 Answer


You could always check the file size using curl before downloading it:

import subprocess, sys

def get_file_size(url):
    """
    Gets the file size of a URL using curl.

    @param url: The URL to obtain information about.

    @return: The file size, as an integer, in bytes.
    """

    # Ask curl for only the response headers and parse out Content-Length
    file_size = 0  # fall back to 0 if the header is missing
    p = subprocess.Popen(('curl', '-sI', url), stdout=subprocess.PIPE)
    for s in p.stdout.readlines():
        if 'Content-Length' in s:
            file_size = int(s.strip().split()[-1])
    return file_size

# Your configuration parameters
url      = ... # URL that you want to download
max_size = ... # Max file size in bytes

# Now you can do a simple check to see if the file size is too big
if get_file_size(url) > max_size:
    sys.exit()

# Or you could do something more advanced
bytes = get_file_size(url)
if bytes > max_size:
    s = raw_input('File is {0} bytes. Do you wish to download? '
        '(yes, no) '.format(bytes))
    if s.lower() != 'yes':
        sys.exit()
    # Add download code here....
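
For example, the size check might slot into the download loop from the question like this (illustrative only; max_size is whatever limit you choose):

for i in range(len(links)):
    if get_file_size(links[i]) > max_size:
        print "Skipping", filenames[i], "- larger than", max_size, "bytes"
        continue
    os.system("curl -o " + filenames[i] + " " + links[i])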
James Mnatzaganian
  • 1,255
  • 1
  • 17
  • 32