
I'm downloading multiple SMI files from a database called ZINC using some rather simple code I wrote. However, it seems slow considering the size of the files (a few kB each) and my internet connection. Is there a way to speed it up?

import urllib2


def job(url):
    ''' Open the URL and append the downloaded SMI file from ZINC15 to output.smi '''

    u = urllib2.urlopen(url) # Open URL
    print 'downloading ' + url # Print which file is being downloaded
    with open('output.smi', 'a') as local_file:
        local_file.write(u.read())


with open('data.csv') as flist:
    urls = ['http://zinc15.docking.org/substances/{}.smi'.format(str(line.rstrip())) for line in flist]
    map(job, urls)
Anonta
Marcos Santana
  • Use [multithreading to download the files in parallel](https://stackoverflow.com/a/16182076/984421). – ekhumoro Sep 24 '17 at 13:39
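
One way to read this suggestion, as a minimal sketch rather than the linked answer's exact code: fetch the URLs in a thread pool and write the results from the main thread only, so the output file is never touched by two threads at once. The names `fetch`, `download_parallel`, `workers` and `fetch_fn` are made up for illustration; `ThreadPool` and `imap_unordered` come from the standard library's `multiprocessing.pool`.

```python
from multiprocessing.pool import ThreadPool

try:
    from urllib2 import urlopen          # Python 2
except ImportError:
    from urllib.request import urlopen   # Python 3


def fetch(url):
    """Download one URL and return (url, body)."""
    return url, urlopen(url).read()


def download_parallel(urls, out_path, workers=10, fetch_fn=fetch):
    """Fetch URLs in a thread pool; only the calling thread writes to disk."""
    pool = ThreadPool(workers)
    try:
        with open(out_path, 'ab') as local_file:
            # imap_unordered yields each result as soon as it is ready,
            # so the order in the file may differ from the order in `urls`.
            for url, body in pool.imap_unordered(fetch_fn, urls):
                local_file.write(body)
    finally:
        pool.close()
        pool.join()
```

With the question's list, this would be called as `download_parallel(urls, 'output.smi')`. The `fetch_fn` parameter exists only so the function can be exercised without network access.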

1 Answer

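import threading
import Queue # Python 2 module name; it is called queue in Python 3

MAX_THREADS = 10
urls = Queue.Queue()

def downloadFile():
    # Keep taking URLs until the queue is drained, then let the thread end
    while True:
        try:
            u = urls.get_nowait()
        except Queue.Empty:
            break # another thread took the last URL
        job(u)


# Fill the queue before starting the worker threads
for url in your_url_list:
    urls.put(url)

for i in range(MAX_THREADS):
    t = threading.Thread(target=downloadFile)
    t.start()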

Basically, it imports the threading and Queue modules. The Queue object holds the data shared across the threads (here, the URLs), and each thread executes the downloadFile() function.

I hope this is easy to understand; if not, let me know.
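
Two details are left open by the snippet: the main program never waits for the threads to finish, and job() appends to the same output.smi from every thread, so writes could in principle interleave. A hedged sketch of one way to handle both, assuming a lock around the file write and a join() on every thread (the names `worker`, `download_all`, `write_lock` and `fetch_fn` are illustrative, not part of the original code):

```python
import threading

try:
    import Queue as queue                # Python 2
    from urllib2 import urlopen
except ImportError:
    import queue                         # Python 3
    from urllib.request import urlopen

write_lock = threading.Lock()            # serializes appends to the output file


def fetch(url):
    """Return the body of `url` as bytes."""
    return urlopen(url).read()


def worker(url_queue, out_path, fetch_fn):
    """Take URLs until the queue is empty, appending each body to out_path."""
    while True:
        try:
            url = url_queue.get_nowait()
        except queue.Empty:
            return                       # nothing left, thread ends
        data = fetch_fn(url)             # network I/O happens outside the lock
        with write_lock:
            with open(out_path, 'ab') as local_file:
                local_file.write(data)


def download_all(url_list, out_path, max_threads=10, fetch_fn=fetch):
    """Queue every URL, start the workers, and wait for all of them."""
    url_queue = queue.Queue()
    for url in url_list:
        url_queue.put(url)
    threads = [threading.Thread(target=worker,
                                args=(url_queue, out_path, fetch_fn))
               for _ in range(max_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                         # block until every download is written
```

Because the download runs outside the lock, the threads still fetch in parallel; only the short file append is serialized.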

Marcos Santana
Melardev
  • Can you explain what the downloadFile() function is doing? – Marcos Santana Sep 24 '17 at 14:58
  • The queue module is named Queue in Python 2, the version that the OP uses. - [From review](https://stackoverflow.com/review/suggested-edits/17430118) – Taku Sep 24 '17 at 15:54
  • downloadFile() checks the queue object to see if it has something to be processed (the URLs to fetch, in our example). If it is not empty, it takes a URL from the queue and passes it to job(), which is your function that downloads a file. If the queue is empty, it does nothing, which terminates the thread running downloadFile(). – Melardev Sep 24 '17 at 17:28
  • Very helpful. I modified my code using this and went from 100kb/s to 4~10m/s (my network is pretty good, youtube-dl can get about 7~15m/s) – NamNamNam Nov 30 '18 at 09:08