
So I have many files to download from a server, over 1000 of them. I thought I'd write a multithreaded script so I don't have to wait ages for it to finish. The problem is that it spits out a bunch of errors. I've searched for this but couldn't really find anything related to the error I'm getting, since I don't print any output in my other threads.

My plan was to have the threads chain-start each other so that no file gets downloaded twice and no file gets skipped.

thanks for any help!

import thread
import urllib2

mylist = [list of filenames in here]

mycount = 0

def download():
    global mycount
    url = "http://myserver.com/content/files/" + mylist[mycount]
    myfile = urllib2.urlopen(url)
    with open(mylist[mycount],'wb') as output:
        output.write(myfile.read())


def chain():
    global mycount
    if mycount <= len(mylist) - 1:
        thread.start_new_thread( download, ())
        mycount = mycount + 1
        chain()

chain()
Thomja
  • Can you post an error? Python recursion depth may be 1000. Put the thread generator in a for loop instead. – tdelaney Aug 14 '16 at 14:51
  • Btw your code still won't work. By the time the thread runs the master thread will have incremented mycount many times. You should pass the file name to be processed to the thread. Consider using a thread pool instead. – tdelaney Aug 14 '16 at 14:56
  • @tdelaney Thanks, will change to a for loop and try, also here's the error: http://puu.sh/qB2IW/42b809c5de.png You mean as a tuple variable argument for the thread start function? Also I have never worked with thread pools before. – Thomja Aug 14 '16 at 15:09
  • Changed it to a for loop instead still didn't work, command line spits out the same messages: http://puu.sh/qB336/153b501a22.png – Thomja Aug 14 '16 at 15:13
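
For reference, a minimal sketch of the fix tdelaney describes in the comments above: a plain for loop that hands each filename to its own thread, so there is no shared counter to race on. This is untested, and it still starts one thread per file, so it can hit the same resource limits; the answer below uses a pool instead.

import thread
import urllib2

def download(filename):
    # each thread gets its own filename, so threads never fight over a shared index
    url = "http://myserver.com/content/files/" + filename
    try:
        myfile = urllib2.urlopen(url)
        with open(filename, 'wb') as output:
            output.write(myfile.read())
    except urllib2.URLError:
        print "Could not open URL: " + url

for filename in mylist:
    # still one thread per file; a thread pool (see the answer below) scales better
    thread.start_new_thread(download, (filename,))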

1 Answer


So I actually managed to make it work after some more googling and reading.

current code:

import urllib2

# build a list of indexes, one per filename in mylist
mylistLoc = []
for index in range(len(mylist)):
    mylistLoc.append(index)

def download(mycount):
    url = "http://myserver.com/content/zips/" + mylist[mycount]
    try:
        myfile = urllib2.urlopen(url)
        with open(mylist[mycount], 'wb') as output:
            output.write(myfile.read())
    except:
        # skip this file if the URL can't be opened
        print "Could not open URL: " + url

from multiprocessing.dummy import Pool as ThreadPool
pool = ThreadPool(50)              # pool of 50 worker threads
pool.map(download, mylistLoc)      # calls download(i) for every index

First of all, for any newbies reading this: the try: statement has nothing to do with the multithreading; it just skips files whose URL fails to open.

I found this method in another Stack Overflow question that explains it a bit better:

How to use threading in Python?

Basically, if I understand this correctly, you first import the thread pool with

from multiprocessing.dummy import Pool as ThreadPool

Then, when you create the pool, you set how many threads you want active. For me 50 was good, but make sure to read up on this, as having too many threads can cause problems.

Then the juice:

pool.map(download, mylistLoc)

Basically, pool.map calls download once for every number in mylistLoc, spreading those calls across the worker threads.

I think it's a weird system but what do I know. It works. Also, I'm a newbie myself.
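
If you don't want to build the list of indexes at all, here is a sketch of the same idea that maps over the filenames directly (same server path as above; untested):

import urllib2
from multiprocessing.dummy import Pool as ThreadPool

def download(filename):
    url = "http://myserver.com/content/zips/" + filename
    try:
        myfile = urllib2.urlopen(url)
        with open(filename, 'wb') as output:
            output.write(myfile.read())
    except urllib2.URLError:
        print "Could not open URL: " + url

pool = ThreadPool(50)         # 50 worker threads, same as above
pool.map(download, mylist)    # one download() call per filename
pool.close()
pool.join()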

Thomja
  • Your original code used 1000 threads and likely went over some resource limit. The optimum number of threads for a pool depends on many factors - you can experiment with different numbers to see what works best. Consider setting chunk size to 1 on the map call. Otherwise workers get many urls to work with in one chunk. With a small chunk size, workers grabbing faster urls can process more of them and you finish faster. – tdelaney Aug 15 '16 at 14:34
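
For reference, the chunk size tweak tdelaney describes would look something like this (a sketch, reusing the download function and mylistLoc from the answer):

pool = ThreadPool(50)
# chunksize=1 hands workers one item at a time, so a thread that finishes a
# fast url can immediately grab the next one instead of sitting on a big chunk
pool.map(download, mylistLoc, chunksize=1)
pool.close()
pool.join()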