Short version:
```python
import concurrent.futures

with concurrent.futures.ThreadPoolExecutor(max_workers=12) as executor:
    for index, d in enumerate(found):
        # submit one download job per URL; the pool caps concurrency at 12
        executor.submit(download, found[d], d, index)
```
That's it: a trivial change, two lines shorter than your existing code, and you're done.
So, what's wrong with your existing code? Starting 1000 threads at a time is always a bad idea.* Once you get beyond a few dozen, you're adding more scheduler and context-switching overhead than you're gaining in concurrency.
If you want to know why it fails right around 1000, it could be because of a library working around older versions of Windows,** or because you're running out of stack space.*** But either way, it doesn't really matter; the right solution is not to use so many threads.
The usual solution is to use a thread pool: start about 8-12 threads**** and have them pull the URLs to download off a queue. You can build this yourself (see the sketch below), or you can use the `concurrent.futures.ThreadPoolExecutor` or `multiprocessing.dummy.Pool` that come with the stdlib. If you look at the main `ThreadPoolExecutor` example in the docs, it's doing almost exactly what you want. In fact, what you want is even simpler, because you don't care about the results.
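If you did want to build it yourself, a minimal sketch might look like this (assuming the same `found` dict and `download` function from your code):

```python
import queue
import threading

NUM_WORKERS = 12  # rule-of-thumb pool size; tune for your workload

def worker(q):
    # Pull (url, name, index) jobs off the queue until the None sentinel arrives.
    while True:
        job = q.get()
        if job is None:
            break
        url, name, index = job
        download(url, name, index)  # `download` is your existing function

q = queue.Queue()
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for index, d in enumerate(found):  # `found` is your existing dict
    q.put((found[d], d, index))
for _ in threads:
    q.put(None)  # one sentinel per worker so each one shuts down
for t in threads:
    t.join()
```

As you can see, `ThreadPoolExecutor` saves you all of that bookkeeping.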
As a side note, you've got another serious problem in your code. If you daemonize your threads, you're not allowed to `join` them. Also, you're only trying to `join` the last one you created, which is by no means guaranteed to be the last one to finish. Also, daemonizing download threads is probably a bad idea in the first place, because when your main thread finishes (after waiting for one arbitrarily-chosen download to finish) the others may get interrupted and leave partial files behind.
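For comparison, the minimal correct way to wait on hand-rolled, non-daemon threads is to keep all of them and `join` all of them (a sketch, again assuming your `download` and `found`; note it still has the too-many-threads problem the pool solves):

```python
import threading

threads = []
for index, d in enumerate(found):
    t = threading.Thread(target=download, args=(found[d], d, index))
    t.start()
    threads.append(t)

# Wait for every download, not just the last thread you happened to create.
for t in threads:
    t.join()
```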
Also, if you do want to daemonize a thread, the best way is to pass `daemon=True` to the constructor. If you need to do it after creation, just do `t.daemon = True`. Only call the deprecated `setDaemon` function if you need backward compatibility to Python 2.5.
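For illustration (with a hypothetical `work` function, not from your code):

```python
import threading

def work():
    pass  # placeholder

t = threading.Thread(target=work, daemon=True)  # best: constructor argument
t2 = threading.Thread(target=work)
t2.daemon = True                                # fine: set after creation
# t2.setDaemon(True)                            # deprecated; Python 2.5 only
```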
* I guess I shouldn't say always: in 2025 it'll probably be an everyday thing to do, to take advantage of your thousands of slow cores. But in 2014, on normal laptop/desktop/server hardware, it's always bad.
** Older versions of Windows (at least NT 4) had all kinds of bizarre errors when you got close to 1024 threads, so many threading libraries just refuse to create more than 1000 threads. Although that doesn't seem to be the case here, as Python is just calling Microsoft's own wrapper function `_beginthreadex`, which doesn't do that.
*** By default, each thread gets 1MB of stack space. And in 32-bit apps, there's a maximum total stack space, which I'd assume defaults to 1GB on your version of Windows. You can customize either the stack space for each thread or the total process stack space, but Python doesn't customize either, nor do almost any other apps.
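(If you really did need huge numbers of threads, Python does expose the per-thread stack size via `threading.stack_size`; a sketch, with the 512 KiB figure chosen purely for illustration:)

```python
import threading

# Must be called before the threads are created. The minimum is 32 KiB,
# and some platforms require a multiple of 4 KiB.
threading.stack_size(512 * 1024)

t = threading.Thread(target=print, args=("running with a smaller stack",))
t.start()
t.join()
```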
**** Unless your downloads are all coming off the same server, in which case you probably want at most 4, and really more than 2 is usually considered impolite if it's not your server. And why 8-12 anyway? It was a rule of thumb that tested well a long time ago. It's probably not optimal anymore, but it's probably close enough for most uses. If you really need to squeeze out a bit more performance, you can test with different numbers.
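If you do decide to test, a crude timing harness might look like this (assuming the same `found` and `download` as above; repeat each run a few times, since network variance is large):

```python
import concurrent.futures
import time

for workers in (2, 4, 8, 12, 16):
    start = time.monotonic()
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
        for index, d in enumerate(found):
            executor.submit(download, found[d], d, index)
    # The with-block waits for all submitted downloads to finish.
    print("{} workers: {:.1f}s".format(workers, time.monotonic() - start))
```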