20

I'm looking for a Python library or a command-line tool for downloading multiple files in parallel. My current solution is to download the files sequentially, which is slow. I know you can easily write a half-assed threaded solution in Python, but I always run into annoying problems when using threading. It is for polling a large number of XML feeds from websites.

My requirements for the solution are:

  1. Should be interruptable. Ctrl+C should immediately terminate all downloads.
  2. There should be no leftover processes that you have to kill manually using kill, even if the main program crashes or an exception is thrown.
  3. It should work on Linux and Windows too.
  4. It should retry downloads, be resilient against network errors and should timeout properly.
  5. It should be smart about not hammering the same server with 100+ simultaneous downloads, but queue them in a sane way.
  6. It should handle important http status codes like 301, 302 and 304. That means that for each file, it should take the Last-Modified value as input and only download if it has changed since last time.
  7. Preferably it should have a progress bar or it should be easy to write a progress bar for it to monitor the download progress of all files.
  8. Preferably it should take advantage of http keep-alive to maximize the transfer speed.

Please don't suggest how I may go about implementing the above requirements. I'm looking for a ready-made, battle-tested solution.

I guess I should describe what I want it for too... I have about 300 different data feeds as XML-formatted files served from 50 data providers. Each file is between 100 KB and 5 MB in size. I need to poll them frequently (as in once every few minutes) to determine if any of them has new data I need to process. So it is important that the downloader uses HTTP caching to minimize the amount of data to fetch. It should also use gzip compression, obviously.

Then the big problem is how to use the bandwidth as efficiently as possible without overstepping any boundaries. For example, one data provider may consider it abuse if you open 20 simultaneous connections to their data feeds. Instead it may be better to use one or two connections that are reused for multiple files. Or your own connection may be limited in strange ways. My ISP limits the number of DNS lookups you can do, so some kind of DNS caching would be nice.

Deanna
Björn Lindqvist

10 Answers

8

You can try pycurl. The interface is not easy at first, but once you look at the examples it's not hard to understand. I have used it to fetch thousands of web pages in parallel on a meagre Linux box.

  1. You don't have to deal with threads, so it terminates gracefully, and there are no processes left behind.
  2. It provides options for timeouts and HTTP status handling.
  3. It works on both Linux and Windows.

The only problem is that it provides only basic infrastructure (basically just a Python layer above the excellent curl library). You will have to write a few lines of code to get the features you want.
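
For example, a rough sketch along the lines of the CurlMulti example in pycurl's documentation might look like this; the URLs are placeholders, and retries, Last-Modified handling and per-server throttling are left out:

```python
import pycurl
from io import BytesIO

# Placeholder URLs; in practice these would be the feed URLs to poll.
urls = ["http://example.com/feed1.xml", "http://example.com/feed2.xml"]

multi = pycurl.CurlMulti()
handles = []
for url in urls:
    buf = BytesIO()
    c = pycurl.Curl()
    c.setopt(pycurl.URL, url)
    c.setopt(pycurl.WRITEFUNCTION, buf.write)
    c.setopt(pycurl.CONNECTTIMEOUT, 10)
    c.setopt(pycurl.TIMEOUT, 60)
    c.setopt(pycurl.FOLLOWLOCATION, 1)   # follow 301/302 redirects
    c.setopt(pycurl.ENCODING, "gzip")    # request gzip-compressed responses
    c.buf = buf                          # keep the buffer reachable from the handle
    multi.add_handle(c)
    handles.append(c)

# Drive all transfers until every handle has finished.
num_active = len(handles)
while num_active:
    while True:
        ret, num_active = multi.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    multi.select(1.0)                    # wait for network activity

for c in handles:
    print(c.getinfo(pycurl.RESPONSE_CODE), len(c.buf.getvalue()), "bytes")
    multi.remove_handle(c)
    c.close()
multi.close()
```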

Pankaj
5

There are lots of options but it will be hard to find one which fits all your needs.

In your case, try this approach:

  1. Create a queue.
  2. Put URLs to download into this queue (or "config objects" which contain the URL and other data like the user name, the destination file, etc).
  3. Create a pool of threads.
  4. Each thread should try to fetch a URL (or a config object) from the queue and process it.

Use another thread to collect the results (i.e. another queue). When the number of result objects == number of puts in the first queue, then you're finished.

Make sure that all communication goes via the queue or the "config object". Avoid accessing data structures which are shared between threads. This should save you 99% of the problems.
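
A minimal sketch of this layout using only the Python 3 standard library (worker count and URLs are placeholder values; retries and progress reporting are omitted):

```python
import queue
import threading
import urllib.request

NUM_WORKERS = 4                                   # placeholder worker count
urls = ["http://example.com/feed%d.xml" % i for i in range(10)]

work_q = queue.Queue()
result_q = queue.Queue()

def worker():
    while True:
        url = work_q.get()
        if url is None:                           # sentinel: shut this worker down
            break
        try:
            with urllib.request.urlopen(url, timeout=30) as resp:
                result_q.put((url, resp.getcode(), len(resp.read())))
        except Exception as exc:                  # network error, bad status, etc.
            result_q.put((url, None, exc))

threads = [threading.Thread(target=worker, daemon=True) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()
for url in urls:
    work_q.put(url)

# We are finished when we have collected one result per queued URL.
for _ in range(len(urls)):
    print(result_q.get())

for _ in threads:                                 # tell the workers to stop
    work_q.put(None)
```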

Aaron Digulla
  • *Please don't suggest how I may go about implementing the above requirements. I'm looking for a ready-made, battle-tested solution.* Seems you're doing exactly that. – GaretJax Aug 01 '11 at 05:26
  • GaretJax: If you look at the edit history, the answer above was made ten minutes before that sentence was added by the asker. – Peter O. Aug 02 '11 at 18:50
  • @Peter O.: You're right, sorry about that. – GaretJax Aug 02 '11 at 20:28
2

I don't think such a complete library exists, so you'll probably have to write your own. I suggest taking a look at gevent for this task. They even provide a concurrent_download.py example script. Then you can use urllib2 for most of the other requirements, such as handling HTTP status codes and displaying download progress.
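
For illustration, a greenlet-per-URL fetch in the spirit of that example might look roughly like this (placeholder URLs, with Python 3's urllib.request standing in for urllib2):

```python
from gevent import monkey
monkey.patch_all()          # make the blocking socket calls in urllib cooperative

import gevent
import urllib.request

urls = ["http://example.com/feed1.xml", "http://example.com/feed2.xml"]

def fetch(url):
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = resp.read()
        print(url, resp.getcode(), len(data), "bytes")

jobs = [gevent.spawn(fetch, url) for url in urls]
gevent.joinall(jobs, timeout=120)
```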

Gregg
2

I would suggest Twisted. It is not a ready-made solution, but it provides the main building blocks to get every feature you listed in an easy way, and it does not use threads.

If you are interested, take a look at the following links:

As per your requirements:

  1. Supported out of the box
  2. Supported out of the box
  3. Supported out of the box
  4. Timeout supported out of the box, other error handling done through deferreds
  5. Achieved easily using cooperators (example 7)
  6. Supported out of the box
  7. Not supported, but solutions exist (and they are not that hard to implement)
  8. Not supported, it can be implemented (but it will be relatively hard)
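
For illustration only, here is a bare-bones parallel fetch sketched with Twisted's Agent API (a newer API than what this answer originally referred to); the URLs are placeholders and per-server throttling via cooperators is not shown:

```python
from twisted.internet import defer, reactor
from twisted.web.client import Agent, HTTPConnectionPool, readBody

# Placeholder URLs (Agent wants them as bytes).
urls = [b"http://example.com/feed1.xml", b"http://example.com/feed2.xml"]

# A persistent connection pool gives HTTP keep-alive across requests.
pool = HTTPConnectionPool(reactor, persistent=True)
agent = Agent(reactor, connectTimeout=30, pool=pool)

def fetch(url):
    d = agent.request(b"GET", url)
    d.addCallback(readBody)                       # collect the body as bytes
    d.addCallback(lambda body: (url, len(body)))
    return d

def report(results):
    # DeferredList hands us (success, value) pairs, one per URL.
    for success, value in results:
        print(value if success else value.getErrorMessage())
    reactor.stop()

defer.DeferredList([fetch(u) for u in urls], consumeErrors=True).addCallback(report)
reactor.run()
```
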
GaretJax
2

Nowadays there are excellent Python libs you might want to use: urllib3 and requests.
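
As a rough sketch of how requests plus a thread pool could cover several of the requirements: a shared Session reuses connections (HTTP keep-alive), and If-Modified-Since handles the 304 case. The URLs and the in-memory store of Last-Modified values are placeholders:

```python
import concurrent.futures
import requests

# Placeholder URLs and a simple in-memory store of Last-Modified values.
urls = ["http://example.com/feed1.xml", "http://example.com/feed2.xml"]
last_modified = {}

session = requests.Session()        # reuses connections (HTTP keep-alive)

def poll(url):
    headers = {}
    if url in last_modified:
        headers["If-Modified-Since"] = last_modified[url]
    resp = session.get(url, headers=headers, timeout=30)
    if resp.status_code == 304:
        return url, None            # unchanged since the last poll
    resp.raise_for_status()
    last_modified[url] = resp.headers.get("Last-Modified", "")
    return url, resp.content        # requests decompresses gzip transparently

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    for url, body in pool.map(poll, urls):
        print(url, "unchanged" if body is None else "%d bytes" % len(body))
```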

Piotr Dobrogost
1

Try using aria2 through the simple Python subprocess module. It provides everything on your list except 7 out of the box, and 7 is easy to write. aria2c has a nice XML-RPC or JSON-RPC interface for interacting with it from your scripts.
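
A minimal sketch of driving aria2c through subprocess might look like this; the flag values, file names and URLs are only illustrative:

```python
import subprocess

# Placeholder URL list, written to a file that aria2c reads with -i.
urls = ["http://example.com/feed1.xml", "http://example.com/feed2.xml"]
with open("urls.txt", "w") as f:
    f.write("\n".join(urls) + "\n")

# Ctrl+C reaches aria2c too (same foreground process group), so nothing
# is left running if the script is interrupted.
subprocess.check_call([
    "aria2c",
    "-i", "urls.txt",       # read the URL list from the file above
    "-d", "downloads",      # download into this directory
    "-j", "10",             # at most 10 downloads running in parallel overall
    "-x", "2",              # at most 2 connections to one server per download
    "--timeout=30",
    "--max-tries=3",
])
```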

Alex Laskin
0

Does urlgrabber fit your requirements?

http://urlgrabber.baseurl.org/

If it doesn't, you could consider volunteering to help finish it. Contact the authors, Michael Stenner and Ryan Tomayko.

Update: Googling for "parallel wget" yields these, among others:

http://puf.sourceforge.net/

http://www.commandlinefu.com/commands/view/3269/parallel-file-downloading-with-wget

It seems like you have a number of options to choose from.

Shane Hathaway
  • Thanks, but those links fail on 4, 5, 8 and especially 6. The issue for me is not to throw up a number of processes to do downloads, but to handle HTTP caching and have some "global control" system so that the downloads are done in as efficient a way as possible. – Björn Lindqvist Jul 28 '11 at 08:07
  • If I were writing this and all these requirements were non-negotiable (as you seem to suggest), I would almost certainly write my own downloader based on Twisted. – Shane Hathaway Jul 29 '11 at 00:32
0

I used the standard libs for that, urllib.urlretrieve to be precise. I downloaded podcasts this way, via a simple thread pool, each thread using its own urlretrieve. I did about 10 simultaneous connections; more should not be a problem. Continuing an interrupted download, maybe not. Ctrl-C could be handled, I guess. It worked on Windows, and I installed a handler for progress bars. All in all, 2 screens of code plus 2 screens for generating the URLs to retrieve.
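
A rough sketch of that kind of setup, using a thread pool and urlretrieve's reporthook for progress; the URLs and the pool size are placeholders, and it uses Python 3's urllib.request rather than the old urllib:

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlretrieve

# Placeholder URLs; each file is saved under its basename.
urls = ["http://example.com/feed1.xml", "http://example.com/feed2.xml"]

def make_hook(url):
    # urlretrieve calls the hook as (block_num, block_size, total_size);
    # total_size may be -1 if the server does not report a length.
    def hook(block_num, block_size, total_size):
        print("%s: ~%d of %d bytes" % (url, block_num * block_size, total_size))
    return hook

def download(url):
    filename = url.rsplit("/", 1)[-1]
    urlretrieve(url, filename, reporthook=make_hook(url))
    return filename

with ThreadPoolExecutor(max_workers=10) as pool:
    print(list(pool.map(download, urls)))
```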

towi
0

This seems pretty flexible:

http://keramida.wordpress.com/2010/01/19/parallel-downloads-with-python-and-gnu-wget/

Evan Mulawski
-1

Threading isn't "half-assed" unless you're a bad programmer. The best general approach to this problem is the producer / consumer model. You have one dedicated URL producer, and N dedicated download threads (or even processes if you use the multiprocessing model).

As for all of your requirements, ALL of them CAN be done with the normal python threaded model (yes, even catching Ctrl+C -- I've done it).
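
As a minimal illustration of the Ctrl+C point (not the answerer's actual code): daemon worker threads die with the main thread, so a KeyboardInterrupt in the main loop is enough to tear everything down. The fetch is only a stub here:

```python
import queue
import threading
import time

work_q = queue.Queue()

def worker():
    while True:
        url = work_q.get()
        time.sleep(1)                 # stand-in for the actual download
        print("fetched", url)

for _ in range(4):
    # Daemon threads die as soon as the main thread exits.
    threading.Thread(target=worker, daemon=True).start()

for i in range(20):
    work_q.put("http://example.com/feed%d.xml" % i)

try:
    # Simplification: exit once the queue is drained; short sleeps keep
    # the main thread responsive to Ctrl+C.
    while not work_q.empty():
        time.sleep(0.2)
except KeyboardInterrupt:
    print("interrupted, exiting")     # nothing to clean up: workers are daemons
```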

Chris Eberle
  • Apparently multiprocessing is better for concurrency in Python due to the global interpreter lock, but assuming the GIL timing is quick enough, threading would probably work fine for this sort of thing, what with the latency you'll generally get and how each thread will be blocking for I/O access anyway as they get more data in. I'm no expert, though, so multiple processes may still be better for this situation in Python. – JAB Jul 19 '11 at 18:36
  • Yeah, since everything is IO bound anyways the GIL won't have a noticeable impact. – Chris Eberle Jul 19 '11 at 20:48