
I'm writing a program that has to download a bunch of files from the web before it can even run, so I created a function called init_program that downloads all the files and "initializes" the program. It works by running through a couple of dicts that map names to gist URLs on GitHub, pulling each URL and downloading it with urllib2. I can't include all the files here, but you can try it out by cloning the repo here. Here's the function that creates the files from the gists:

def init_program():
    """ Initialize the program and allow all the files to be downloaded
        This will take a while to process, but I'm working on the processing
        speed """

    downloaded_wordlists = []  # Used to count the amount of items downloaded
    downloaded_rainbow_tables = []

    print("\n")
    banner("Initializing program and downloading files, this may take awhile..")
    print("\n")

    # INIT_FILE is a file that will contain "false" if the program is not initialized
    # And "true" if the program is initialized
    with open(INIT_FILE) as data: 
        if data.read() == "false": 
            for item in GIST_DICT_LINKS.keys():
                sys.stdout.write("\rDownloading {} out of {} wordlists.. ".format(len(downloaded_wordlists) + 1, 
                                                                                  len(GIST_DICT_LINKS.keys())))
                sys.stdout.flush()
                new_wordlist = open("dicts/included_dicts/wordlists/{}.txt".format(item), "a+") 
                # Download the wordlists and save them into a file
                wordlist_data = urllib2.urlopen(GIST_DICT_LINKS[item])
                new_wordlist.write(wordlist_data.read())
                downloaded_wordlists.append(item + ".txt")
                new_wordlist.close()

            print("\n")
            banner("Done with wordlists, moving to rainbow tables..")
            print("\n")

            for table in GIST_RAINBOW_LINKS.keys():
                sys.stdout.write("\rDownloading {} out of {} rainbow tables".format(len(downloaded_rainbow_tables) + 1, 
                                                                                    len(GIST_RAINBOW_LINKS.keys())))
                new_rainbowtable = open("dicts/included_dicts/rainbow_tables/{}.rtc".format(table))
                # Download the rainbow tables and save them into a file
                rainbow_data = urllib2.urlopen(GIST_RAINBOW_LINKS[table])
                new_rainbowtable.write(rainbow_data.read())
                downloaded_rainbow_tables.append(table + ".rtc")
                new_rainbowtable.close()

            open(data, "w").write("true").close()  # Will never be initialized again
        else:
            pass

    return downloaded_wordlists, downloaded_rainbow_tables

This works, yes, but it's extremely slow due to the size of the files: each one has at least 100,000 lines in it. How can I speed up this function to make it faster and more user-friendly?

papasmurf

2 Answers


Some weeks ago I faced a similar situation where I needed to download many huge files, but all of the simple pure-Python solutions that I found were not good enough in terms of download optimization. So I found Axel, a light command-line download accelerator for Linux and Unix.

What is Axel?

Axel tries to accelerate the downloading process by using multiple connections for one file, similar to DownThemAll and other famous programs. It can also use multiple mirrors for one download.

Using Axel, you will get files faster from the Internet. Axel can speed up a download by up to 60% (approximately, according to some tests).

Usage: axel [options] url1 [url2] [url...]

--max-speed=x       -s x    Specify maximum speed (bytes per second)
--num-connections=x -n x    Specify maximum number of connections
--output=f      -o f    Specify local output file
--search[=x]        -S [x]  Search for mirrors and download from x servers
--header=x      -H x    Add header string
--user-agent=x      -U x    Set user agent
--no-proxy      -N  Just don't use any proxy server
--quiet         -q  Leave stdout alone
--verbose       -v  More status information
--alternate     -a  Alternate progress indicator
--help          -h  This information
--version       -V  Version information

Since Axel is written in C and there's no C extension for Python, I used the subprocess module to execute it externally, and it works perfectly for me.

You can do something like this:

import subprocess

# n_connections, filename and url come from your own loop over the files to fetch
cmd = ['/usr/local/bin/axel', '-n', str(n_connections), '-o',
       "{0}".format(filename), url]
process = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)

You can also track the progress of each download by parsing the output on stdout:

    while True:
        line = process.stdout.readline()
        if not line:  # axel has exited and closed its stdout
            break
        progress = YOUR_GREAT_REGEX.match(line).groups()
        ...
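
If you don't need to stream the progress and just want each file on disk, a plain blocking call is enough. Below is a minimal sketch of such a wrapper; the download_with_axel name, the default connection count and the axel path are my own illustration, and only the -n and -o flags come from the usage listing above:

import subprocess

def download_with_axel(url, filename, n_connections=4):
    # Run axel as an external process and block until it exits;
    # its exit code is returned (0 means the download succeeded)
    cmd = ['/usr/local/bin/axel', '-n', str(n_connections), '-o', filename, url]
    return subprocess.call(cmd)

You could then call this from init_program for each entry in GIST_DICT_LINKS and GIST_RAINBOW_LINKS instead of urllib2.urlopen().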

GustavoIP

You're blocking whilst you wait for each download, so the total time is the sum of the round-trip time for each download, and your code will likely spend most of that time waiting for network traffic. One way to improve this is not to block whilst you wait for each response. You can do this in several ways, for example by handing off each request to a separate thread (or process), or by using an event loop and coroutines. Read up on the threading and asyncio modules.
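
For illustration, here is a minimal sketch of the thread-pool approach using concurrent.futures (Python 3, so urllib.request replaces urllib2; GIST_DICT_LINKS and the wordlists directory are taken from the question, while fetch_wordlist, download_all and max_workers are hypothetical names of my own):

import urllib.request
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_wordlist(name, url):
    # Runs in a worker thread, so several downloads can be in flight at once
    data = urllib.request.urlopen(url).read()
    with open("dicts/included_dicts/wordlists/{}.txt".format(name), "wb") as out:
        out.write(data)
    return name

def download_all(links, max_workers=8):
    downloaded = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_wordlist, name, url): name
                   for name, url in links.items()}
        for future in as_completed(futures):
            downloaded.append(future.result())
            print("Downloaded {} of {} wordlists".format(len(downloaded), len(futures)))
    return downloaded

The same pattern works for the rainbow tables; tune max_workers to however many simultaneous connections the host will tolerate.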

Paul Rudin
  • Elaborate what you mean by blocking while waiting for each download? – papasmurf Dec 07 '16 at 12:38
  • urlopen() followed by read() means that you're waiting for a connection to be opened, the request to be sent and the response to arrive. This network traffic is likely to take a significant amount of time, and most of the time taken by your code is spent waiting for it. When you've got lots of requests to make, you don't want to wait for the response to the first before you initiate the next. – Paul Rudin Dec 08 '16 at 11:55
  • So how do you propose I do that? Create a queue of threads, and just pull them when I need them? – papasmurf Dec 08 '16 at 12:01