I am trying to download a list of files in parallel, making use of gevent.

My code is a slight modification of the code suggested here:

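# Imports inferred from the names used below.
# Patch the standard library first so that urllib2's blocking calls yield to other greenlets.
from gevent import monkey
monkey.patch_all()

import os
import urllib2

import gevent
from gevent.fileobject import FileObject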

def download_xbrl_files(download_folder, yq, list_of_xbrl_urls):
    def download_and_save_file(url, yr, qtr):
        if url is not None:
            full_url = "http://edgar.sec.gov" + url
            # Build the local path first so the existence check tests the file, not the URL.
            filename = download_folder + "/" + str(yr) + "/" + qtr + "/" + url.split('/')[-1]
            if not os.path.exists(filename):
                try:
                    content = urllib2.urlopen(full_url).read()
                    print "Saving: ", filename
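                    f_raw = open(filename, "wb")  # binary mode: write the bytes unchanged
                    f = FileObject(f_raw, "wb")
                    try:
                        f.write(content)
                    finally:
                        f.close()
                    # Return outside `finally` so a failed write is not silently reported as 'Done'.
                    return 'Done'
                except Exception: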
                    print "Warning: can't save or access for item:", url
                    return None
            else:
                return 'Exists'
        else:
            return None
    (y, q) = yq
    if utls.has_elements(list_of_xbrl_urls):
        filter_for_none = filter(lambda x: x is not None, list_of_xbrl_urls)
        no_duplicates = list(set(filter_for_none))
        download_files = [gevent.spawn(download_and_save_file, x, y, q) for x in no_duplicates]
        gevent.joinall(download_files)
        return 'completed'
    else:
        return 'empty'

What the code does:

  1. after some cleaning (dropping None entries and duplicates),
  2. gevent.spawn spawns download_and_save_file for each URL, which
  3. checks whether the file has already been downloaded,
  4. if not, downloads the content with urllib2.urlopen(full_url).read(), and
  5. saves the file with the help of gevent's FileObject.

I have the impression that download_and_save_file only runs sequentially. Furthermore, my application hangs. I could add a timeout, but I would rather handle failures gracefully within my code.

I am wondering whether I am doing something wrong. This is the first time I have written code in Python.
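
A timeout and graceful failure handling are not mutually exclusive. The following is a minimal sketch in Python 3 using only the standard library (the question's code is Python 2). The helper name `download_and_save`, its parameters, and the injectable `opener` argument are assumptions made for illustration, not part of the original code.

```python
import os
import urllib.request


def download_and_save(url, filename, timeout=10, opener=urllib.request.urlopen):
    """Return 'Exists', 'Done', or None on failure (hypothetical helper)."""
    if os.path.exists(filename):
        return 'Exists'
    try:
        # The timeout bounds each blocking socket operation, so a stalled
        # server raises an error instead of hanging the caller forever.
        response = opener(url, timeout=timeout)
        try:
            content = response.read()
        finally:
            response.close()
    except OSError as exc:
        # urllib.error.URLError and socket.timeout are both OSError subclasses.
        print("Warning: can't access", url, "-", exc)
        return None
    # Binary mode writes the downloaded bytes unchanged.
    with open(filename, 'wb') as f:
        f.write(content)
    return 'Done'
```

The caller can then treat `None` as "skip and carry on", which keeps one slow or broken URL from stalling the whole batch.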

EDIT

Here is a version of the code using threads:

import os
import threading
import urllib2

def download_xbrl_files(download_folder, yq_and_url):
    (yq, url) = yq_and_url
    (yr, qtr) = yq
    if url is not None and url != '':
        full_url = "http://edgar.sec.gov" + url
        filename = download_folder + "/" + str(yr) + "/" + qtr + "/" + url.split('/')[-1]
        if not os.path.exists(filename):
            try:
                content = urllib2.urlopen(full_url).read()
                print "Saving: ", filename
                f = open(filename, "wb")
                try:
                    f.write(content)
                    print "Writing done: ", filename
                finally:
                    f.close()
                # Return outside `finally` so a failed write reaches the except clause below.
                return 'Done'
            except Exception:
                print "Warning: can't save or access for item:", url
                return None
        else:
            print "Exists: ", filename
            return 'Exists'
    else:
        return None


def download_filings(download_folder, yq_and_filings):
    threads = [threading.Thread(target=download_xbrl_files, args=(download_folder, x,)) for x in yq_and_filings]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
NoIdeaHowToFixThis
  • I don't know if this has an impact, but when trying to run your code, I saw that you have yr and qtr in your function definition but are using the variable names y and q? – cchristelis May 23 '14 at 08:48
  • Thanks for looking into this. No, the `y` and `q` are defined within `download_xbrl_files` and then passed as parameters to `download_and_save_file` in the lambda within `gevent.spawn`. The issue is really with the way `gevent.spawn` works afaik and as the user who posted an answer earlier on pointed out (where did that answer go?) – NoIdeaHowToFixThis May 23 '14 at 09:00
  • But why do you have yr and qtr in your download_and_save_file function? I can't see where they are used. (The y and q you are using are outside the function's scope...) – cchristelis May 23 '14 at 09:04
  • Oh, you're right. That's a bug introduced by the refactoring I made to have a "smaller" code example to post here on SO. – NoIdeaHowToFixThis May 23 '14 at 09:07
  • @cchristelis: fixed the bug but the general problem with gevent.spawn is not solved – NoIdeaHowToFixThis May 23 '14 at 09:10

1 Answer

I looked into this a little deeper. The basic problem is that gevent.spawn() creates greenlets, not processes or OS threads: all greenlets run in a single OS thread.

Try a simple:

import gevent
from time import sleep
g = [gevent.spawn(sleep, 1) for x in range(100)]
gevent.joinall(g)

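You'll see that this takes about 100 seconds, because the unpatched time.sleep blocks the single thread that all the greenlets share, so they run one after another. (With gevent.sleep, or after monkey.patch_all(), the greenlets would yield to each other while sleeping.) That illustrates the point above.

You are really looking for multithreading, which is provided by the threading module. For a short how-to, have a look at the question "How to use threading in Python?".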
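
As one possible shape for this, here is a minimal sketch in Python 3 that runs a download function over a list of URLs on a bounded pool of threads using `concurrent.futures` from the standard library. The names `download_all` and `download_one`, and the default of 8 workers, are assumptions for illustration; `download_one` stands in for whatever function fetches and saves a single file.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, download_one, max_workers=8):
    """Call download_one(url) for every URL on a pool of worker threads.

    Returns a dict mapping each URL to the worker's result, or to None
    if the worker raised an exception (hypothetical helper).
    """
    urls = list(urls)

    def safe(url):
        # Catch per-URL failures so one bad download does not abort the batch.
        try:
            return download_one(url)
        except Exception as exc:
            print("Warning: failed for", url, "-", exc)
            return None

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() yields results in the same order as the input URLs.
        results = list(pool.map(safe, urls))
    return dict(zip(urls, results))
```

A pool caps the number of simultaneous connections, which one-thread-per-URL does not; whether 8 is a sensible cap depends on the server and the network.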

---update---

Here is a quick example of how you might do this:

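import threading
from time import sleep
# Start every thread before joining any, so the sleeps overlap.
threads = [threading.Thread(target=sleep, args=(1,)) for x in range(10)]
[thread.start() for thread in threads]
[thread.join() for thread in threads]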
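
To check whether threads really overlap, the same experiment can be timed. This is a minimal sketch in Python 3; the function name `timed_parallel_sleep` and the 0.2 second duration are assumptions chosen to keep the run short. In CPython, `time.sleep` and blocking socket reads release the GIL, so waiting threads should not serialize one another.

```python
import threading
import time


def timed_parallel_sleep(n_threads=10, seconds=0.2):
    """Start n_threads sleeping threads and return the total wall-clock time."""
    threads = [threading.Thread(target=time.sleep, args=(seconds,))
               for _ in range(n_threads)]
    start = time.perf_counter()
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start


elapsed = timed_parallel_sleep()
print("10 threads sleeping 0.2 s each took %.2f s in total" % elapsed)
```

If the threads ran one after another, the total would be close to 2 seconds; a total close to 0.2 seconds indicates that they overlapped.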
cchristelis
  • OK, I see. I see in the example that they have only urlopen read. Can I use file `write` in conjunction with thread.start and thread.join? – NoIdeaHowToFixThis May 23 '14 at 09:41
  • Should be quite a simple technology exchange, you are just swapping your threading library, so yeah, I don't see a reason why you would need to change the rest of your code too much. – cchristelis May 23 '14 at 09:45
  • @NoIdeaHowToFixThis I updated my answer to give you an idea how such a swap may look. Hope this helps. – cchristelis May 23 '14 at 09:54
  • I am now using threads (see my edit to the original question) but I still can't see parallelism. Does this have something to do with the Python GIL? – NoIdeaHowToFixThis May 23 '14 at 10:16