
I have a Python generator that does work producing a large amount of data, which uses up a lot of RAM. Is there a way of detecting if the processed data has been "consumed" by the code that is using the generator, and if so, pausing until it is consumed?

# assumes `web`, `grab`, and `pool` (gevent) are imported elsewhere in the module
from functools import partial

def multi_grab(urls, proxy=None, ref=None, xpath=False, compress=True, delay=10, pool_size=50, retries=1, http_obj=None):
    if proxy is not None:
        proxy = web.ProxyManager(proxy, delay=delay)
        pool_size = len(proxy.records)  # size the pool to the number of proxy records
    work_pool = pool.Pool(pool_size)
    partial_grab = partial(grab, proxy=proxy, post=None, ref=ref, xpath=xpath, compress=compress, include_url=True, retries=retries, http_obj=http_obj)
    for result in work_pool.imap_unordered(partial_grab, urls):
        if result:
            yield result

run from:

if __name__ == '__main__':
    links = set(link for link in grab('http://www.reddit.com',xpath=True).xpath('//a/@href') if link.startswith('http') and 'reddit' not in link)
    print '%s links' % len(links)
    counter = 1
    for url, data in multi_grab(links,pool_size=10):
        print 'got', url, counter, len(data)
        counter += 1
Matt
    You'll have to show us the code. But generators are only able to do work when something calls their `next()` method; they can't produce values on their own, so the solution is for whatever is iterating over them to stop asking for more until it has consumed what it has already taken. – agf Aug 09 '11 at 19:40

4 Answers


A generator simply yields values. There's no way for the generator to know what's being done with them.

But the generator also pauses constantly while the caller does whatever it does: it doesn't execute again until the caller asks it for the next value. It doesn't run on a separate thread or anything. It sounds like you have a misconception about how generators work. Can you show some code?
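
As an illustration, here is a minimal sketch (not part of the original answer; the names are made up) showing that the body of a generator only runs when the consumer asks for the next value, and stays paused in between:

def slow_producer():
    for i in range(3):
        print 'producing', i       # runs only when the consumer asks for the next value
        yield i

gen = slow_producer()
print 'nothing produced yet'       # the generator body has not executed at all so far
for value in gen:
    print 'consumed', value        # the generator stays paused here until the next iteration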

Ned Batchelder
  • https://github.com/mattseh/python-web/blob/master/web.py the multi_grab method and the example under if __name__ == '__main__': I am consuming another generator (work_pool.imap_unordered), which is using threads (greenlets). – Matt Aug 09 '11 at 19:41
  • gevent is introducing asynchronicity (!) into your code, and those threads are able to work independently of the caller pulling values. I'm not sure how best to get the effect you need. – Ned Batchelder Aug 09 '11 at 19:48
  • Seems I need to create my own version of imap_unordered. Thank you for your input! – Matt Aug 09 '11 at 19:51

The point of a generator in Python is to get rid of extra, unneeded objects after each iteration. The only time it will keep those extra objects (and thus extra ram) is when the objects are being referenced somewhere else (such as adding them to a list). Make sure you aren't saving these variables unnecessarily.

If you're dealing with multithreading/multiprocessing, then you probably want to use a Queue that you pull data from, keeping track of the number of tasks in flight.
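
A rough sketch of that idea using gevent, since that is what multi_grab is built on; grab_one, process, and the URL list are placeholder names, and the maxsize bound is only illustrative:

import gevent
from gevent import pool
from gevent.queue import Queue

results = Queue(maxsize=10)        # roughly bounds how many unconsumed results sit in memory
_DONE = object()                   # sentinel marking the end of the stream

def worker(url):
    data = grab_one(url)           # placeholder for the real download call
    results.put(data)              # blocks while the queue is full

def producer(urls):
    work_pool = pool.Pool(50)
    work_pool.map(worker, urls)    # run the workers on greenlets
    results.put(_DONE)

urls = ['http://example.com/%d' % i for i in range(100)]  # stand-in URL list
gevent.spawn(producer, urls)
while True:
    data = results.get()
    if data is _DONE:
        break
    process(data)                  # placeholder for the consuming code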

TorelTwiddler

I think you may be looking for the yield keyword. It is explained in another Stack Overflow question: What does the "yield" keyword do in Python?

Johan Kotlinski

A solution could be to use a Queue to which the generator adds data, while another part of the code gets data from it and processes it. This way you could ensure that there are never more than n items in memory at the same time.
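
A minimal sketch of that approach with the standard library (Python 2 module names; produce_items and handle are placeholder names):

import threading
from Queue import Queue            # the module is named 'queue' in Python 3

q = Queue(maxsize=5)               # never more than 5 unconsumed items in memory

def producer():
    for item in produce_items():   # placeholder for the generator doing the heavy work
        q.put(item)                # blocks until the consumer has made room
    q.put(None)                    # sentinel marking the end of the stream

threading.Thread(target=producer).start()
while True:
    item = q.get()
    if item is None:
        break
    handle(item)                   # placeholder for the consuming code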

mdeous