
How do I yield an object from a generator and forget it immediately, so that it doesn't take up memory?

For example, in the following function:

import itertools

def grouper(iterable, chunksize):
    """
    Return elements from the iterable in `chunksize`-ed lists. The last returned
    list may be smaller (if the length of the iterable is not divisible by `chunksize`).

    >>> print list(grouper(xrange(10), 3))
    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    """
    i = iter(iterable)
    while True:
        chunk = list(itertools.islice(i, int(chunksize)))
        if not chunk:
            break
        yield chunk

I don't want the function to hold on to the reference to chunk after yielding it: the chunk is not used again inside the function, and the lingering reference keeps it in memory even after all outside references are gone.


EDIT: using standard Python 2.5/2.6/2.7 from python.org.


Solution (proposed almost simultaneously by @phihag and @Owen): wrap the result in a (small) mutable object and return the chunk anonymously, leaving only the small container behind:

import itertools

def chunker(iterable, chunksize):
    """
    Return elements from the iterable in `chunksize`-ed lists. The last returned
    chunk may be smaller (if length of collection is not divisible by `chunksize`).

    >>> print list(chunker(xrange(10), 3))
    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
    """
    i = iter(iterable)
    while True:
        wrapped_chunk = [list(itertools.islice(i, int(chunksize)))]
        if not wrapped_chunk[0]:
            break
        yield wrapped_chunk.pop()

With this memory optimization, you can now do something like:

for big_chunk in chunker(some_generator, chunksize=10000):
    ... process big_chunk
    del big_chunk  # big_chunk ready to be garbage-collected :-)
    ... do more stuff
Radim
    Often people from other language backgrounds want to micro-manage storage allocation and deallocation; indeed most of the work in those other languages goes expressly toward that end. Resist the temptation to make guesses about memory efficiency of a particular Python construct: memory management is too important to be left to the coder :p – msw Aug 20 '11 at 17:41
  • Thx but unfortunately, I need the memory that is wasted this way. The chunks are big. Currently, I use a workaround (hoisting the loop up into caller's local scope, i.e. not having a separate generator function), but I was wondering if there is a better/cleaner way. – Radim Aug 20 '11 at 17:48
  • 1
    If you really need this kind of control then you are using the wrong language. But in my experience, the overwhelming majority of people who think they need this kind of control really don't. – Karl Knechtel Aug 20 '11 at 18:09
  • 1
    @Karl Knechtel: it is a common misconception that Python somehow cannot handle large data. In fact, Python is rather good at it -- precisely because of its built-in support for generators and co-routines. – Radim Aug 20 '11 at 21:22
  • I didn't say anything about handling large data. I was talking about memory management. Generators and co-routines can only get you so far, especially if you're planning to collate their output with a large `chunksize`. – Karl Knechtel Aug 20 '11 at 22:03
  • @Radim Please, see the question I ask you in the post I wrote. (Comment from anyone else welcome too) – eyquem Aug 21 '11 at 07:04
  • the part that is missing in the question and all answers is memory measurements that show whether the solutions have any impact on the actual memory consumption at all [example](http://stackoverflow.com/questions/11957539/python-memory-not-being-given-back-to-kernel). – jfs Sep 19 '12 at 13:03

5 Answers


After yield chunk, the variable value is never used again in the function, so a good interpreter/garbage collector will already free chunk for garbage collection (note: cpython 2.7 seems not to do this; pypy 1.6 with the default gc does). Therefore, you don't have to change anything but your code example, which is missing the second argument to grouper.

Note that garbage collection is non-deterministic in Python. The null garbage collector, which doesn't collect unreachable objects at all, is a perfectly valid garbage collector. From the Python manual:

Objects are never explicitly destroyed; however, when they become unreachable they may be garbage-collected. An implementation is allowed to postpone garbage collection or omit it altogether — it is a matter of implementation quality how garbage collection is implemented, as long as no objects are collected that are still reachable.

Therefore, it cannot be decided whether a Python program does or "doesn't take up memory" without specifying the Python implementation and garbage collector. Given a specific Python implementation and garbage collector, you can use the gc module to test whether the object is freed.

That being said, if you really want no reference from the function (not necessarily meaning the object will be garbage-collected), here's how to do it:

import itertools

def grouper(iterable, chunksize):
    i = iter(iterable)
    while True:
        tmpr = [list(itertools.islice(i, int(chunksize)))]
        if not tmpr[0]:
            break
        yield tmpr.pop()

Instead of a list, you can also use any other data structure with a method that removes and returns an object, like Owen's wrapper.
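For instance (a variant added for illustration, not from the original answer), a one-element `collections.deque` works just as well, since its `pop()` also removes and returns the stored object:

```python
import itertools
from collections import deque

def grouper(iterable, chunksize):
    i = iter(iterable)
    while True:
        # a one-element deque plays the role of the disposable container
        holder = deque([list(itertools.islice(i, int(chunksize)))])
        if not holder[0]:
            break
        yield holder.pop()   # removes the chunk from the deque and returns it

print(list(grouper(range(10), 3)))   # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```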

phihag
  • Well with a literal reading of the code it's not overwritten until the consumer gives up control again. – Owen Aug 20 '11 at 16:53
  • @Owen Updated the answer to clarify. overwritten was the wrong term, not being used (i.e. can not possibly be used) is better. – phihag Aug 20 '11 at 16:54
  • Ah ok. And I just tested it out and yes Python does delete it when you say. – Owen Aug 20 '11 at 16:56
  • 1
    @Owen I don't think *Python* specifies whether and when the value of `chunk` has to be garbage-collected. The specific Python *implementation* (like cpython, PyPy, Jython) and garbage collector just happen to implement it that way. – phihag Aug 20 '11 at 17:06
  • I don't see that behaviour on Python 2.5.6; how are you testing this? – Radim Aug 20 '11 at 17:12
  • @phihag _"variable value"_ ? What does it mean, _"variable value"_ ? - Also, after the ``yield chunk``, the object **chunk** still exists, so what do you mean with _"After ``yield chunk``, the variable value is never used again in the function,"_ ? – eyquem Aug 20 '11 at 17:16
  • @Owen What kind of test did you perform, please ? – eyquem Aug 20 '11 at 17:19
  • @Radim What do you mean with Python 2.5.6. Do you mean cPython 2.5.6? – phihag Aug 20 '11 at 17:19
  • @phihag I think you are in the widely common misunderstanding of the term 'garbage collector'. As far as I understood, the garbage collector isn't the process that dereferences the objects. – eyquem Aug 20 '11 at 17:22
  • If you run https://gist.github.com/1159371 and look at the order of the statements printed, you'll see that the object `a` is destroyed before control reaches the consumer. cPython 2.7.1+ – Owen Aug 20 '11 at 17:22
  • Yes, standard distribution from python.org. I also tested with Python 2.6.4 and I don't see any optimized deleting there either. – Radim Aug 20 '11 at 17:23
  • @eyquem With variable value, I meant the current value of the variable. Note that it's misleading to speak of an *object* chunk; objects and references to objects are two distinct concepts. After the `yield chunk`, there is no way to access the value stored in `chunk` from this function. Therefore, this function does not hold any references to the object in question, and therefore it can be garbage-collected. – phihag Aug 20 '11 at 17:24
  • @Owen: I'm afraid the object gets dereferenced (and eventually deleted) when it is overwritten in the next iteration, not when it's yielded from the local scope. That is, it doesn't do what I need (on my comp). – Radim Aug 20 '11 at 17:26
  • @eyquem I tried to clarify the wording in the above statement. The garbage collector is in charge of determining when there are no references to an object (i.e. it's impossible to access it from Python) and then clean it up. The *null* garbage collector, which just does nothing, is perfectly valid. – phihag Aug 20 '11 at 17:26
  • 1
    keep in mind if someone is actually using the yielded result (instead of just throwing it away like in my test program), it won't be deleted because there will still be a reference to it; and you wouldn't want it deleted either. – Owen Aug 20 '11 at 17:29
  • Updated the answer with a primer on garbage collection. – phihag Aug 20 '11 at 17:33
  • It seems there is some confusion, so let me repeat the question: how do I make the function **not** hold any reference to the yielded object, so it can be garbage collected. Cheers. – Radim Aug 20 '11 at 17:39
  • @Radim It depends on the Python implementation. With cpython 2.7, there doesn't seem to be a way to do so. If you want fine-grained control over memory management, use a language without garbage collection, like C, not Python. – phihag Aug 20 '11 at 17:51
  • @phihag: Yeah I couldn't find a way either, except getting rid of the function and chunking directly in caller's local scope (that's how I "solved" it for now). But hopefully some Python guru will come forward with a nice hack :) – Radim Aug 20 '11 at 17:54
  • @Radim My mistake, it **is** possible to return the chunk without leaving a reference behind. Updated the answer with a solution, and updated the linked demo with the solution and a better item generator than a list. – phihag Aug 20 '11 at 18:10
  • @phihag Looks like we realized how to do it about the same time ;) – Owen Aug 20 '11 at 18:13
  • @Owen Yup, updated to return the reference :D. The whole buildup about garbage collectors remains important though, the "solution" only works with a specific Python implementation. – phihag Aug 20 '11 at 18:16
  • @phihag: Thx, the proposed code is very elegant! But the first paragraph of your answer is still incorrect; would you mind changing that, in case someone comes across this thread later? – Radim Aug 20 '11 at 18:55
  • @Radim The first paragraph is still correct. `cpython` is not a *good* Python interpreter (Its garbage collector uses reference counting!). Amended the answer. – phihag Aug 20 '11 at 19:07
  • @phihag Did you see the question I asked you in my "answer" ? – eyquem Aug 21 '11 at 18:29
  • @eyquem Sorry, I didn't. Added three comments. You're mostly right ;), and fully so if you don't assume code to be mutable while it's being executed. – phihag Aug 22 '11 at 08:27

If you really really want to get this functionality I suppose you could use a wrapper:

class Wrap:

    def __init__(self, val):
        self.val = val

    def unlink(self):
        val = self.val
        self.val = None
        return val

It could be used like this:

import itertools

def grouper(iterable, chunksize):
    i = iter(iterable)
    while True:
        chunk = Wrap(list(itertools.islice(i, int(chunksize))))
        if not chunk.val:
            break
        yield chunk.unlink()

Which is essentially the same as what phihag does with pop() ;)
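A quick check (added for completeness) that the wrapper version yields the same chunks as the original grouper:

```python
import itertools

class Wrap:
    def __init__(self, val):
        self.val = val

    def unlink(self):
        # hand the value out and drop our own reference to it
        val = self.val
        self.val = None
        return val

def grouper(iterable, chunksize):
    i = iter(iterable)
    while True:
        chunk = Wrap(list(itertools.islice(i, int(chunksize))))
        if not chunk.val:
            break
        yield chunk.unlink()

print(list(grouper(range(10), 3)))   # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```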

Owen
  • Great! Wrapping seems like a good direction. Can you show how to use this pattern with the `grouper` fnc, for example? If it works, you'll have my gratitude (and an accept) :) – Radim Aug 20 '11 at 18:08
  • 1
    Nice, I actually use a similar technique (wrapping in a fake mutable) elsewhere, why didn't I think of it myself :) It seems you got there first, a few minutes before @phihag, and with fewer sidetracks, so I'll accept this. Big thanks to both of you! – Radim Aug 20 '11 at 18:30

@ Radim,

Several points in this thread were perplexing me. I realize I was failing to understand the basics: what your problem actually was.

Now I think I've understood, and I'd like you to confirm it.

I'll represent your code like this:

import itertools

def grouper(iterable, chunksize):
    i = iter(iterable)
    while True:
        chunk = list(itertools.islice(i, int(chunksize)))
        if not chunk:
            break
        yield chunk

............
............
gigi = grouper(an_iterable,4)
# before A
# A = grouper(an_iterable,4)
# corrected:
A = gigi.next()
# after A
................
...........
# deducing an object x from A ; x doesn't consume a lot of memory
............
# deleting A because it consumes a lot of memory:
del A
# code carries on, taking time to execute
................
................
......
..........
# before B
# B = grouper(an_iterable,4)
# corrected:
B = gigi.next()
# after B
.....................
........

Your problem is that even during the time elapsing between
# after deletion of A, code carries on, taking time to execute
and
# before B ,
the object whose name 'A' was deleted still exists and consumes a lot of memory, because there is still a binding between this object and the identifier 'chunk' inside the generator function?

Excuse me for asking about this point, now evident to me.
However, as there was some confusion in the thread at one time, I'd like you to confirm that I have now correctly understood your problem.

.

@ phihag

You wrote in a comment:

1)
After the yield chunk, there is no way to access the value stored in chunk from this function. Therefore, this function does not hold any references to the object in question

(By the way, I wouldn't have written 'therefore', but 'because'.)

I think this affirmation #1 is debatable.
In fact, I'm convinced it is false. But there is a subtlety in what you claim, not in this quotation alone but globally, if we also take into account what you say at the beginning of your answer.

Let us take things in order.

The following code seems to prove the contrary of your affirmation "After the yield chunk, there is no way to access the value stored in chunk from this function."

import itertools

def grouper(iterable, chunksize):
    i = iter(iterable)
    chunk = ''
    last = ''
    while True:
        print 'new turn   ',id(chunk)
        if chunk:
            last = chunk[-1]
        chunk = list(itertools.islice(i, int(chunksize)))
        print 'new chunk  ',id(chunk),'  len of chunk :',len(chunk)
        if not chunk:
            break
        yield '%s  -  %s' % (last,' , '.join(chunk))
        print 'end of turn',id(chunk),'\n'


for x in grouper(['1','2','3','4','5','6','7','8','9','10','11'],'4'):
    print repr(x)

result

new turn    10699768
new chunk   18747064   len of chunk : 4
'  -  1 , 2 , 3 , 4'
end of turn 18747064 

new turn    18747064
new chunk   18777312   len of chunk : 4
'4  -  5 , 6 , 7 , 8'
end of turn 18777312 

new turn    18777312
new chunk   18776952   len of chunk : 3
'8  -  9 , 10 , 11'
end of turn 18776952 

new turn    18776952
new chunk   18777512   len of chunk : 0

.

However, you also wrote (it's the beginning of your answer):

2)
After yield chunk, the variable value is never used again in the function, so a good interpreter/garbage collector will already free chunk for garbage collection (note: cpython 2.7 seems not do this, pypy 1.6 with default gc does).

This time you don't say that the function holds no reference to chunk after `yield chunk`; you say that its value is not used again before chunk is renewed in the next turn of the while loop. That's right: in Radim's code, the object chunk isn't used again before the identifier 'chunk' is re-assigned by the instruction chunk = list(itertools.islice(i, int(chunksize))) in the next turn of the loop.

.

This affirmation #2, different from the preceding #1, has two logical consequences:

FIRST, my above code can't strictly prove, to someone thinking like you, that there is indeed a way to access the value of chunk after the `yield chunk` instruction.
That's because the conditions in my code are not the same as those under which you affirm the contrary: in Radim's code, about which you speak, the object chunk really isn't used again before the next turn.
One can then claim that it is because of the use of chunk in my code (the instructions print 'end of turn',id(chunk),'\n' , print 'new turn   ',id(chunk) and last = chunk[-1] do use it) that a reference to the object chunk is still held after the `yield chunk`.

SECONDLY, going further in the reasoning, gathering your two quotations leads me to conclude that you think it is because chunk is no longer used after the `yield chunk` instruction in Radim's code that no reference to it is maintained.
It's a matter of logic, IMO: the absence of references to an object is the condition for freeing it; hence, claiming that the memory is freed because the object is no longer used is equivalent to claiming that its disuse makes the interpreter delete the reference to it inside the function.

I sum up:
you claim that in Radim's code, chunk is no longer used after `yield chunk`, hence no reference to it is held, hence..... cpython 2.7 won't free it... but pypy 1.6 with its default gc does free the memory of the object chunk.

At this point, I'm very surprised by the reasoning behind this consequence: it would be because chunk is no longer used that pypy 1.6 frees it. You don't express this reasoning explicitly, but without it I would find what you claim in the two quotations illogical and incomprehensible.

What perplexes me in this conclusion, and the reason I don't agree with all of it, is that it implies that pypy 1.6 would be able to analyze the code and detect that chunk won't be used again after `yield chunk`. I find this idea hard to believe, and I would like you:

  • to explain what exactly you think about all that. Where am I wrong in my comprehension of your ideas?

  • to say whether you have proof that, at least in pypy 1.6, no reference to chunk is held once it is no longer used.
    The problem with Radim's initial code was that too much memory was consumed by the persistence of the object chunk, its reference still being held inside the generator function: that was an indirect symptom of the existence of such a persistent reference inside.
    Have you observed similar behavior with pypy 1.6? I don't see another way to put the remaining reference inside the generator in evidence, since, according to your quotation #2, any use of chunk after `yield chunk` is enough to trigger the upholding of a reference to it. It's a problem similar to one in quantum mechanics: the act of measuring a particle's speed modifies that speed.

eyquem
  • Yes, exactly. The extra dangling reference from inside the function is preventing memory from being freed up. Your first example shows this nicely (except it's the chunks, i.e. objects returned by `A.next()`, that are of memory concern here, not `A` itself). I thought it's obvious, but I am not used to SO, I realize I should have been clearer with explanation. People jump at answers very quickly here, so one has to be very very careful not give any opportunity for misreading :) My fault. – Radim Aug 21 '11 at 11:36
  • @Radim Your fault is minor. Yes, it was obvious and I was dull not to understand the problem at first glance and a little ashamed to ask for confirmation. But as you remarked in a comment, things were tangled up at a certain time in the discussion. - It's a relief to meet a user who is conscious of the fact that _"people jump at answers very quickly here"_ that possibly leads to misreading or even miscomprehension, the worst being also that one has the feeling that it's impossible to obtain clarification before 6 answers written in a hurry are posted. Thank you. – eyquem Aug 21 '11 at 11:46
  • @Radim _" it's the chunks, i.e. objects returned by A.next(), that are of memory concern here, not A itself"_ Well,.... ?? **A** is the object returned by the ``yield`` instruction, that is to say a list of elements from the object **iterable**, and a list has no method **next()**, by the way. You mean that the weight of **A** in memory is in fact the weight of its elements, which are the real objects, since **A** is in fact a list of references to these objects? You shouldn't use the term 'chunks' for these elements; it's confusing, since they're elements of what was called **chunk** inside the gen. – eyquem Aug 21 '11 at 11:57
  • No, `A` is a generator -- the yielded objects are only constructed by `A` from `an_iterable` once you call `A.next()`, for example implicitly in a loop: `for chunk in A: print chunk`, or (again implicitly) in `list(A)`. – Radim Aug 21 '11 at 12:06
  • @Radim OK, you're right. I hadn't remarked that I had forgot the ``.next()`` after ``A = grouper(an_iterable,4)`` and ``B = grouper(an_iterable,4)`` By the way, even with ``.next()`` after them it wouldn't have been correct. I've edited my answer. Now it's more coherent with your problem as it is: object **A** == ``gigi.next()``, is still alive after the deleting of its name 'A' because of the persistence of name 'chunk' inside the generator **gigi** – eyquem Aug 21 '11 at 12:33
  • Ok, cool. I added your explanation example (=motivation) to the solution text :-) – Radim Aug 21 '11 at 12:33
  • @Radim OK, cool, now the question is nearly perfect; apart the fact you use the term 'garbage collector' for a thing it doesn't do , in my opinion. But I have the feeling that 99 % of people employ this term like that. I upvoted, however. – eyquem Aug 21 '11 at 12:44
  • @eyquem Sorry I used informal (and probably wrong) terms in my comments. Let's go through it: In quote 1, I meant to say *therefore*, not *because*. If you statically (or dynamically, for that matter) analyze the original function, you see the yielded value of chunk is not accessed anymore. **Therefore**, a clever Python interpreter can just forget about `chunk` until the next iteration. The yielded value of chunk is not *reachable* from within the function. In your "following code", the yielded value *is* reachable from within the function. – phihag Aug 22 '11 at 08:15
  • To 2.: It's all about reachability. `If you cannot reach a value from any context, the garbage collector can free it.`. The equivalent statement is: `If there are no (strong) references to an object, the garbage collector can free it.`. Your FIRST conclusion assumes the false relevance of your proposed code (as explained in my previous comment). Your SECOND conclusion and summary is correct. `any use of chunk after yield chunk is enough to trigger the upholding of a reference to it` is a very good insight into reachability. (continued ..) – phihag Aug 22 '11 at 08:23
  • Note that since Python code can be thought of as immutable (at least during the execution the code in question), a good interpreter can apply precisely this reasoning and there is no use of chunk after `yield chunk` that would trigger upholding of a reference to the yielded value. – phihag Aug 22 '11 at 08:23

First, we can pass a single expression to yield, using the `or` operator, so that if the list is falsy (empty), an alternative expression is evaluated instead.

def grouper(iterable, chunksize):
    i = iter(iterable)
    while True:
        yield list(itertools.islice(i, int(chunksize))) or _stop_iteration()

The _stop_iteration() function stops the iteration by not returning, and thus providing no return value, so nothing further will be yielded once the input iterable is exhausted. Recall that, by definition, a function that doesn't return cannot provide a return value. This is true in almost every programming language that supports functions!

There are two ways a function can provide no value:

  1. By endlessly looping:

    def _stop_iteration_0():
        while True:
            pass
    

    This works, in the sense that only non-empty lists will ever be yielded, but the iteration will then hang. Whereas we want it to Stop. That should ring a bell I hope :)

  2. By raising an exception, and ideally an exception that achieves what we need - namely, stopping the iteration:

    def _stop_iteration():
        raise StopIteration()
    

That's all it takes prior to Python 3.7.

Starting with Python 3.7 (PEP 479), a StopIteration exception leaving the generator function body gets converted to a RuntimeError, so we need to catch it ourselves. Thus we arrive at the implementation below, good for Python 2.5-3.9:

import itertools

def _stop_iteration():
    raise StopIteration()

def grouper(iterable, chunksize):
    """
    Return elements from the iterable in `chunksize`-ed lists. The last returned
    element may be smaller (if length of collection is not divisible by `chunksize`).

    >>> print(list(grouper(range(10), 3)))
    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

    If the list is empty, no elements are returned.

    >>> print(list(grouper([], 3)))
    []

    It is an error to invoke the grouper with chunksize less than 1.

    >>> print(list(grouper([1], 0)))
    Traceback (most recent call last):
     ...
    AssertionError
    """

    assert(chunksize > 0)
    i = iter(iterable)
    try:
        while True:
            yield list(itertools.islice(i, int(chunksize))) or _stop_iteration()
    except StopIteration:
        pass

if __name__ == "__main__":
    import doctest
    doctest.testmod()
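The PEP 479 conversion mentioned above can be observed directly (a minimal illustration; run on Python 3.7+ to see the RuntimeError):

```python
def broken():
    yield 1
    raise StopIteration  # trying to end the iteration "from inside"

try:
    list(broken())
    outcome = "iteration stopped silently"   # behavior before Python 3.7
except RuntimeError:
    outcome = "RuntimeError"                 # PEP 479 behavior, Python 3.7+

print(outcome)
```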
Kuba hasn't forgotten Monica

The function grouper as defined creates wasteful duplicates because you have wrapped a function of no effect around itertools.islice. The solution is to delete your redundant code.

I think there are concessions to C derived languages which are non-Pythonic and are causing excess overhead. For example, you have

i = iter(iterable)
itertools.islice(i)

Why does i exist? iter will not cast a non-iterable into an iterable; there are no such casts. If given a non-iterable, both of those lines would raise an exception; the first does not guard the second.

islice will happily act as an iterator (although it may give economies that a yield statement won't). You've got too much code: grouper probably need not exist.

msw
  • the `it = iter(iterable)` technique is copy&pasted from python's documentation for the `itertools` module. As for the rest, if you propose an alternative solution for the `grouper` functionality that works, I'll happily upvote it :-) – Radim Aug 20 '11 at 18:50
  • I tried without the `iter` step, and unittests start failing. So it is necessary there. This would actually be an interesting question in itself, don't you want to post it on SO? – Radim Aug 20 '11 at 19:50
  • I found the answer, it's quite simple: if you call `islice` directly over the `iterable`, it will always start iterating from the beginning on each call (creating a new iterator each time). So the `break` never happens and the function just hangs. – Radim Aug 21 '11 at 21:47
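The restart behavior described in that last comment is easy to demonstrate (a small illustration of why the `i = iter(iterable)` step is needed):

```python
import itertools

data = [0, 1, 2, 3, 4, 5]

# islice on the list itself builds a fresh iterator each call,
# so it always starts from the beginning:
print(list(itertools.islice(data, 3)))   # [0, 1, 2]
print(list(itertools.islice(data, 3)))   # [0, 1, 2] again -- grouper would never hit `break`

# islice on one shared iterator consumes it progressively:
it = iter(data)
print(list(itertools.islice(it, 3)))     # [0, 1, 2]
print(list(itertools.islice(it, 3)))     # [3, 4, 5]
print(list(itertools.islice(it, 3)))     # [] -- exhausted, so the `break` fires
```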