I'm having trouble understanding the yield keyword.
I understand the effects in terms of what happens when the program gets executed, but I don't really understand how much memory it uses.

I'll try to explain my doubts using examples.
Let's say we have three functions:

import csv

HUGE_NUMBER = 9223372036854775807

def function1():
    for i in range(0, HUGE_NUMBER):
        yield i

def function2():
    x = range(0, HUGE_NUMBER)
    for i in x:
        yield i

def function3(file):
    with open(file, 'r') as f:
        dictionary = dict(csv.reader(f, delimiter=' '))
    for k, v in dictionary.iteritems():
        yield k, v

Does the huge range actually get stored in memory if I iterate over the generator returned by the first function?

What about the second function?

Would my program use less memory if I iterated over the generator returned by the third function (as opposed to just making that dictionary and iterating directly over it)?

Martijn Pieters
iCanLearn
  • I feel like my title is not very good. If someone has an idea on how to improve it, feel free to do so. – iCanLearn Sep 07 '15 at 20:51
  • Yes, it is stored in memory as soon as you call `next()` on it for the first time. Use `xrange()` if you're on Python 2, otherwise it is fine in Python 3. Related: http://stackoverflow.com/a/25457580/846892 – Ashwini Chaudhary Sep 07 '15 at 20:51
  • Which Python version? – user2864740 Sep 07 '15 at 20:53
  • Think of `yield` as freezing the execution of the function just after it yields, until the next value is pulled from the generator (by calling `next(gen)`). Since at that point you have already built the huge range and bound it to `x` in the scope of that function, your memory is already gone; `yield` doesn't change that fact. Using a lazy iterator like `xrange()` (or `imap()`, `izip()`, etc.) in combination with it, however, would. – Lukas Graf Sep 07 '15 at 20:55
  • @user2864740: they use `dict.iteritems()`, so Python 2.x. – Martijn Pieters Sep 07 '15 at 21:00

3 Answers

The huge list produced by the Python 2 range() function will need to be stored, yes, and will take up memory for the full lifetime of the generator function.

A generator function can be memory efficient provided the results it produces are calculated as needed, but the range() function produces all your results up front.

You could just calculate the next number:

def function1():
    i = 0
    while i < HUGE_NUMBER:
        yield i
        i += 1

and you'd get the same result, but you wouldn't be storing all numbers for the whole range in one go. This is essentially what looping over an xrange() object does; it calculates numbers as requested. (In Python 3, range() behaves like Python 2's xrange(); the list-producing version no longer exists.)
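To see the difference concretely, here is a small Python 3 sketch (in Python 3, range() is already lazy, so the object's size is tiny and constant no matter how long the range is; exact byte counts are CPython-specific):

```python
import sys
from itertools import islice

def function1(limit):
    # Manual counter: each number is computed on demand, nothing is stored.
    i = 0
    while i < limit:
        yield i
        i += 1

# A Python 3 range object has a small, fixed size regardless of its length,
# because it only stores start, stop, and step.
print(sys.getsizeof(range(10)))
print(sys.getsizeof(range(10 ** 18)))

# The generator produces the same values, one at a time.
print(list(islice(function1(10 ** 18), 5)))  # [0, 1, 2, 3, 4]
```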

The same applies to your function3; you read the whole file into a dictionary first, so it is still stored in memory as you iterate. There is no need to read the whole file into memory just to yield each element afterwards. You could just loop over the file and yield rows:

def function3(file):
    seen = set()
    with open(file, 'r') as f:
        reader = csv.reader(f, delimiter=' ')
        for k, v in reader:
            if k in seen:
                # already seen
                continue
            seen.add(k)
            yield k, v

This only stores the keys seen, to avoid yielding duplicates (as the dictionary would have collapsed them); the values are not stored. Memory use grows with the set of keys as you iterate over the generator. If duplicates are not an issue, you could omit tracking seen keys altogether:

def function3(file):
    with open(file, 'r') as f:
        reader = csv.reader(f, delimiter=' ')
        for k, v in reader:
            yield k, v

or even a version that yields rows straight from the reader:

def function3(file):
    with open(file, 'r') as f:
        for row in csv.reader(f, delimiter=' '):
            yield row

as the reader is iterable, after all. (Note that simply returning the reader from inside the with block would not work: the file is closed the moment the function returns, so the rows must be yielded while the file is still open.)
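As a quick sanity check that the generator version really streams rows one at a time, here is a self-contained Python 3 sketch (the file contents and temp-file path are made up for the demonstration):

```python
import csv
import os
import tempfile

def function3(path):
    # Yield one (key, value) row at a time; only the current row is in memory.
    with open(path, 'r') as f:
        for k, v in csv.reader(f, delimiter=' '):
            yield k, v

# Write a tiny space-delimited file to iterate over.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, 'w') as f:
    f.write("a 1\nb 2\n")

gen = function3(path)
print(next(gen))   # ('a', '1') -- only the first row has been read so far
print(list(gen))   # [('b', '2')]
os.remove(path)
```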

Martijn Pieters
  • @zch: sure, but this isn't Python 3. The `range()` object is the same thing as `xrange()` in Python 2, but with support for large integers (greater than `sys.maxint`) and a few other niceties added. It basically calculates the numbers like the generator does. – Martijn Pieters Sep 07 '15 at 21:01
  • Oh, so, do I understand correctly that the whole point of yield is to make code prettier because it's a level of abstraction over how you generate the next item (which you could do without calling the function with yield, but would possibly make your code less readable)? – iCanLearn Sep 07 '15 at 21:10
  • @iCanLearn: `yield` simply lets you build a generator function. How that function yields values and what memory the function uses for that is still up to the programmer. You can build more memory efficient programs with them, but it is not a given, just like giving someone professional builders tools doesn't mean they can build a sturdy and practical house. :-) – Martijn Pieters Sep 07 '15 at 21:20
  • @iCanLearn: the code doesn't have to be *pretty* to be memory efficient here, either. – Martijn Pieters Sep 07 '15 at 21:21
  • @iCanLearn: perhaps the [top answer to this question](http://stackoverflow.com/questions/231767/what-does-the-yield-keyword-do-in-python) can help here? – Martijn Pieters Sep 07 '15 at 21:23
  • What I mean to say is that using a function that yields is basically like writing the body of that function everywhere in the code where the function is called, just replacing the yield line with whatever you want to do with the values in the loop. In other words, there's nothing automagical about yield. Am I right? – iCanLearn Sep 07 '15 at 21:29
  • @iCanLearn: No, that is not right. The code is not inlined there. The generator function is *paused*, until you ask for the next value, at which point the code is run until the next `yield` produces that value. Again, read that answer. – Martijn Pieters Sep 07 '15 at 21:30
  • Thanks, I get it now. – iCanLearn Sep 07 '15 at 21:51

The generator object contains a reference to the function's scope and by extension all local objects within it. The way to reduce memory usage is to use iterators at every level possible, not just at the top level.
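A small sketch of that point (function names are made up): a generator that materializes a list keeps that list alive for its whole lifetime, while one that chains lazy iterators holds on to almost nothing:

```python
def eager_doubles(n):
    data = list(range(n))  # this whole list stays alive as long as the generator does
    for x in data:
        yield x * 2

def lazy_doubles(n):
    for x in range(n):  # Python 3 range is lazy; values are produced on demand
        yield x * 2

# Both yield identical values; only their memory behaviour differs.
print(list(eager_doubles(5)))  # [0, 2, 4, 6, 8]
print(list(lazy_doubles(5)))   # [0, 2, 4, 6, 8]
```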

Ignacio Vazquez-Abrams

If you want to check how much memory an object uses, the approach from this post can serve as a proxy. I found it helpful.

"Try this:

sys.getsizeof(object)

getsizeof() calls the object's `__sizeof__` method and adds an additional garbage collector overhead if the object is managed by the garbage collector."

A recursive recipe
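A minimal sketch of using it (Python 3 shown; exact byte counts vary by interpreter and platform):

```python
import sys

x = list(range(1000))
print(sys.getsizeof(x))        # the list object itself (not its elements)
print(sys.getsizeof(iter(x)))  # the iterator over it is far smaller

# Note: getsizeof() is shallow -- it does not count the objects a container
# references, which is why a recursive recipe is needed for deep sizes.
```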

shaimar
  • Do not post answers that are simply links to other pages. Also, this question is 3 years old and it has already been answered. – Havenard Oct 28 '18 at 20:55
  • There's no problem in answering old questions, quite the contrary. The issue is not including in your answer what the solution actually is; links can die easily, and then your answer becomes void. – brasofilo Oct 28 '18 at 20:58