10

The below code summarises all the numbers in the list held in all_numbers. This makes sense as all the numbers to be summarised are held in the list.

def firstn(n):
    '''Returns list number range from 0 to n '''
    num, nums = 0, []
    while num < n:
        nums.append(num)
        num += 1
    return nums

# all numbers are held in a list which is memory intensive
all_numbers = firstn(100000000)
sum_of_first_n = sum(all_numbers)

# Uses 3.8Gb during processing and 1.9Gb to store variables
# 13.9 seconds to process
sum_of_first_n 

When converting the above function to a generator function, I find I get the same result with less memory used (below code). What I don't understand is how can all_numbers be summarised if it doesn't contain all the numbers in a list like above?

If the numbers are being generated on demand then one would have generate all numbers to summarise them all together, so where are these numbers being stored and how does this translate to reduced memory usage?

def firstn(n):
    num = 0
    while num < n:
        yield num
        num += 1

# all numbers are held in a generator
all_numbers = firstn(100000000)
sum_of_first_n = sum(all_numbers)

# Uses < 100Mb during processing and to store variables
# 9.4 seconds to process
sum_of_first_n

I understand how to create a generator function and why you would want to use them but I don't understand how they work.

David Ross
  • 1,087
  • 1
  • 14
  • 21
  • You might have to generate all the numbers ... but you **don't have to store them all at once**, like your first approach – donkopotamus Jun 18 '17 at 10:18
  • 6
    They're not stored anywhere. See [David Beazley's presentation on generators](http://www.dabeaz.com/generators/). The [pdf](http://www.dabeaz.com/generators/Generators.pdf) – Peter Wood Jun 18 '17 at 10:22
  • As others have said, they are not stored anywhere. That's the whole point of generators. Often you need to look at a lot of values in sequence, but at no point you need all of those values in memory. – timgeb Jun 18 '17 at 10:30
  • Note that your generator version of `firstn` is already provided as `range` (`xrange` in Python 2.x, where `range` would be the list version). – jonrsharpe Jun 18 '17 at 10:38
  • 1
    @PeterWood Thanks for the PDF. It's really useful. – Chiheb Nexus Jun 18 '17 at 11:21
  • 1
    @PeterWood Agreed, this PDF tutorial is gold. Thank you. – David Ross Jun 18 '17 at 11:46

4 Answers4

9

A generator do not store the values, you need to think of a generator as a function with context, it will save it state and GENERATE the values each time it is asked to do so, so, it gives you a value, then "discard" it, hold the context of the computation and wait till you ask for more; and will do so until the generator context is exhausted.

def firstn(n):
    num = 0
    while num < n:
        yield num
        num += 1

In this example you provide, the "only" memory used is num, is where the computation is stored, the firstn generator holds the num in its context till the while loop is finised.

Netwave
  • 40,134
  • 6
  • 50
  • 93
  • So creating a generator doesn't even take up disk? Say I am reading a looping over multiple files and for each loop I download file, yield data then do it for next file. In such case I wouldn't use disk? – haneulkim Aug 08 '23 at 01:10
  • You would just use the size of the data you are yielding each time, if I understood your example correctly – Netwave Aug 12 '23 at 16:02
2

I think a real example of what your first and second functions/methods are doing under the hood will be helpful and you'll understand better what's going.

Let's print what Python is hidden while processing each function/method using locals() :

locals(): Update and return a dictionary representing the current local symbol table. Free variables are returned by locals() when it is called in function blocks, but not in class blocks.

>>> def firstn(n):
     '''Returns list number range from 0 to n '''
     num, nums = 0, []
     while num < n:
         nums.append(num)
         num += 1
         print(locals())
     return nums
>>> firstn(10)

Will print:

{'nums': [0], 'n': 10, 'num': 1}
{'nums': [0, 1], 'n': 10, 'num': 2}
{'nums': [0, 1, 2], 'n': 10, 'num': 3}
{'nums': [0, 1, 2, 3], 'n': 10, 'num': 4}
{'nums': [0, 1, 2, 3, 4], 'n': 10, 'num': 5}
{'nums': [0, 1, 2, 3, 4, 5], 'n': 10, 'num': 6}
{'nums': [0, 1, 2, 3, 4, 5, 6], 'n': 10, 'num': 7}
{'nums': [0, 1, 2, 3, 4, 5, 6, 7], 'n': 10, 'num': 8}
{'nums': [0, 1, 2, 3, 4, 5, 6, 7, 8], 'n': 10, 'num': 9}
{'nums': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], 'n': 10, 'num': 10}
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

But:

>>> def firstn(n):
     num = 0
     while num < n:
         yield num
         num += 1
         print(locals())

>>> list(firstn(10))

will print:

{'n': 10, 'num': 1}
{'n': 10, 'num': 2}
{'n': 10, 'num': 3}
{'n': 10, 'num': 4}
{'n': 10, 'num': 5}
{'n': 10, 'num': 6}
{'n': 10, 'num': 7}
{'n': 10, 'num': 8}
{'n': 10, 'num': 9}
{'n': 10, 'num': 10}
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

So, as you can see the second function/method (your generator) don't care about the past or the next process's results. This function remembers only the last value (the condition to break the while loop) and generate the results in demand.

However, in your first example, your function/method need to store and remember every step along with the value used to stop the while loop then returning the final result... Which makes the process very long compared to your generator.

Chiheb Nexus
  • 9,104
  • 4
  • 30
  • 43
1

this example may help you understand how and when the items are calculated:

def firstn(n):
    num = 0
    while num < n:
        yield num
        print('incrementing num')
        num += 1

gen = firstn(n=10)

a0 = next(gen)
print(a0)      # 0
a1 = next(gen) # incrementing num
print(a1)      # 1
a2 = next(gen) # incrementing num
print(a2)      # 2

the function does not return, but it keeps its internal state (stack frame) and continues from the point it yielded last time.

the for loop just calls next repeatedly.

your next value is calculated on-demand; not all of the possible values need to be in-memory at the time.

hiro protagonist
  • 44,693
  • 14
  • 86
  • 111
  • But, `range` isn't a generator. – Chiheb Nexus Jun 18 '17 at 10:47
  • See this [optimizing [x]range](http://article.gmane.org/gmane.comp.python.python-3000.devel/8732). `range()` is basicaly built in the same way. – Chiheb Nexus Jun 18 '17 at 10:52
  • 1
    i'm sorry, i don't understand. what are you argumenting for/against? – hiro protagonist Jun 18 '17 at 10:57
  • I don't think it's a generator like the 'usual' generators ... it returns `range` as type not the usual generator type. – Chiheb Nexus Jun 18 '17 at 11:07
  • ah, you mean it is not of the type `` but ``? i agree on that. but as it behaves like an iterator you can call it a generator (according to the definition [here](https://wiki.python.org/moin/Generators)). – hiro protagonist Jun 18 '17 at 11:12
  • Yes. But can you call `next(range(10))` ? No. You'll have a TypeError exception. Otherwise, this is not the topic of this question :-) – Chiheb Nexus Jun 18 '17 at 11:15
  • 1
    oh, i see what you mean. and in this form i agree. – hiro protagonist Jun 18 '17 at 11:19
  • `range` objects have other functions that generators don't. For example you can query whether a number is in a range: `5 in range(0, 5)` and it will perform an efficient comparison rather than compare every value in the range. – Peter Wood Jun 18 '17 at 14:28
0

If the sum-function was written in Python, it would probably be similar to this:

def sum(iterable, start=0):
    part_sum = start
    for element in iterable:
        part_sum += element
    return part_sum

(Of course there are many differences between this function and the real sum, but the way it works for your example is very similar.)

If you call sum(all_numbers) with a generator, the variable element only stores the current element and the variable part_sum only stores the sum of all numbers that came before the current element. This way the whole sum can be calculated using only two variables, which obviously need significantly less space than an array that stores all of the 100000000 numbers. The generator itself, as others have pointed out, just stores it current state and continues its calculations from there when called with next and therefore only needs to store n and num in your example.

BurningKarl
  • 1,176
  • 9
  • 12