0

In today's programming world of multicore, multithreaded CPUs (the one in my notebook has two cores with two threads per core), it makes more and more sense to write code that can utilize the available hardware features. Languages like Go were born to make it easier for a programmer to speed up applications by spawning multiple 'independent' processes and synchronizing them again later on.

In this context, when getting in touch with generator functions in Python, I expected that such functions would use the idle time between subsequent item requests to prepare the next yield for immediate delivery. But it seems not to work that way - at least that is my interpretation of the results I got from running the code provided below.

What confused me even more is that the caller of the generator function must wait until the function finishes processing all of its remaining instructions, even if the generator has already delivered all of its items.

Is there any clear reason I currently can't see why a generator function doesn't use the idle time between yield requests to run the code past the requested yield until it meets the next yield instruction - and why it makes the caller wait even when all the items have already been delivered?

Here is the code I have used:

import time
startTime = time.time()
time.sleep(1)
def generatorFunctionF():
    print("# here: generatorFunctionF() lineNo #1", time.time()-startTime)
    for i in range(1,4):
        print("# now: time.sleep(1)", time.time()-startTime)
        time.sleep(1)
        print("# before yield", i, time.time()-startTime)
        yield i
        print("# after  yield", i, time.time()-startTime)
    print("# now: time.sleep(5)", time.time()-startTime)
    time.sleep(5)
    print("# end followed by 'return'", time.time()-startTime)
    return
#:def

def standardFunctionF():
    print("*** before: 'gFF = generatorFunctionF()'", time.time()-startTime) 
    gFF = generatorFunctionF()
    print("*** after:  'gFF = generatorFunctionF()'", time.time()-startTime) 
    print("*** before print(next(gFF)", time.time()-startTime)
    print(next(gFF))
    print("*** after  print(next(gFF)", time.time()-startTime)
    print("*** before time.sleep(3)", time.time()-startTime)
    time.sleep(3)
    print("*** after  time.sleep(3)", time.time()-startTime)
    print("*** before print(next(gFF)", time.time()-startTime)
    print(next(gFF))
    print("*** after  print(next(gFF)", time.time()-startTime)
    print("*** before list(gFF)", time.time()-startTime)
    print("*** list(gFF): ", list(gFF), time.time()-startTime)
    print("*** after:  list(gFF)", time.time()-startTime)
    print("*** before time.sleep(3)", time.time()-startTime)
    time.sleep(3)
    print("*** after  time.sleep(3)", time.time()-startTime)
    return "*** endOf standardFunctionF"

print()
print(standardFunctionF)
print(standardFunctionF())

gives:

>python3.6 -u "aboutIteratorsAndGenerators.py"

<function standardFunctionF at 0x7f97800361e0>
*** before: 'gFF = generatorFunctionF()' 1.001169204711914
*** after:  'gFF = generatorFunctionF()' 1.0011975765228271
*** before print(next(gFF) 1.0012099742889404
# here: generatorFunctionF() lineNo #1 1.0012233257293701
# now: time.sleep(1) 1.0012412071228027
# before yield 1 2.0023491382598877
1
*** after  print(next(gFF) 2.002397298812866
*** before time.sleep(3) 2.0024073123931885
*** after  time.sleep(3) 5.005511283874512
*** before print(next(gFF) 5.005547761917114
# after  yield 1 5.005556106567383
# now: time.sleep(1) 5.005565881729126
# before yield 2 6.006666898727417
2
*** after  print(next(gFF) 6.006711006164551
*** before list(gFF) 6.0067174434661865
# after  yield 2 6.006726026535034
# now: time.sleep(1) 6.006732702255249
# before yield 3 7.0077736377716064
# after  yield 3 7.0078125
# now: time.sleep(5) 7.007838010787964
# end followed by 'return' 12.011908054351807
*** list(gFF):  [3] 12.011950254440308
*** after:  list(gFF) 12.011966466903687
*** before time.sleep(3) 12.011971473693848
*** after  time.sleep(3) 15.015069007873535
*** endOf standardFunctionF
>Exit code: 0
Claudio
  • Not sure what the second part of your question (about "having to wait until the generator finishes") means. Please clarify what you mean there. – BrenBarn Apr 06 '17 at 04:55
  • Don't forget that the same mechanism for `generators` can be `coroutines`, e.g. `x = yield 10`, this suspends after `yield`ing `10` but the assignment happens at the next `send(5)` or `next(...)`. You may want to look into `asyncio` – AChampion Apr 06 '17 at 05:03
  • This kind of behavior would interfere with timed data, like a daily server query. Fresh data, delivered on demand, are generally more desirable than stale data that were eagerly fetched and then sat on until the next request arrived. – TigerhawkT3 Apr 06 '17 at 05:03
  • @BrenBarn: after the code of a generator function leaves the loop there could be further commands in the generator function. In the example code I have provided the remaining code does not contain any yield keyword, but the caller must wait for the delivery of the items until the code is processed (in the code example time.sleep(5) seconds long) . – Claudio Apr 06 '17 at 05:06
  • @Claudio: I think you are misunderstanding what generators are. They are functions that suspend between yield statements (and return statements). That is what they do. They are not meant to be some kind of optimized way to do processing in the background. If you write `time.sleep(5)`, then it will sleep for 5 seconds when that code is run. There's no fancy lookahead to see what is *going* to happen; it just resumes when you advance the generator. – BrenBarn Apr 06 '17 at 05:13
  • @Claudio if you wish to speculatively do work, simply place the computation for the next item into a thread pool before you yield the previous return value. – donkopotamus Apr 06 '17 at 05:15
  • @Tigerhawk: I understand that if the code between yields accesses any further unknown resources it doesn't make sense to pre-prepare the next yield, BUT using a generator function for such purpose is not what I mean generator functions are for (this is my current understanding). I have seen the use of generator functions most in examples of some mathematical code where all the processing is done within the scope of internal, local variables. – Claudio Apr 06 '17 at 05:15
  • @donkopotamus: would you like to elaborate it a bit more? I just wrote a generator function where there is a time consuming lookup necessary for delivery of the next item ( http://stackoverflow.com/questions/43168829/getting-unique-combinations-from-a-non-unique-list-of-items-faster/43242637#43242637 ). How could I accomplish the task of having the next yield for immediate delivery on request? – Claudio Apr 06 '17 at 05:22
  • Use an infinite loop and call `next` on the generator after using the previous value, and handle the `StopIteration` exception. – TigerhawkT3 Apr 06 '17 at 05:24
  • @BrenBarn In my eyes a generator function's purpose is to deliver items on yield. So there is not a speculative fancy lookahead necessary to expect the caller to request the next item - that is what the generator is for, isn't it? – Claudio Apr 06 '17 at 05:27
  • @Tigerhawk: won't requesting next() block the caller code from further execution until the item is delivered? I think the generator itself is the right place to program the expected behavior into, instead of spawning further processes from within the caller thread. If I had to handle it from the caller thread, I don't need the generator function ... or do I misunderstand what you intended to tell me? – Claudio Apr 06 '17 at 05:33
  • The `next()` pattern will accomplish what you're talking about in this question. Whether your program can complete other tasks while the generator prefetches the next value seems like a separate problem. Note that Python can't execute code in more than one thread at a time due to the GIL; it needs multiple processes to make real use of a multicore system. – TigerhawkT3 Apr 06 '17 at 05:40
  • @TigerhawkT3: it is hard for me to believe that Python can't do what is standard for go(lang) and what even a bash script can do (I am using a single bash script for accessing multiple drives at the same time for md5 calculations, where the I/O is the bottleneck not the CPU ). Probably I misunderstand what you want to tell me, do I? – Claudio Apr 06 '17 at 05:49
  • Look at the newish `asyncio` module. While generator syntax may be convenient for asynchronous and event loop processing, that's not what it was originally created for. The earliest generator tutorials show how they can be used to replace functions that feed lists sequentially to each other. http://masnun.com/2015/11/13/python-generators-coroutines-native-coroutines-and-async-await.html – hpaulj Apr 06 '17 at 05:58
  • @Claudio: Generators are not for asynchronous processing, nor are they for background processing. If you want things to execute in the background, look into something else (like `multiprocessing` or `concurrent.futures`). – BrenBarn Apr 06 '17 at 06:04
  • @BrenBarn: if I understand you right, there are no reasons why generators don't show the behavior I would expect from them, except the fact that they are what they are and they are there in order to handle unnecessary allocations of memory and nothing more. If I want a generator to be able to use multi core / multi threading I have just to write one showing such behavior myself and maybe propose a new category of generators to be provided in next Python versions. Have I got this right? – Claudio Apr 06 '17 at 06:25
  • @Claudio: That is basically right. Generators have nothing to do with multicore/multithreading. However, before you go proposing any new stuff, you should look at the new async features introduced in Python 3.6, as well as the libraries I mentioned above. There are various ways of doing multicore/multithreaded stuff in Python, but generators are orthogonal to that. – BrenBarn Apr 06 '17 at 07:12
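donkopotamus's suggestion from the comments above - speculatively computing the next item in a thread pool before yielding the previous one - can be sketched roughly like this (an illustrative sketch; the names `slow_compute` and `prefetching_generator` are made up for the example):

```python
import concurrent.futures
import time

def slow_compute(i):
    # stand-in for an expensive per-item computation
    time.sleep(1)
    return i * i

def prefetching_generator(n):
    # compute item k+1 in a worker thread while the caller consumes item k
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(slow_compute, 0)
        for i in range(1, n + 1):
            value = future.result()  # wait for the prefetched item
            if i < n:
                future = pool.submit(slow_compute, i)  # start the next item early
            yield value

for v in prefetching_generator(3):
    time.sleep(1)  # caller-side work overlaps with the next prefetch
    print(v)
```

Because `time.sleep` (like most blocking I/O) releases the GIL, the caller-side work and the prefetch genuinely overlap here, roughly halving the total runtime compared to a plain generator.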

3 Answers

2

Generators were designed as a simpler, shorter, easier-to-understand syntax for writing iterators. That was their use case. People who want to make iterators shorter and easier to understand do not want to introduce the headaches of thread synchronization into every iterator they write. That would be the opposite of the design goal.

As such, generators are based around the concept of coroutines and cooperative multitasking, not threads. The design tradeoffs are different; generators sacrifice parallel execution in exchange for semantics that are much easier to reason about.

Also, using separate threads for every generator would be really inefficient, and figuring out when to parallelize is a hard problem. Most generators aren't actually worth executing in another thread. Heck, they wouldn't be worth executing in another thread even in GIL-less implementations of Python, like Jython or Grumpy.

If you want something that runs in parallel, that's already handled by starting a thread or process and communicating with it through queues.
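That pattern - a background producer pushing into a queue that the consumer drains - might look like this (an illustrative sketch using a thread and `queue.Queue`; a process plus `multiprocessing.Queue` works analogously):

```python
import queue
import threading
import time

def producer(q, n):
    # runs in the background, computing items ahead of the consumer
    for i in range(1, n + 1):
        time.sleep(0.1)  # stand-in for expensive per-item work
        q.put(i)
    q.put(None)  # sentinel: no more items

q = queue.Queue(maxsize=2)  # bound how far ahead the producer may run
threading.Thread(target=producer, args=(q, 3), daemon=True).start()

items = []
while True:
    item = q.get()  # blocks only when the producer has not prefetched yet
    if item is None:
        break
    items.append(item)
print(items)  # prints [1, 2, 3]
```

The bounded queue gives you exactly the "prepare the next item during idle time" behavior the question asks about, but explicitly rather than implicitly.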

user2357112
  • Hmmm ... As stated here [link](https://jeffknupp.com/blog/2013/04/07/improve-your-python-yield-and-generators-explained/index.html) _"Many report having difficulty understanding generators and the yield keyword even after making a concerted effort to teach themselves the topic."_ (that includes me) **I have trouble to accept: "Generators were designed as a simpler, shorter, easier-to-understand syntax for writing iterators.".** – Claudio Apr 07 '17 at 00:28
  • 1
    @Claudio: [Well, they were.](https://www.python.org/dev/peps/pep-0255/) As hard as `yield` may be for newbies to understand, it's a lot easier than writing highly stateful or recursive iterators by hand. Do you think those newbies would have had an *easier* time if this syntax automatically introduced threads as well? – user2357112 Apr 07 '17 at 00:31
  • `GOMAXPROCS` stopped defaulting to `1` in [Go 1.5](https://golang.org/doc/go1.5) (August 2015). – LukeShu Apr 19 '20 at 21:40
  • @LukeShu: Huh, you're right. I must have been going off of outdated docs (or outdated memory) at the time I wrote this answer. – user2357112 Apr 19 '20 at 22:17
1

Because the code between yields may have side effects. You advance the generator not just when you "want the next value" but when you want to advance the generator by continuing to run the code.
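A tiny illustration of that point: the side effect placed after the first `yield` must not happen until the generator is advanced again, so eager lookahead would change the program's observable behavior:

```python
log = []

def gen():
    log.append("before yield 1")
    yield 1
    log.append("after yield 1")  # side effect: must not run early
    yield 2

g = gen()
next(g)
assert log == ["before yield 1"]  # nothing past the first yield has run yet
next(g)
assert log == ["before yield 1", "after yield 1"]
```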

BrenBarn
  • It is clear to me, that some weird code can have side effects. But if I can clearly see myself from the code that there won't be any, it should in my eyes be possible for the interpreter to see it too and provide in such case the feature of immediate delivery. – Claudio Apr 06 '17 at 04:58
  • 1
    @Claudio: Even when you think you can see there are no side effects, you're probably wrong. External code could change the value of global variables like `time` that are used by the generator function. Determining whether the code will have side effects is extremely difficult in a dynamic language like Python where the value of almost anything can change at almost any time. – BrenBarn Apr 06 '17 at 05:11
  • @Claudio can we eagerly evaluate a simple `for x in a: yield x` ? How would you go about proving it? – donkopotamus Apr 06 '17 at 05:13
  • @donkopotamus: I would precede `yield x` with `a = copy.deepcopy(a)` to achieve a clear and predictable behaviour or check if a is a tuple or a string. – Claudio Apr 06 '17 at 05:41
  • @BrenBarn: can't live with `Because the code between yields may have side effects`. Any code and any function/class may have side effects. As there is no space in this comment for more clarifying text I will provide instead of a comment an answer which tries to utilize all the up to now provided suggestions to the extent of my current level of knowledge. – Claudio Apr 06 '17 at 14:39
  • 1
    @Claudio: Right, any code may have side effects, and generator functions may contain any code. That's why no code (in generators or anywhere else) runs in the background unless you explicitly make it do so by launching a new thread or a new process. – BrenBarn Apr 06 '17 at 18:34
  • @BrenBarn: inspired by what you have said above I was able to come down to the core of the issue and had totally rewritten the before given answer. – Claudio Apr 06 '17 at 23:58
-2

The question about the expected feature of generator functions in Python should be seen from the perspective of the much wider subject of

implicit parallelism

Here is an excerpt from Wikipedia: "In computer science, implicit parallelism is a characteristic of a programming language that allows a compiler or interpreter to automatically exploit the parallelism inherent to the computations expressed by some of the language's constructs."

The essence of the question Is there any important reason why a generator function doesn't prefetch the next item in the idle time between yields? is actually to ask:

"Does Python as a programming language support implicit parallelism?"

And in spite of the opinion expressed by the author of the question ("There are no reasons that make sense why a generator function shouldn't provide this kind of 'intelligent' behavior."), in the context of Python as a programming language the actually correct answer (already given in the comments, but without exposing the core of the issue so clearly) is:

The important reason why a Python generator function doesn't intelligently prefetch the next item in the background for later immediate delivery is the fact that Python as a programming language doesn't support implicit parallelism.


This said, it is surely interesting to explore whether it is possible to provide the expected feature in Python in an explicit way. And yes, it is possible. Let's demonstrate a generator function capable of prefetching the next items in the background, by explicitly programming this feature into the function:

from multiprocessing import Process
import time

def generatorFetchingItemsOnDemand():
    for i in range(1, 4):
        time.sleep(2)
        print("# ...ItemsOnDemand spends 2 seconds for delivery of item")
        yield i

def generatorPrefetchingItemsForImmediateDelivery():
    with open('tmpFile','w') as tmpFile:
        tmpFile.write('')
        tmpFile.flush()

    def itemPrefetcher():
        for i in range(1, 4):
            time.sleep(2)
            print("### itemPrefetcher spends 2 seconds for prefetching an item")
            with open('tmpFile','a') as tmpFile:
                tmpFile.write(str(i)+'\n')
                tmpFile.flush()

    p = Process(target=itemPrefetcher)
    p.start()

    for i in range(1, 4):
        # poll the file until the prefetcher has written item no. i
        while True:
            with open('tmpFile','r') as tmpFile:
                lstFileLines = tmpFile.readlines()
            if len(lstFileLines) >= i:
                break
            time.sleep(0.1)

        yield int(lstFileLines[i-1])
#:def

def workOnAllItems(intValue):
    startTime = time.time()
    time.sleep(2)
    print("workOn(", intValue, "): took", (time.time()-startTime), "seconds")
    return intValue

print("===============================")        
genPrefetch = generatorPrefetchingItemsForImmediateDelivery()
startTime = time.time()
for item in genPrefetch:
    workOnAllItems(item)
print("using genPrefetch workOnAllItems took", (time.time()-startTime), "seconds")
print("-------------------------------")        
print()
print("===============================")        
genOnDemand = generatorFetchingItemsOnDemand()
startTime = time.time()
for item in genOnDemand:
    workOnAllItems(item)
print("using genOnDemand workOnAllItems took", (time.time()-startTime), "seconds")
print("-------------------------------")        

The provided code uses the file system for interprocess communication, so if you want to re-use this concept in your own programming, feel free to replace it with another, faster interprocess communication mechanism. Implementing the generator function the way demonstrated here does what the author of the question expected a generator function to do, and it helps to speed up the application (here from 12 seconds down to 8):

>python3.6 -u "generatorPrefetchingItemsForImmediateDelivery.py"
===============================
### itemPrefetcher spends 2 seconds for prefetching an item
### itemPrefetcher spends 2 seconds for prefetching an item
workOn( 1 ): took 2.0009119510650635 seconds
### itemPrefetcher spends 2 seconds for prefetching an item
workOn( 2 ): took 2.0010197162628174 seconds
workOn( 3 ): took 2.00161075592041 seconds
using genPrefetch workOnAllItems took 8.013896942138672 seconds
-------------------------------

===============================
# ...ItemsOnDemand spends 2 seconds for delivery of item
workOn( 1 ): took 2.0011563301086426 seconds
# ...ItemsOnDemand spends 2 seconds for delivery of item
workOn( 2 ): took 2.001920461654663 seconds
# ...ItemsOnDemand spends 2 seconds for delivery of item
workOn( 3 ): took 2.0002224445343018 seconds
using genOnDemand workOnAllItems took 12.007976293563843 seconds
-------------------------------
>Exit code: 0
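The file-based exchange above can equally be replaced by a `multiprocessing.Queue`, roughly like this (a sketch of the same prefetching idea; the function names are made up for the example):

```python
from multiprocessing import Process, Queue
import time

def itemPrefetcher(q):
    # runs in a separate process, computing items ahead of the consumer
    for i in range(1, 4):
        time.sleep(0.2)  # stand-in for the per-item work
        q.put(i)
    q.put(None)  # sentinel signalling the end of the stream

def generatorPrefetchingViaQueue():
    q = Queue()
    p = Process(target=itemPrefetcher, args=(q,))
    p.start()
    while True:
        item = q.get()  # blocks only if the prefetcher is not ahead yet
        if item is None:
            break
        yield item
    p.join()

if __name__ == '__main__':
    print(list(generatorPrefetchingViaQueue()))  # prints [1, 2, 3]
```

The queue removes the polling loop and the temp file entirely; the `if __name__ == '__main__'` guard is needed so the code also works on platforms that spawn rather than fork child processes.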
Claudio