Questions about Python 'yield' keyword that I have not found answers elsewhere, and its specific use in a code I am working on

Question

I have been handed a working Python script. I understand its purpose, how it interacts with other modules, and most of its internal architecture. However, I have to do a major overhaul of it, essentially removing some old classes and adding plenty of new subclasses so that it provides more functionality that we need. My question stems from an unexplained discrepancy: one method returns a list containing an object, while another yields the objects one at a time.

# These functions are methods that belong to a class. 
# There is a top level script that instantiates that class and calls
# these methods on that class, and depending on the `self.mode` variable located in the instance namespace, it invokes the different subsequent methods, 
# which are either generateHybridSim() or generatePureHWSim()
# It is worth pointing out here that HybridSimStep and ShangHWSimStep
# are both classes themselves and that they will be instantiated later on as
# I will describe after this chunk of code

def generateSubTests(self) :
  if self.mode == TestStep.HybridSim :
    return self.generateHybridSim()
  elif self.mode == TestStep.PureHWSim or self.mode == TestStep.AlteraSyn \
     or self.mode == TestStep.AlteraNls or self.mode == TestStep.XilinxSyn  :
    return self.generatePureHWSim()

  return []

def generateHybridSim(self) :
  return [ HybridSimStep(self) ]

def generatePureHWSim(self) :
  yield ShangHWSimStep(self)
  num_iter = self.max_scheduling_iterations
  if num_iter > 1 :
    for i in range(1) :
      sim_step = ShangHWSimStep(self)
      sim_step.option = self.option.copy()
      sim_step.hls_base_dir = os.path.join(sim_step.hls_base_dir, str(i))
      sim_step.rtl_output = os.path.join(sim_step.hls_base_dir, sim_step.test_name + ".sv")
      sim_step.option['max_scheduling_iterations'] = i + 1
      yield sim_step

Ultimately, regardless of whether generateHybridSim() or generatePureHWSim() is invoked, the result is consumed in another module in exactly the same way:

# The 'job' that is in front of generateSubTests() is the instance's
# variable name, and you can see that prepareTest() and runTest()
# methods get called on the subtest iterable object, which so happens
# to be class's instance.
# So in short, these two methods are not defined within generateSubTests() method, but
# rather under the classes that the generateHybridSim() and 
# generatePureHWSim() methods had returned or yielded respectively.

for subtest in job.generateSubTests() :
      subtest.prepareTest()
      subtest.runTest()
      time.sleep(1)
      next_active_jobs.append(subtest)

I'm really confused here, and I don't know the significance of using yield here vs return, and I need to figure out why the previous programmer who wrote this script did that. This is because I'll be implementing new subclasses that must contain their own generateSubTests() methods and must adhere to the same calling convention. The fact that he wrote for subtest in job.generateSubTests() means that I am restricted to either returning a list with the instance in it or yielding the instance itself, otherwise it wouldn't fit Python's for loop iteration protocol. I have tried modifying the yield statements in generatePureHWSim() into return statements like the one in generateHybridSim(), and it seems to run fine, although I can't be sure whether that has introduced any subtle bugs. However, I don't know if I'm missing something here. Did the previous programmer want to facilitate concurrency (http://www.dabeaz.com/coroutines/index.html) by turning the function into a generator using yield?

He has since left our lab entirely and so I'm not able to consult him for his help.

Also, I've read up on yield from various sources including the post: What does the "yield" keyword do in Python? ; although they have helped me understand what yield does, I still don't understand how using it here helps us in our context. In fact, I don't even understand why the previous programmer wanted to implement a loop with for subtest in job.generateSubTests() : and force the generatePureHWSim() and generateHybridSim() methods to have to be generators themselves, so that we can have a loop just to call the other methods of prepareTest() and runTest() on the instance. Why couldn't he have just returned the class directly and called those methods???

This is really tripping me up. I would greatly greatly appreciate any help here!!! Thank you.

PS: one more question - I noticed that in general, if you have a function that you defined as:

def a():
  return b
  print "Now we return c"
  return c

It seems like whenever the first statement within is executed and b is returned, the function completes execution and c is never returned, because the statements that come after return b are never reached. Try adding the print statement, and you'll see that it will never be printed.

However, when it comes to yield:

def x():
  yield y
  print "Now we yield z"
  yield z

I noticed that even after the first yield y statement has been executed, the subsequent yield z will get executed. Try adding the print statement, and you'll see that it gets printed out. This is something I observed as I was debugging the above code, and I don't understand this difference in behavior between yield and return. Can someone please enlighten me on it?

Thank you.

score 3 · Accepted Answer · answered Mar 04 '15 at 04:30

I'm glad to tell you there's no concurrency involved.

The previous programmer wanted to have generateSubTests return a collection of subtests (maybe 0, 1 or more subtests). Each of those subtests will then be processed accordingly in the for subtest in job.generateSubTests(): loop.

Actually, if you look closely, generateHybridSim returns a normal Python list containing one subtest, not a generator object. But lists and generator objects are actually very similar things in this context - a sequence of subtests.
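To make that concrete, here is a minimal sketch. The names as_list, as_generator and run_all are hypothetical stand-ins (they are not from your script), and plain strings take the place of the subtest objects. The point is only that the same for-loop accepts either kind of return value:

```python
# Hypothetical stand-in for generateHybridSim(): returns a plain list.
def as_list():
    return ["hybrid_step"]

# Hypothetical stand-in for generatePureHWSim(): a generator function.
def as_generator():
    yield "hw_step_0"
    yield "hw_step_1"

def run_all(subtests):
    # The caller only needs something iterable; it does not care which kind.
    processed = []
    for subtest in subtests:
        processed.append(subtest.upper())
    return processed

from_list = run_all(as_list())            # ['HYBRID_STEP']
from_generator = run_all(as_generator())  # ['HW_STEP_0', 'HW_STEP_1']
print(from_list)
print(from_generator)
```

One difference worth keeping in mind: a list can be looped over again, while a generator object is exhausted after one pass.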

You have to realize that generatePureHWSim(self) is almost equivalent to the following code:

def generatePureHWSim(self) :
  output_list = []
  output_list.append(ShangHWSimStep(self))
  num_iter = self.max_scheduling_iterations
  if num_iter > 1 :
    for i in range(1) :
      sim_step = ShangHWSimStep(self)
      sim_step.option = self.option.copy()
      sim_step.hls_base_dir = os.path.join(sim_step.hls_base_dir, str(i))
      sim_step.rtl_output = os.path.join(sim_step.hls_base_dir, sim_step.test_name + ".sv")
      sim_step.option['max_scheduling_iterations'] = i + 1
      output_list.append(sim_step)
  return output_list

but with one exception. While the code above does all the calculation upfront and puts all the results into a list in memory, your version with yield will immediately yield a single subtest, and will only do the subsequent calculations when asked for the next result.

There are multiple potential benefits to this, including:

Saving on memory (data is loaded only one-at-a-time rather than being loaded into a list all at once)
Saving on calculation (if you might break out of the loop early based on what gets returned)
Sequencing side-effects in a different order (personally not recommended, makes reasoning about code pretty hard).
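
The second benefit can be sketched as follows. This is only an illustration under assumptions: make_step is a hypothetical stand-in for an expensive constructor such as ShangHWSimStep(self), and the log list exists purely to record when each item gets built.

```python
log = []

def make_step(n):
    # Hypothetical stand-in for an expensive constructor.
    log.append("built %d" % n)
    return n

def eager_steps():
    # Builds all three items before returning anything.
    return [make_step(n) for n in range(3)]

def lazy_steps():
    # Builds one item each time the caller asks for the next one.
    for n in range(3):
        yield make_step(n)

def first_item(iterable):
    # Simulates a caller that leaves the loop after the first item.
    for item in iterable:
        return item

del log[:]
first_item(eager_steps())
eager_log = list(log)   # ['built 0', 'built 1', 'built 2']

del log[:]
first_item(lazy_steps())
lazy_log = list(log)    # ['built 0']

print(eager_log)
print(lazy_log)
```

In your loop every subtest is consumed, so this saving does not apply as the code stands; it only matters if the caller might stop early.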

Regarding your second question, as you observed, execution in a Python function ends when you hit the return statement. Putting more code after the return statement in the same code-block is pointless.

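yield does something slightly more complex. Calling a function that contains yield does not run its body; it returns a generator object, which behaves much like a list when you iterate over it. Each yield hands one value to the caller and pauses the function, and execution resumes from that exact point when the next value is requested. That is why the code after your first yield still runs.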

The code below:

def generator_example():
    yield 1
    print "x"
    yield 2

can't really be compared with:

def return_example():
    return 1
    print "x"
    return 2

but is much closer to:

def list_example():
    output_list = []
    output_list.append(1)
    print "x"
    output_list.append(2)
    return output_list

generator_example and list_example both return a sequence that can be iterated over using for-loops.
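
If it helps, you can step through the generator by hand with next() to see where execution pauses and resumes. This is a small sketch in which an events list stands in for the print statement, so that the order of operations can be inspected afterwards:

```python
events = []

def generator_example():
    yield 1
    events.append("x")   # stands in for the print statement above
    yield 2

gen = generator_example()            # nothing in the body has run yet
first = next(gen)                    # runs up to the first yield; first == 1
events_after_first = list(events)    # still [] at this point
second = next(gen)                   # resumes, records "x", then yields 2
events_after_second = list(events)   # now ['x']

# A third request finds no more yields, so the generator signals it is done.
try:
    next(gen)
    finished = False
except StopIteration:
    finished = True

print(first)
print(second)
print(finished)
```

A for-loop does the same thing behind the scenes: it keeps asking for the next value and stops quietly when StopIteration is raised.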

Unrelated comment on the code

The bit below is pretty weird though.

  if num_iter > 1 :
    for i in range(1) :
      sim_step = ShangHWSimStep(self)

There is no reason to use for i in range(1); it just loops once, with i set to 0. I'd strip the for i in range(1) bit out, dedent the code and either replace all occurrences of i with 0, or better, rename i to something more informative and set it explicitly to 0.
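
A rough sketch of that cleanup is below. Since the real classes are not shown in the question, ShangHWSimStep and Job here are minimal hypothetical stand-ins with only the attributes the method touches, and the name iteration is my own choice:

```python
import os

class ShangHWSimStep(object):
    # Minimal stand-in; the real class is not shown in the question.
    def __init__(self, parent):
        self.option = {}
        self.hls_base_dir = parent.hls_base_dir
        self.test_name = parent.test_name
        self.rtl_output = None

class Job(object):
    # Hypothetical owner class holding only the attributes used below.
    def __init__(self, max_scheduling_iterations):
        self.max_scheduling_iterations = max_scheduling_iterations
        self.option = {}
        self.hls_base_dir = "hls"
        self.test_name = "demo"

    def generatePureHWSim(self):
        yield ShangHWSimStep(self)
        if self.max_scheduling_iterations > 1:
            iteration = 0   # previously the loop variable i from range(1)
            sim_step = ShangHWSimStep(self)
            sim_step.option = self.option.copy()
            sim_step.hls_base_dir = os.path.join(sim_step.hls_base_dir, str(iteration))
            sim_step.rtl_output = os.path.join(sim_step.hls_base_dir, sim_step.test_name + ".sv")
            sim_step.option['max_scheduling_iterations'] = iteration + 1
            yield sim_step

steps = list(Job(2).generatePureHWSim())
print(len(steps))
```

The behavior is meant to match the original method: one step always, plus a second step when max_scheduling_iterations is greater than 1.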

answered Mar 04 '15 at 04:30

zehnpaard


Thank you so much for your great explanation here! It has really clarified a lot! But with all that you've discussed here about `yield` vs `return`, how do you think I should go about deciding on which to use for new classes that I have to code that need to have their own `generateSubTests()` methods? Is it that only for those `generate()` methods which I may want to ultimately have multiple calls to the function be done through having multiple subtests that I use `yield`? Or is it better to for all cases, stick to `yield`, since it has the advantages that you mentioned? – AKKO Mar 04 '15 at 06:47
While personally I would go for `yield` because I'm very comfortable with it, it's probably best to use language features that you're familiar with when you are making modifications to production code. In most places, you won't encounter any problems by having a function return a list, instead of using yield. Problems occur if 1) objects like `ShangHWSimStep` are huge, 2) you are making lots of them in the same `generate` call, or 3) the calculations involved in creating them is costly. Then you should bite the bullet and use `yield`. – zehnpaard Mar 04 '15 at 06:55
On the other hand, I do truly recommend getting used to `yield` in an environment that's not modifying production code. It's one of those features of Python that is amazingly useful in all sorts of unexpected places, once you get the hang of it. – zehnpaard Mar 04 '15 at 06:57
Ok cool! I shall get started being familiar with `yield` then and use it here. Separately, I want to ask a question about the `return` and `yield` of other functions within a function, something which as you know has been used here throughout the code samples I have provided. If I understand correctly, there is the `self` keyword inside each of these functions that we `yield` or `return` because these are actually class methods, is that right? If we were not dealing with class methods, just pure functions, then we wouldn't have had to specify `self`, but just the other keyword args right? – AKKO Mar 04 '15 at 07:09
And following on that, I then have another question, why is it that in this case, the code-block could work even though the method `generateSubTests()` in this case was able to successfully `return` or `yield` those other methods (which are `generateHybridSim()` and `generatePureHWSim()` ), even though these other methods were defined only after `generateSubTests()`. – AKKO Mar 04 '15 at 07:15
Because from what I know about functions, I'm pretty sure with this order that it wouldn't work because the interpreter wouldn't understand why these functions in the `return` or `yield` were there since they are only defined after the function that `return` or `yield` them. Has this got to do with the fact that over here, we are dealing with the methods of classes, and whenever a class has been instantiated, all its methods have also been compiled and we can call the methods and the methods can call themselves in any order that need not follow the ordering of their definitions? – AKKO Mar 04 '15 at 07:16
I have a feeling that I may have some misconceptions here, so please do correct me if there are. Thank you very much! – AKKO Mar 04 '15 at 07:17
Do you have functions within a ***function***? Or do you mean a function within a ***class***? In the examples you provide, I don't see any functions being declared within another function, but as you mention, all of the function definitions appear to be methods of objects. – zehnpaard Mar 04 '15 at 09:42
In terms of order of function definition, that's a pretty complicated question in its own right. You may want to check [this stackoverflow answer](http://stackoverflow.com/a/4937565/3155195), look around for others and if none of them explain your exact issue, perhaps it's best to post a separate question. If you put the link here I'll look at it when I have time as well. – zehnpaard Mar 04 '15 at 09:46
I'm actually asking about both. Firstly, I wanted to clarify that the way we did things here that we had to specify the `self` keyword as argument in each method that got `yield` or `return` by `generateSubTests()` (where everything is nested in a class) was because these are functions within the class. – AKKO Mar 04 '15 at 09:51
I wasn't really referring to cases where functions get declared within another function, but rather asking why `generateHybridSim()` and `generatePureHWSim()` can be defined after the `generateSubTests()` method and it still works. I would expect that we need to define the other two methods first, before the definition of `generateSubTests()`, in order for `generateSubTests()` to be able to `return` or `yield` from them the way it does. – AKKO Mar 04 '15 at 09:52
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/72213/discussion-between-zehnpaard-and-akko). – zehnpaard Mar 04 '15 at 09:53
thank you for sharing the links. I don't know if what I've explained here is clear enough for you to be able to answer, but otherwise I can post a separate question. – AKKO Mar 04 '15 at 09:53
