
I want to know the number of items that a generator has generated.

I'm trying to do this by using the output of enumerate to set a global variable. It works on simple tests but goes wrong once I try to adapt the technique to my real application case.

The following script first tests a generator that iterates over the lines of a file, then a generator that parses a file with a bioinformatics library I want to use:

#!/usr/bin/env python3


def test1(delete=False):
    # I have to comment out the following, otherwise I get:
    # $ ./test.py
    # Traceback (most recent call last):
    #   File "./test.py", line 60, in <module>
    #     test1()
    #   File "./test.py", line 31, in test1
    #     print(nb_things)
    # UnboundLocalError: local variable 'nb_things' referenced before assignment
    # if delete:
    #     try:
    #         del nb_things
    #         print("deleted nb_things")
    #     except NameError:
    #         pass
    with open("test.py") as this_file:
        def my_gen():
            for i, thing in enumerate(this_file, start=1):
                yield "just_to_test"
            global nb_things
            nb_things = i
            return

        g = my_gen()
        for _ in g:
            pass

        print(nb_things)
    return 0


import pysam

def test2(delete=False):
    if delete:
        try:
            del nb_things
            print("deleted nb_things")
        except NameError:
            pass
    with pysam.AlignmentFile("/path/to/a/bam/file", "rb") as bamfile:
        def my_gen():
            for i, thing in enumerate(bamfile.fetch(), start=1):
                yield "just_to_test"
            global nb_things
            nb_things = i
            return

        g = my_gen()
        for _ in g:
            pass

        print(nb_things)
    return 0

if __name__ == "__main__":
    test1()
    print("end of test 1")
    test2()
    print("end of test 2")

(As you can see in the comment in the above script, very strange things happen if I include code that mentions my global variable, even when that code is never executed.)

When I execute the above code, the first test succeeds, but not the second, despite a very similar code structure:

$ ./test.py
63
end of test 1
Traceback (most recent call last):
  File "./test.py", line 62, in <module>
    test2()
  File "./test.py", line 53, in test2
    for _ in g:
  File "./test.py", line 49, in my_gen
    nb_things = i
UnboundLocalError: local variable 'i' referenced before assignment

My main question is:

Why does the enumeration counter still exist after the end of the for loop in the first case and not in the second?

I suspect that this has to do with the way the iteration is stopped. In the second case, the generator somehow causes the enumerate result to cease to exist after the internal iterator is stopped.

What could cause such a difference?
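A likely explanation, assuming `bamfile.fetch()` yields nothing (for instance with an empty BAM file): a `for` loop only binds its target when the body runs at least once. Here is a minimal sketch with plain lists standing in for the file and the pysam iterator; `last_index` and `last_index_safe` are hypothetical names used only for illustration:

```python
def last_index(iterable):
    """Return the last enumerate counter, mimicking the pattern in my_gen."""
    for i, _thing in enumerate(iterable, start=1):
        pass
    # `i` is only bound if the loop body ran at least once.
    return i


def last_index_safe(iterable):
    """Same as last_index, but with the counter initialised before the loop."""
    i = 0  # stays 0 when the iterable is empty
    for i, _thing in enumerate(iterable, start=1):
        pass
    return i


print(last_index(["a", "b"]))  # 2

try:
    last_index([])
except UnboundLocalError as err:
    # Empty input: the loop never assigned `i`.
    print("empty input:", err)

print(last_index_safe([]))  # 0
```

If this is indeed the cause, initialising `i = 0` before the loop in `my_gen` should be enough to make the second test print 0 instead of failing.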

A second question that occurred to me while designing the above test script is the following:

Why is the global variable nb_things treated as local when the function contains code that references it, even though that code is never executed? (Note the delete=False, and the absence of a message mentioning the deletion.)
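A minimal sketch of what seems to be going on, using a hypothetical `counter` global rather than `nb_things`: Python decides at compile time whether a name is local to a function. Any binding operation in the function body, including `del`, makes the name local for the entire function, whether or not that line ever runs. Declaring the name `global` in that same function (not only in a nested one) avoids this.

```python
counter = 5  # hypothetical module-level variable


def read_counter_broken(delete=False):
    if delete:
        # Never executed with delete=False, but `del` is a binding operation,
        # so the compiler treats `counter` as local to this whole function.
        del counter
    return counter  # local name that was never assigned


def read_counter_fixed(delete=False):
    global counter  # declared in the function that uses the name
    if delete:
        del counter
    return counter


try:
    read_counter_broken()
except UnboundLocalError as err:
    print("broken:", err)

print("fixed:", read_counter_fixed())  # fixed: 5
```

In the script above, `global nb_things` appears only inside the nested `my_gen`, so it does not affect how `test1` itself resolves `nb_things`.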

I'm using python 3.6 and pysam version 0.10.0.


For an earlier version of the real code (but the essential approach is there), and clues regarding why I ended up defining my generator in the main function, see this question. (Essentially, the reason is that the generator actually uses a function that is defined depending on command-line options.)

bli
  • I actually just found a potential answer for my main question: the bam file is empty, which would explain the absence of the enumeration counter. – bli Feb 03 '17 at 18:17
  • While this is interesting in an academic sense, I think it is terrible practice to attempt to do this. Just yield a `(i, value)` pair and deal with `i` outside your generator! Globals in a generator make me shudder. – juanpa.arrivillaga Feb 03 '17 at 18:20
  • Also, please clean up your code (define your functions outside your logic, not right before you use them once!) and try to produce a minimal, complete and verifiable example. People shouldn't have to sift through the irrelevant parts of your code to answer this question. If you have a huge block commented out, *maybe that's a good indication you should not include it in a StackOverflow question*. – juanpa.arrivillaga Feb 03 '17 at 18:22
  • @juanpa.arrivillaga The commented block is essential to the understanding of the second question. What's wrong with defining some functions using variables from the outer scope instead of passing them as arguments ? – bli Feb 03 '17 at 18:31
  • Global variables lead to code that is hard to reason about, and can be buggy as hell, as you are discovering. Furthermore, it is good practice to keep side effects to a minimum, especially where they would not be expected. No Python programmer would expect a side effect in a generator, aside from maybe a debugging print, and least of all a side effect that changes global state! You're just asking to be bit in the ass. – juanpa.arrivillaga Feb 03 '17 at 18:34
  • Concerning passing the `i`, my actual code includes several levels of wrapping the generator, and these wrappers don't do anything with the `i`. I somewhat feel that it would be cleaner not to have to pass this variable along. The `i` is just used to output some information while the generator is consumed, and I was hoping to be able to use it to also get the final number of generated items. – bli Feb 03 '17 at 18:36
  • I mean, do you even realize you've defined a closure where one of the free-variables is a file-pointer inside of a `with` block? That's just weird. When people give advice like "use global variables sparingly, if at all" or "avoid side effects in your functions unless necessary", this is based on the experience of software engineers everywhere and the best practices they have uncovered to write maintainable, readable, and easy-to-reason-about code. Anyway, the error seems self-explanatory, you are referencing a local variable before assignment! – juanpa.arrivillaga Feb 03 '17 at 18:37
  • @juanpa.arrivillaga I understand that `global` is generally not advised. This is actually the first time in more than 10 years as a python user that I encounter a case where I feel like trying it (there must be cases where `global` is a relevant solution, I suppose?). Also, I added information regarding why I define the generator function late in the code in my real application case. I don't know if my choice was the best one, but it came naturally given the issues I faced. – bli Feb 03 '17 at 20:40
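One possible way to get the final count without a global, and without passing `i` through the intermediate wrappers, is to wrap the innermost iterable in a small counting iterator and read its counter once the outer generator is exhausted. This is only a sketch: `CountingIterator` and `wrap` are hypothetical names, and a plain list stands in for the real data source.

```python
class CountingIterator:
    """Wrap an iterable and count the items pulled from it (hypothetical helper)."""

    def __init__(self, iterable):
        self._iterator = iter(iterable)
        self.count = 0

    def __iter__(self):
        return self

    def __next__(self):
        item = next(self._iterator)  # StopIteration propagates when exhausted
        self.count += 1
        return item


def wrap(items):
    """Stand-in for a wrapper layer that knows nothing about the counter."""
    for item in items:
        yield item.upper()


counted = CountingIterator(["a", "b", "c"])
results = list(wrap(counted))
print(results, counted.count)  # ['A', 'B', 'C'] 3

empty = CountingIterator([])
print(list(wrap(empty)), empty.count)  # [] 0
```

The counter starts at 0, so an empty source is reported as 0 items rather than raising an error.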
