Why is it that executing a set of commands in a function:

def main():
    [do stuff]
    return something

print(main())

will tend to run 1.5x to 3x faster in Python than executing the same commands at the top level:

[do stuff]
print(something)
The difference does indeed greatly depend on what "do stuff" actually does, and mainly on how many times it accesses names that are defined/used. Granted that the code is similar, there is a fundamental difference between these two cases: in a function, local names are loaded and stored with the LOAD_FAST/STORE_FAST instructions, while at the top level of a module the same names are loaded and stored with LOAD_NAME/STORE_NAME, which are more sluggish. This can be seen in the following cases; I'll be using a for loop to make sure that the lookups for the defined variables are performed multiple times.
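Before the longer walkthrough, here's a minimal way to see the dichotomy for yourself; this is a quick illustrative snippet of my own, not part of the timing experiment:

import dis

def f():
    x = 1  # a local name inside a function

dis.dis(f)                                     # shows STORE_FAST
dis.dis(compile("x = 1", "<module>", "exec"))  # shows STORE_NAME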
Function and LOAD_FAST/STORE_FAST:
We define a simple function that does some really silly things:
def main():
    b = 20
    for i in range(1000000): z = 10 * b
    return z
Output generated by dis.dis:
dis.dis(main)
# [/snipped output/]
             18 GET_ITER
        >>   19 FOR_ITER                16 (to 38)
             22 STORE_FAST               1 (i)
             25 LOAD_CONST               3 (10)
             28 LOAD_FAST                0 (b)
             31 BINARY_MULTIPLY
             32 STORE_FAST               2 (z)
             35 JUMP_ABSOLUTE           19
        >>   38 POP_BLOCK
# [/snipped output/]
The thing to note here is the LOAD_FAST/STORE_FAST instructions at offsets 28 and 32; these are used to load the name b for the BINARY_MULTIPLY operation and to store the name z, respectively. As their bytecode names imply, they are the fast versions of the LOAD_*/STORE_* family.
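If you want to confirm which names were compiled into fast-local slots, you can inspect the function's code object; a small check of my own, not part of the original experiment:

# The names compiled into fixed fast-local slots, which
# LOAD_FAST/STORE_FAST address by array index:
print(main.__code__.co_varnames)   # -> ('b', 'i', 'z')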
Modules and LOAD_NAME/STORE_NAME:
Now, let's look at the output of dis for our module version of the previous function:
# compile the module
m = compile(open('main.py', 'r').read(), "main", "exec")
dis.dis(m)
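As a side note, in Python 3 dis.dis() also accepts a source string directly and compiles it for you in 'exec' mode, so the explicit compile() call above is mostly there to show how the module-level code object is produced.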
# [/snipped output/]
             18 GET_ITER
        >>   19 FOR_ITER                16 (to 38)
             22 STORE_NAME               2 (i)
             25 LOAD_CONST               2 (10)
             28 LOAD_NAME                0 (b)
             31 BINARY_MULTIPLY
             32 STORE_NAME               3 (z)
             35 JUMP_ABSOLUTE           19
        >>   38 POP_BLOCK
# [/snipped output/]
Here we have multiple calls to LOAD_NAME/STORE_NAME which, as mentioned previously, are more sluggish instructions to execute. In this case there is going to be a clear difference in execution time, mainly because Python must evaluate LOAD_NAME/STORE_NAME and LOAD_FAST/STORE_FAST many times (due to the for loop I added), so the overhead introduced each time a bytecode instruction's code is executed accumulates.
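If you'd rather avoid hand-rolled time.time() bookkeeping, timeit can drive the same comparison; this is a sketch of my own (exec() and the fresh globals dict add a little overhead of their own, so treat the numbers as indicative):

import timeit

src = """
b = 20
for i in range(1000000): z = 10 * b
"""
module_code = compile(src, "<module>", "exec")

def main():
    b = 20
    for i in range(1000000): z = 10 * b
    return z

# The module-level code object runs with LOAD_NAME/STORE_NAME,
# the function with LOAD_FAST/STORE_FAST.
print(timeit.timeit(lambda: exec(module_code, {}), number=10))
print(timeit.timeit(main, number=10))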
Timing the execution 'as a module':
import time

start_time = time.time()
b = 20
for i in range(1000000): z = 10 * b
print(z)
print("Time: ", time.time() - start_time)
200
Time: 0.15162253379821777
Timing the execution as a function:
start_time = time.time()
print(main())
print("Time: ", time.time() - start_time)
200
Time: 0.08665871620178223
If you time loops over a smaller range (for example, for i in range(1000)), you'll notice that the 'module' version is faster. This happens because the overhead introduced by calling the function main() is larger than that introduced by the *_FAST vs *_NAME difference. So it's largely relative to the amount of work that is done.
So, the real culprit here, and the reason why this difference is evident, is the for loop used. You generally have no reason to ever put an intensive loop like this one at the top level of your script. Move it into a function and avoid using global variables; functions are designed to be more efficient.
You can take a look at the code executed for each bytecode instruction. I'll reference the source for the 3.5 version of Python here, even though I'm pretty sure 2.7 doesn't differ much. Bytecode evaluation is done in Python/ceval.c, specifically in the function PyEval_EvalFrameEx:
As you'll see, the *_FAST bytecodes simply load/store the value using the fastlocals array, a local symbol table contained inside frame objects.
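As a rough mental model of what that C code does, here's a pure-Python illustration of my own (not CPython's actual implementation): LOAD_FAST is essentially one array access, while LOAD_NAME may need up to three dictionary lookups:

# Illustrative model only; the real logic is C code in ceval.c.

def load_fast(fastlocals, index):
    # *_FAST: index into a flat array at a position fixed at compile time.
    return fastlocals[index]

def load_name(local_ns, global_ns, builtin_ns, name):
    # *_NAME: try the locals, globals and builtins dicts in order.
    for namespace in (local_ns, global_ns, builtin_ns):
        if name in namespace:
            return namespace[name]
    raise NameError(name)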