Why is it that executing a set of commands in a function:

def main():
    [do stuff]
    return something

print(main())

will tend to run 1.5x to 3x faster in Python than executing the same commands at the top level:

[do stuff]
print(something)
The difference does indeed greatly depend on what "do stuff" actually does, and mainly on how many times it accesses names that are defined/used. Granted that the code is similar, there is a fundamental difference between these two cases: in a function, local names are loaded and stored with the LOAD_FAST/STORE_FAST instructions, while at the top level of a module the same names are loaded and stored with LOAD_NAME/STORE_NAME, which are more sluggish. This can be seen in the following cases; I'll be using a for loop to make sure that the lookups for the defined variables are performed multiple times.
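Before the longer walkthrough, here's a minimal way to see the dichotomy for yourself; this is a quick illustrative snippet of my own, not part of the timing experiment:

import dis

def f():
    x = 1  # a local name inside a function

dis.dis(f)                                     # shows STORE_FAST
dis.dis(compile("x = 1", "<module>", "exec"))  # shows STORE_NAME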
Function and LOAD_FAST/STORE_FAST:
We define a simple function that does some really silly things:
def main():
    b = 20
    for i in range(1000000): z = 10 * b
    return z
Output generated by dis.dis:
dis.dis(main)
# [/snipped output/]
             18 GET_ITER
        >>   19 FOR_ITER                16 (to 38)
             22 STORE_FAST               1 (i)
             25 LOAD_CONST               3 (10)
             28 LOAD_FAST                0 (b)
             31 BINARY_MULTIPLY
             32 STORE_FAST               2 (z)
             35 JUMP_ABSOLUTE           19
        >>   38 POP_BLOCK
# [/snipped output/]
The thing to note here is the LOAD_FAST/STORE_FAST instructions at offsets 28 and 32; these are used to load the name b for the BINARY_MULTIPLY operation and to store the name z, respectively. As their bytecode names imply, they are the fast versions of the LOAD_*/STORE_* family.
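If you want to confirm which names were compiled into fast-local slots, you can inspect the function's code object; a small check of my own, not part of the original experiment:

# The names compiled into fixed fast-local slots, which
# LOAD_FAST/STORE_FAST address by array index:
print(main.__code__.co_varnames)   # -> ('b', 'i', 'z')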
Modules and LOAD_NAME/STORE_NAME:
Now, let's look at the output of dis for our module version of the previous function:
# compile the module
m = compile(open('main.py', 'r').read(), "main", "exec")
dis.dis(m)
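As a side note, in Python 3 dis.dis() also accepts a source string directly and compiles it for you in 'exec' mode, so the explicit compile() call above is mostly there to show how the module-level code object is produced.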
# [/snipped output/]
             18 GET_ITER
        >>   19 FOR_ITER                16 (to 38)
             22 STORE_NAME               2 (i)
             25 LOAD_CONST               2 (10)
             28 LOAD_NAME                0 (b)
             31 BINARY_MULTIPLY
             32 STORE_NAME               3 (z)
             35 JUMP_ABSOLUTE           19
        >>   38 POP_BLOCK
# [/snipped output/]
Here we have multiple calls to LOAD_NAME/STORE_NAME which, as mentioned previously, are more sluggish instructions to execute. In this case there is going to be a clear difference in execution time, mainly because Python must evaluate LOAD_NAME/STORE_NAME and LOAD_FAST/STORE_FAST many times (due to the for loop I added), so the overhead introduced each time a bytecode instruction's code is executed accumulates.
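If you'd rather avoid hand-rolled time.time() bookkeeping, timeit can drive the same comparison; this is a sketch of my own (exec() and the fresh globals dict add a little overhead of their own, so treat the numbers as indicative):

import timeit

src = """
b = 20
for i in range(1000000): z = 10 * b
"""
module_code = compile(src, "<module>", "exec")

def main():
    b = 20
    for i in range(1000000): z = 10 * b
    return z

# The module-level code object runs with LOAD_NAME/STORE_NAME,
# the function with LOAD_FAST/STORE_FAST.
print(timeit.timeit(lambda: exec(module_code, {}), number=10))
print(timeit.timeit(main, number=10))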
Timing the execution 'as a module':
import time

start_time = time.time()
b = 20
for i in range(1000000): z = 10 * b
print(z)
print("Time: ", time.time() - start_time)
200
Time: 0.15162253379821777
Timing the execution as a function:
start_time = time.time()
print(main())
print("Time: ", time.time() - start_time)
200
Time: 0.08665871620178223
If you time loops over a smaller range (for example, for i in range(1000)), you'll notice that the 'module' version is faster. This happens because the overhead introduced by calling the function main() is larger than that introduced by the *_FAST vs *_NAME difference. So it's largely relative to the amount of work that is done.
So, the real culprit here, and the reason why this difference is evident, is the for loop used. You generally have no reason to ever put an intensive loop like this one at the top level of your script. Move it into a function and avoid using global variables; functions are designed to be more efficient.
You can take a look at the code executed for each bytecode instruction. I'll reference the source for the 3.5 version of Python here, even though I'm pretty sure 2.7 doesn't differ much. Bytecode evaluation is done in Python/ceval.c, specifically in the function PyEval_EvalFrameEx:
As you'll see, the *_FAST bytecodes simply load/store the value using the fastlocals array, a local symbol table contained inside frame objects.
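As a rough mental model of what that C code does, here's a pure-Python illustration of my own (not CPython's actual implementation): LOAD_FAST is essentially one array access, while LOAD_NAME may need up to three dictionary lookups:

# Illustrative model only; the real logic is C code in ceval.c.

def load_fast(fastlocals, index):
    # *_FAST: index into a flat array at a position fixed at compile time.
    return fastlocals[index]

def load_name(local_ns, global_ns, builtin_ns, name):
    # *_NAME: try the locals, globals and builtins dicts in order.
    for namespace in (local_ns, global_ns, builtin_ns):
        if name in namespace:
            return namespace[name]
    raise NameError(name)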