923
def main():
    for i in xrange(10**8):
        pass
main()

This piece of Python code runs in the following time (note: the timing is done with the `time` command in Bash on Linux):

real    0m1.841s
user    0m1.828s
sys     0m0.012s

However, if the for loop isn't placed within a function,

for i in xrange(10**8):
    pass

then it runs for a much longer time:

real    0m4.543s
user    0m4.524s
sys     0m0.012s

Why is this?

thedoctar
  • 18
    How did you actually do the timing? – Andrew Jaffe Jun 28 '12 at 09:21
  • 56
    Just an intuition, not sure if it's true: I would guess it's because of scopes. In the function case, a new scope is created (i.e. a kind of hash with variable names bound to their values). Without a function, variables are in the global scope, where you can find lots of stuff, hence slowing down the loop. – Scharron Jun 28 '12 at 09:22
  • I didn't believe you until I reproduced this. `Python 2.7.2 (default, Jun 12 2011, 15:08:59) [MSC v.1500 32 bit (Intel)] on win32` – Colonel Panic Jun 28 '12 at 09:22
  • 5
    @Scharron That doesn't seem to be it. Defined 200k dummy variables into the scope without that visibly affecting the running time. – Deestan Jun 28 '12 at 09:27
  • 2
    Alex Martelli wrote a good answer concerning this http://stackoverflow.com/a/1813167/174728 – John La Rooy Jun 28 '12 at 09:56
  • 61
    @Scharron you're half correct. It is about scopes, but the reason it's faster in locals is that local scopes are actually implemented as arrays instead of dictionaries (since their size is known at compile-time). – Katriel Jun 28 '12 at 10:31
  • 4
    @AndrewJaffe The output would suggest Linux's `time` command. – Ward Muylaert Jun 28 '12 at 15:35
  • 1
    I just tested this snippet in IPython 2.7.5: `%timeit "def main(): for i in xrange(10**8): pass; main()"` => 100000000 loops, best of 3: 16.9 ns per loop, and `%timeit "for i in xrange(10**8): pass"` => 100000000 loops, best of 3: 16.6 ns per loop – kmonsoor Mar 14 '14 at 07:08

3 Answers

702

Inside a function, the bytecode is:

  2           0 SETUP_LOOP              20 (to 23)
              3 LOAD_GLOBAL              0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_FAST               0 (i)

  3          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               0 (None)
             26 RETURN_VALUE        

At the top level, the bytecode is:

  1           0 SETUP_LOOP              20 (to 23)
              3 LOAD_NAME                0 (xrange)
              6 LOAD_CONST               3 (100000000)
              9 CALL_FUNCTION            1
             12 GET_ITER            
        >>   13 FOR_ITER                 6 (to 22)
             16 STORE_NAME               1 (i)

  2          19 JUMP_ABSOLUTE           13
        >>   22 POP_BLOCK           
        >>   23 LOAD_CONST               2 (None)
             26 RETURN_VALUE        

The difference is that STORE_FAST is faster (!) than STORE_NAME. This is because in a function, i is a local but at toplevel it is a global.

To examine bytecode, use the `dis` module. I was able to disassemble the function directly, but to disassemble the toplevel code I had to use the `compile` builtin.
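
For instance, a minimal sketch of doing both (Python 2, to match the question's xrange; the '<module>' filename passed to compile is an arbitrary placeholder):

import dis

def main():
    for i in xrange(10**8):
        pass

# A function's code object can be disassembled directly:
dis.dis(main)

# Top-level code has no function object, so compile it first:
dis.dis(compile('for i in xrange(10**8):\n    pass', '<module>', 'exec'))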

ecatmur
  • 183
    Confirmed by experiment. Inserting `global i` into the `main` function makes the running times equivalent. – Deestan Jun 28 '12 at 09:33
  • 54
    This answers the question without answering the question :) In the case of local function variables, CPython actually stores these in a tuple (which is mutable from the C code) until a dictionary is requested (e.g. via `locals()`, or `inspect.getframe()` etc.). Looking up an array element by a constant integer is much faster than searching a dict. – dmw Jun 28 '12 at 21:53
  • 3
    It is the same with C/C++ also: using global variables causes significant slowdown – codejammer Jun 29 '12 at 00:49
  • 3
    This is the first I've seen of bytecode. How does one look at it, and is it important to know? – Zack Jun 30 '12 at 22:30
    Only when you want to know why one thing is faster than another, or when you intend to modify or extend CPython. – Warren P Jun 30 '12 at 23:23
    The variables within functions tend to be on the top of the stack, providing quicker access. Global variables are often in the heap or a farther-away segment, depending on compiler implementations, so accessing them is slower (http://stackoverflow.com/questions/1169858/global-memory-management-in-c-in-stack-or-heap) – codejammer Jul 03 '12 at 00:58
  • @codejammer This is very system dependent. On a relatively "dumb" microprocessor, it won't care one bit where the memory resides within RAM. – gkimsey Jul 05 '12 at 13:42
    @gkimsey I agree that it relates heavily to computer architecture. However, I noticed that this is the case with the general-purpose computers that we have – codejammer Jul 06 '12 at 13:54
  • 4
    @gkimsey I agree. Just wanted to share two things: i) this behaviour is noted in other programming languages; ii) the causal agent is more the architectural side and not the language itself in the true sense – codejammer Jul 19 '12 at 13:28
    How do you read the bytecode, and from where can you get the bytecode of a function? – Pratik Singhal Oct 24 '13 at 04:30
  • 3
    @Pratik see the documentation for the `dis` module: http://docs.python.org/library/dis.html – ecatmur Oct 24 '13 at 07:28
  • 1
    @ecatmur Can you tell me how you use the `compile` built in to disassemble the top level code ? – manty Jan 15 '15 at 18:15
  • 3
    @manty write `dis.dis(compile('for i in xrange(10**8):\n pass', 'main.py', 'exec'))` – ecatmur Jan 15 '15 at 18:18
  • 1
    @ecatmur you only need to `dis.dis(sourcecode)`, no need for `compile(sourcecode, "main.py", 'exec')` (NOTE: tested on Python 3.10.1 only) –  Jan 14 '22 at 12:42
    Can anyone explain what `13 FOR_ITER 6 (to 22)` means in the output? I am testing a similar thing and I got it there as well. – Sahil Lohiya Dec 14 '22 at 20:23
585

You might ask why it is faster to store local variables than globals. This is a CPython implementation detail.

Remember that Python code is compiled to bytecode, which the interpreter runs. When a function is compiled, the local variables are stored in a fixed-size array (not a dict) and variable names are assigned to indexes. This is possible because you can't dynamically add local variables to a function. Retrieving a local variable is then literally a pointer lookup into the array and a refcount increase on the PyObject, which is trivial.
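
A quick way to see this fixed, compile-time-known layout is via the standard code-object attributes (a minimal sketch; the function f is made up for the example):

def f(a, b):
    c = a + b
    return c

# Every local gets a fixed slot assigned at compile time:
print(f.__code__.co_varnames)  # ('a', 'b', 'c')
print(f.__code__.co_nlocals)   # 3 -- the size of the locals array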

Contrast this to a global lookup (LOAD_GLOBAL), which is a true dict search involving a hash and so on. Incidentally, this is why you need to specify global i if you want it to be global: if you ever assign to a variable inside a scope, the compiler will issue STORE_FASTs for its access unless you tell it not to.
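
For example, a quick check with the standard `dis` module (a minimal sketch; declaring the variable global switches the store opcode, matching the experiment in the comments on the accepted answer):

import dis

def f():
    global i
    for i in xrange(10):
        pass

# Because of the `global` declaration, the loop now stores via
# STORE_GLOBAL (a dict operation) instead of STORE_FAST:
dis.dis(f)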

By the way, global lookups are still pretty optimised. Attribute lookups foo.bar are the really slow ones!

Here is a small illustration of local variable efficiency.
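
A minimal sketch of such an illustration (the function and class names are made up for the example; `timeit` is the standard library timer):

import timeit

g = 0

def local_read():
    x = 0
    for i in xrange(10**6):
        x          # LOAD_FAST: index into the frame's locals array

def global_read():
    for i in xrange(10**6):
        g          # LOAD_GLOBAL: dict lookup in the module namespace

class Foo(object):
    bar = 0

def attribute_read():
    foo = Foo()
    for i in xrange(10**6):
        foo.bar    # LOAD_ATTR: the really slow one

print(timeit.timeit(local_read, number=10))
print(timeit.timeit(global_read, number=10))
print(timeit.timeit(attribute_read, number=10))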

Katriel
  • 7
    This also applies to PyPy, up to the current version (1.8 at the time of this writing.) The test code from the OP runs about four times slower in global scope compared to inside a function. – GDorn Jun 28 '12 at 17:17
  • This is similar to what was mentioned in a video on javascript scope resolution. I wouldn't have thought global variable lookups were faster than local variable property lookups. Learned something new today! – mowwwalker Jun 28 '12 at 17:45
  • 4
    @Walkerneo They aren't, unless you said it backwards. According to what katrielalex and ecatmur are saying, global variable lookups are slower than local variable lookups due to the method of storage. – Jeremy Pridemore Jun 28 '12 at 18:21
    @JeremyPridemore, The last sentence of katrielalex's answer is "By the way, global lookups are still pretty optimised. Attribute lookups foo.bar are the really slow ones!". Isn't he saying that global lookups are optimized and faster, but attribute lookups are really slow? – mowwwalker Jun 28 '12 at 23:18
  • 2
    @Walkerneo The primary conversation going on here is the comparison between local variable lookups within a function and global variable lookups that are defined at the module level. If you notice in your original comment reply to this answer you said "I wouldn't have thought global variable lookups were faster than local variable property lookups." and they're not. katrielalex said that, although local variable lookups are faster than global ones, even global ones are pretty optimized and faster than attribute lookups (which are different). I don't have enough room in this comment for more. – Jeremy Pridemore Jun 29 '12 at 00:21
    @JeremyPridemore, Sorry, I'm still not seeing the difference between what I'm saying and what he was saying. If foo is local and bar is global, then foo.attr being quicker to access than bar would be in accordance with what I typed in my initial comment. – mowwwalker Jun 29 '12 at 01:50
  • 3
    @Walkerneo foo.bar is not a local access. It is an attribute of an object. (Forgive the lack of formatting)`def foo_func: x = 5`, `x` is local to a function. Accessing `x` is local. `foo = SomeClass()`, `foo.bar` is attribute access. `val = 5` global is global. As for speed local > global > attribute according to what I've read here. So accessing `x` in `foo_func` is fastest, followed by `val`, followed by `foo.bar`. `foo.attr` isn't a local lookup because in the context of this convo, we're talking about local lookups being a lookup of a variable that belongs to a function. – Jeremy Pridemore Jun 29 '12 at 01:57
  • 1
    Correct. A local variable lookup is a fixed-time pointer lookup. A global lookup is a dict search, but with optimisations (I think the dict code is inlined, for instance). An attribute lookup is just a plain Python dict search. – Katriel Jun 29 '12 at 15:15
  • 1
    @Walkerneo Ah, I think I see where the confusion might lie. An attribute lookup like `foo.bar` is _not_ the same as looking up a variable called `"foo.bar"`; instead, it first looks up the (local) variable called `foo` but then searches for `"bar"` in `foo.__dict__`. It's this second bit that takes time. – Katriel Jun 29 '12 at 19:31
  • Is there any documentation that global variables are stored in a dict? The other answer links to documentation to STORE_FAST, used to store local variables, which says "Stores TOS into the local co_varnames[var_num]." I don't know what TOS means, but presumably this means the variable is stored in an array. However in the documentation for STORE_NAME, it's not clear a dict is used. Also, I don't know anything about CPython, so would all this information apply to normal python? – thedoctar Jul 05 '12 at 12:10
  • 4
    @thedoctar have a look at the `globals()` function. If you want more info than that you may have to start looking at the source code for Python. And CPython is just the name for the usual implementation of Python -- so you probably are using it already! – Katriel Jul 06 '12 at 14:45
  • When you guys talk about "attribute lookups" `foo.bar` being slow, does this include using `self.my_var` inside of a class's methods? If so, then why do people use attribute lookups in classes all the time instead of just using one giant global dictionary? – davidtgq Oct 03 '17 at 10:49
  • 2
    Yes, and because most of the time the slight slowness isn't actually important to the running speed of the program. – Katriel Oct 25 '17 at 20:12
  • 1
    "This is possible because you can't dynamically add local variables to a function." What about `locals()[dynamic_expr] = val`? – Quelklef Dec 05 '17 at 13:59
  • 2
    @Quelklef: Modifying the `dict` returned by `locals` is [explicitly noted to be unsupported](https://docs.python.org/3/library/functions.html#locals); it can't actually add "real" locals, but the implementation is such that the only way you could dynamically check said "locals" will happen to work, sometimes, on the CPython interpreter; when you first call `locals()` in a function it makes a mirror of the actual locals as a `dict` and caches it; if you call `locals()` again in the same function, you get the same `dict`. But it's not actually resizing the array of "real" locals. – ShadowRanger May 24 '18 at 00:25
48

Aside from local/global variable store times, opcode prediction makes the function faster.

As the other answers explain, the function uses the STORE_FAST opcode in the loop. Here's the bytecode for the function's loop:

    >>   13 FOR_ITER                 6 (to 22)   # get next value from iterator
         16 STORE_FAST               0 (x)       # set local variable
         19 JUMP_ABSOLUTE           13           # back to FOR_ITER

Normally when a program is run, Python executes each opcode one after the other, keeping track of the stack and performing other checks on the stack frame after each opcode is executed. Opcode prediction means that in certain cases Python is able to jump directly to the next opcode, thus avoiding some of this overhead.

In this case, every time Python sees FOR_ITER (the top of the loop), it will "predict" that STORE_FAST is the next opcode it has to execute. Python then peeks at the next opcode and, if the prediction was correct, it jumps straight to STORE_FAST. This has the effect of squeezing the two opcodes into a single opcode.

On the other hand, the STORE_NAME opcode is used in the loop at the global level. Python does *not* make similar predictions when it sees this opcode. Instead, it must go back to the top of the evaluation loop, which has obvious implications for the speed at which the loop is executed.

To give some more technical detail about this optimization, here's a quote from the ceval.c file (the "engine" of Python's virtual machine):

Some opcodes tend to come in pairs thus making it possible to predict the second code when the first is run. For example, GET_ITER is often followed by FOR_ITER. And FOR_ITER is often followed by STORE_FAST or UNPACK_SEQUENCE.

Verifying the prediction costs a single high-speed test of a register variable against a constant. If the pairing was good, then the processor's own internal branch predication has a high likelihood of success, resulting in a nearly zero-overhead transition to the next opcode. A successful prediction saves a trip through the eval-loop including its two unpredictable branches, the HAS_ARG test and the switch-case. Combined with the processor's internal branch prediction, a successful PREDICT has the effect of making the two opcodes run as if they were a single new opcode with the bodies combined.

We can see in the source code for the FOR_ITER opcode exactly where the prediction for STORE_FAST is made:

case FOR_ITER:                         // the FOR_ITER opcode case
    v = TOP();
    x = (*v->ob_type->tp_iternext)(v); // x is the next value from iterator
    if (x != NULL) {                     
        PUSH(x);                       // put x on top of the stack
        PREDICT(STORE_FAST);           // predict STORE_FAST will follow - success!
        PREDICT(UNPACK_SEQUENCE);      // this and everything below is skipped
        continue;
    }
    // error-checking and more code for when the iterator ends normally                                     

The PREDICT macro expands to `if (*next_instr == op) goto PRED_##op`, i.e. we just jump to the start of the predicted opcode. In this case, we jump here:

PREDICTED_WITH_ARG(STORE_FAST);
case STORE_FAST:
    v = POP();                     // pop x back off the stack
    SETLOCAL(oparg, v);            // set it as the new local variable
    goto fast_next_opcode;

The local variable is now set and the next opcode is up for execution. Python continues through the iterable until it reaches the end, making the successful prediction each time.

The Python wiki page has more information about how CPython's virtual machine works.

Alex Riley
  • Minor update: As of CPython 3.6, the savings from prediction go down a bit; instead of two unpredictable branches, there is only one. The change is due to [the switch from bytecode to wordcode](https://docs.python.org/3/whatsnew/3.6.html#optimizations); now all "wordcodes" have an argument, it's just zero-ed out when the instruction doesn't logically take an argument. Thus, the `HAS_ARG` test never occurs (except when low level tracing is enabled both at compile and runtime, which no normal build does), leaving only one unpredictable jump. – ShadowRanger Mar 29 '19 at 14:56
  • Even that unpredictable jump doesn't happen in most builds of CPython, because of the new ([as of Python 3.1](https://docs.python.org/3/whatsnew/3.1.html#optimizations), [enabled by default in 3.2](https://docs.python.org/3/whatsnew/3.2.html#build-and-c-api-changes)) computed gotos behavior; when used, the `PREDICT` macro is completely disabled; instead most cases end in a `DISPATCH` that branches directly. But on branch predicting CPUs, the effect is similar to that of `PREDICT`, since branching (and prediction) is per opcode, increasing the odds of successful branch prediction. – ShadowRanger Mar 29 '19 at 15:03