79

I have encountered this weird behavior and failed to explain it. These are the benchmarks:

py -3 -m timeit "tuple(range(2000)) == tuple(range(2000))"
10000 loops, best of 3: 97.7 usec per loop
py -3 -m timeit "a = tuple(range(2000));  b = tuple(range(2000)); a==b"
10000 loops, best of 3: 70.7 usec per loop

How come the comparison with variable assignment is faster than the one-liner using temporary values, by more than 27%?

According to the Python docs, garbage collection is disabled during timeit, so it can't be that. Is it some sort of optimization?

The results may also be reproduced in Python 2.x, though to a lesser extent.

Running Windows 7/10, CPython 3.5.1 all the way to 3.10.1, Intel i7 3.40 GHz, 64 bit both OS and Python. A different machine I've tried, an Intel i7 3.60 GHz with Python 3.5.0, does not seem to reproduce the results.


Running both statements within the same Python process with timeit.timeit() @ 10000 loops produced 0.703 and 0.804 respectively. The effect still shows, although to a lesser extent (~12.5%).
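For reference, a minimal single-process harness along these lines (a sketch; absolute timings vary by machine and the gap may not reproduce at all):

```python
import timeit

# Time both variants inside one interpreter process, so OS-level
# scheduling differences between two invocations are ruled out.
without_vars = timeit.timeit(
    "tuple(range(2000)) == tuple(range(2000))", number=10000)
with_vars = timeit.timeit(
    "a = tuple(range(2000)); b = tuple(range(2000)); a == b", number=10000)

print(f"without variables: {without_vars:.3f}s, with variables: {with_vars:.3f}s")
```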

Bharel
  • I can't reproduce this. I get much the same times for both versions. Tried in Python 2 and Python 3. – khelwood Apr 11 '16 at 12:25
  • I can't reproduce your problem. I get `146 usec per loop` and `148 usec per loop` for 1000 loop, for first and second line, respectively. The two are pretty much the same. – aluriak Apr 11 '16 at 12:27
  • I __can__ reproduce this with Python 3.4.3 on an AMD CPU with Windows 10 - I wonder if it's CPU or OS specific? [Here's my output.](http://i.imgur.com/T2T9AOv.png) – Aaron Christiansen Apr 11 '16 at 12:31
  • I can also reproduce the question on Fedora 21. I can also narrow it down a bit, either replacing `==` with `is`, or simply using a comma operator to create a tuple with the two values will reduce the times from (in my case) 104 and 92 micro seconds to 68 and 56. So it isn't a memory caching issue with the comparison but something directly related to creating the tuples. – Duncan Apr 11 '16 at 12:39
  • 6
    Compare `dis.dis("tuple(range(2000)) == tuple(range(2000))")` to `dis.dis("a = tuple(range(2000)); b = tuple(range(2000)); a==b")`. In my configuration second snippet actually contains **all of bytecode** from first one, and some additional instructions. It's hard to believe that more bytecode instructions results in faster execution. Maybe it's some bug in specific Python version? – Łukasz Rogalski Apr 11 '16 at 12:42
  • The output is `best of 3: …`. It's important, because if the returned value is the *best* of three runs, then statistically it means nothing on the mean time. Using a script using timeit on functions, i get a average time of 0.17 for both. – aluriak Apr 11 '16 at 12:42
  • @Rogalski that's exactly what I thought but the results speak for themselves – Bharel Apr 11 '16 at 12:43
  • @Bharel Have you checked bytecodes on your machine? – Łukasz Rogalski Apr 11 '16 at 12:44
  • Another variation on the theme: `"a=b=1; a = tuple(range(2000)); b = tuple(range(2000)); a,b"` 68µS, `"a = tuple(range(2000)); b = tuple(range(2000)); a,b"` 56µS. Also the difference gets proportionally larger if you bump the numbers up to 20,000 or 200,000 – Duncan Apr 11 '16 at 12:45
  • 1
    If you attempt to reproduce this, please run the test multiple times in different execution orders. – Regardless of the result and the odd-ness of this, I think the question is not particularly valuable for SO. – poke Apr 11 '16 at 12:49
  • 3
    I think this is pretty interesting. @poke you need to remember that an answer to a similar phenomenon is now the most upvoted answer in stackoverflow. – Antti Haapala -- Слава Україні Apr 11 '16 at 12:50
  • Is the difference between 97 and 70 microseconds significative ? – aluriak Apr 11 '16 at 12:52
  • 3
    Also, please try to run the test in a single Python process using the `timeit` module directly. Comparisons between two separate Python processes might be affected by the operating system’s task scheduler or other effects. – poke Apr 11 '16 at 12:52
  • 3.4 the difference is much more pronounced than 3.5 on the same computer – Antti Haapala -- Слава Україні Apr 11 '16 at 12:59
  • @Duncan dis of the second seems to contain the entire first. Same as you. – Bharel Apr 11 '16 at 13:06
  • I think the big lesson here, regardless of what happens is that: don't **ever** try to write complicated code avoiding a couple extra variables just because you think that would take more time or resources. – jsbueno Apr 12 '16 at 22:32
  • 1
    @aluriak "best of 3" means best of three *averages*. This is done because some average might include, say, an unexpected process stall. Taking the best of averages avoids that. – Veedrac Apr 16 '16 at 18:23
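The bytecode comparison suggested in the comments can be reproduced like this (exact opcodes vary across Python versions; the snippet with variables contains the one-liner's instructions plus extra STORE_NAME/LOAD_NAME pairs):

```python
import dis

one_liner = "tuple(range(2000)) == tuple(range(2000))"
with_vars = "a = tuple(range(2000)); b = tuple(range(2000)); a==b"

print("one-liner:")
dis.dis(one_liner)
print("with variables:")
dis.dis(with_vars)
```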

2 Answers

113

My results were similar to yours: the code that uses intermediate variables was consistently at least 10-20 % faster in Python 3.4. However when I used IPython on the very same Python 3.4 interpreter, I got these results:

In [1]: %timeit -n10000 -r20 tuple(range(2000)) == tuple(range(2000))
10000 loops, best of 20: 74.2 µs per loop

In [2]: %timeit -n10000 -r20 a = tuple(range(2000));  b = tuple(range(2000)); a==b
10000 loops, best of 20: 75.7 µs per loop

Notably, I never managed to get even close to the 74.2 µs for the former when I used -mtimeit from the command line.

So this Heisenbug turned out to be something quite interesting. I decided to run the command with strace and indeed there is something fishy going on:

% strace -o withoutvars python3 -m timeit "tuple(range(2000)) == tuple(range(2000))"
10000 loops, best of 3: 134 usec per loop
% strace -o withvars python3 -mtimeit "a = tuple(range(2000));  b = tuple(range(2000)); a==b"
10000 loops, best of 3: 75.8 usec per loop
% grep mmap withvars|wc -l
46
% grep mmap withoutvars|wc -l
41149

Now that is a good reason for the difference. The code that does not use variables causes the mmap system call to be invoked almost 1000 times more often than the code that uses intermediate variables.

The withoutvars trace is full of mmap/munmap calls for a 256 KiB region; these same lines are repeated over and over again:

mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
munmap(0x7f32e56de000, 262144)          = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
munmap(0x7f32e56de000, 262144)          = 0
mmap(NULL, 262144, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f32e56de000
munmap(0x7f32e56de000, 262144)          = 0

The mmap call seems to come from the function _PyObject_ArenaMmap in Objects/obmalloc.c; obmalloc.c also contains the macro ARENA_SIZE, which is #defined as (256 << 10) (that is, 262144); similarly, the munmap matches _PyObject_ArenaMunmap from obmalloc.c.

A comment in obmalloc.c says:

Prior to Python 2.5, arenas were never free()'ed. Starting with Python 2.5, we do try to free() arenas, and use some mild heuristic strategies to increase the likelihood that arenas eventually can be freed.

Thus these heuristics, together with the fact that the Python object allocator releases free arenas as soon as they are emptied, lead to python3 -m timeit 'tuple(range(2000)) == tuple(range(2000))' triggering pathological behaviour where one 256 KiB memory area is allocated and released repeatedly. This allocation happens with mmap/munmap, which is comparatively costly since they are system calls; furthermore, mmap with MAP_ANONYMOUS requires that the newly mapped pages be zeroed, even though Python wouldn't care.

The behaviour is not present in the code that uses intermediate variables, because it uses slightly more memory, and no memory arena can be freed while some objects are still allocated in it. That is because timeit turns the statement into a loop not unlike

for n in range(10000):
    a = tuple(range(2000))
    b = tuple(range(2000))
    a == b

Now the behaviour is that both a and b stay bound until they are reassigned, so in the second iteration tuple(range(2000)) will allocate a third tuple, and the assignment a = tuple(...) will decrease the reference count of the old tuple, causing it to be released, and increase the reference count of the new tuple; then the same happens to b. Therefore after the first iteration there are always at least two of these tuples alive, if not three, so the thrashing doesn't occur.

Most notably, it cannot be guaranteed that the code using intermediate variables is always faster; indeed, in some setups using intermediate variables might result in extra mmap calls, whereas the code that compares return values directly might be fine.


Someone asked why this happens when timeit disables garbage collection. It is indeed true that timeit does so:

Note

By default, timeit() temporarily turns off garbage collection during the timing. The advantage of this approach is that it makes independent timings more comparable. The disadvantage is that GC may be an important component of the performance of the function being measured. If so, GC can be re-enabled as the first statement in the setup string. For example:

However, Python's garbage collector is only there to reclaim cyclic garbage, i.e. collections of objects whose references form cycles. That is not the case here; these objects are freed immediately when their reference count drops to zero.
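A small sketch makes the distinction concrete: a non-cyclic object is reclaimed by reference counting even while gc is disabled, whereas a reference cycle survives until the collector runs:

```python
import gc
import weakref

class Node:
    pass

gc.disable()  # the same state timeit puts the interpreter in

# Non-cyclic object: freed the moment its refcount hits zero.
obj = Node()
probe = weakref.ref(obj)
del obj
assert probe() is None  # already reclaimed, no collector needed

# Cyclic garbage: stays alive until the cyclic collector runs.
a, b = Node(), Node()
a.other, b.other = b, a
probe = weakref.ref(a)
del a, b
assert probe() is not None  # the cycle keeps both objects alive

gc.enable()
gc.collect()
assert probe() is None  # the collector reclaimed the cycle
```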

  • 1
    Woha, that's interesting. Shouldn't the garbage collector (which is disabled on timeit) take care of freeing or at least should take care of it? And it raises another question: Aren't those repeated calls a bug? – Bharel Apr 11 '16 at 14:55
  • 6
    @Bharel more like "broken as designed" – Antti Haapala -- Слава Україні Apr 11 '16 at 14:56
  • last but not least, how would this explain the difference between systems, even with the same OS? – Bharel Apr 11 '16 at 15:24
  • 1
    @Bharel It depends whether or not a *new* memory arena is allocated; it is quite possible that other systems have partially free arenas that have enough free memory in the pools that more is not required. Even the same Python version on superficially similar systems might have differing behaviour - things like Python installation path, number of packages in `site-packages`, environment variables, current working directory - they all affect the memory layout of the process. – Antti Haapala -- Слава Україні Apr 11 '16 at 15:32
  • 7
    @Bharel: The garbage collector in CPython is more properly called the "cyclic garbage collector"; it's solely concerned with freeing isolated reference cycles, not general garbage collection. All other cleanup is synchronous and in-order; if the last reference to the last object in an arena is released, the object is immediately deleted, and the arena immediately freed, no cyclic garbage collector involvement required. That's why it's legal to disable `gc`; if it disabled general cleanup, you'd run out of memory pretty darn quick. – ShadowRanger Apr 11 '16 at 20:34
  • I can't reproduce it on Ubuntu. [There is no difference](https://mybinder.org/v2/gh/zed/jupyter-notebooks/master?filepath=ipynb%2Fwhy-the-code-using-intermediate-variables-is-faster-than-the-code-without.ipynb) – jfs Dec 27 '17 at 15:48
  • @jfs so you think that when I say it specifically had to be run **from** command prompt, that you'd reproduce it with these parameters from an IPython notebook? – Antti Haapala -- Слава Україні Dec 27 '17 at 16:30
  • @AnttiHaapala there is no difference from the command-line too. – jfs Dec 27 '17 at 16:34
  • @jfs Ubuntu 17.10 using Python 3.6. 124-126 µsec vs 95-100 µsec. Freshly tested – Antti Haapala -- Слава Україні Dec 27 '17 at 16:39
  • Seems not to be triggered on Ubuntu 16.04 with Python 3.5.2. But is on Ubuntu 14.04 with 3.4.3... the tuning is a bit sensitive. – Antti Haapala -- Слава Україні Dec 27 '17 at 16:42
  • @jfs The interesting part is that OP spotted the behaviour on Windows and it could be reproduced in Linux – Antti Haapala -- Слава Україні Dec 27 '17 at 16:48
  • @AnttiHaapala I can't reproduce it. Ubuntu 16.04 Python 3.5, Python 3.6. The number of mmap calls is about the same in both cases (~40-60). – jfs Dec 27 '17 at 17:35
  • @jfs do you still have Python 3.4? I could - somewhat surprisingly - reproduce it in Python 3.4 on 16.04. Did you install from Deadsnakes by happenstance? – Antti Haapala -- Слава Україні Dec 27 '17 at 17:43
  • @AnttiHaapala I also tried it on python3.7 compiler from source. The result is the same: no difference. I've managed to reproduce the number of mmap calls using python compiled by pythonz (3.4.6, 3.6.3, 3.6.4) -- ~4000 withoutvars case, ~50 withvars. – jfs Dec 27 '17 at 17:47
  • 1
    to summarize: the effect in the answer is not reproducible (no significant difference in the number of mmap calls) on the default `/usr/bin/python3` distributed with Ubuntu 16.04 (`python3-minimal` package). I also tried various docker images e.g., `docker run --rm python:3.6.4 python -m timeit ...` -- no effect (including 3.4). The behavior in your answer is reproducible if python is compiled from source (e.g., 3.6.4-d48eceb, but no effect on 3.7-e3256087) – jfs Dec 29 '17 at 09:03
  • 2
    @jfs: The difference in behavior is largely about variations in startup behavior; if the startup creates and pollutes enough memory arenas (but doesn't pollute them enough to prevent contiguous allocation of the `tuple`s in what remains), then those arenas will never be freed, and will always available for reuse. The default `python3` on your system likely still ships with at least a few things in the system site packages directory, and the implicit `import site` done by Python will do all sorts of work that would trigger allocations/frees. – ShadowRanger Sep 13 '19 at 13:57
  • 2
    For example, I reproduce the OP's observed performance discrepancy (~123 usec without names, ~96 usec with names) when I run with `-E -S` (which disables all checks for Python environment variables and blocks importing `site` completely), but without the `-S`, the runtime is essentially identical (both ~96 usec). Running it within `ipython` achieves a similar effect because `ipython` itself does tons of stuff at startup which produces a bunch of arenas with isolated long lived objects to keep them alive for reuse. Point is, the problem isn't fixed on *any* system, it's just usually masked. – ShadowRanger Sep 13 '19 at 14:04
7

The first question here has to be: is it reproducible? For some of us at least it definitely is, though other people say they aren't seeing the effect. This is on Fedora, with the equality test changed to is (since actually doing a comparison seems irrelevant to the result) and the range pushed up to 200,000, as that seems to maximise the effect:

$ python3 -m timeit "a = tuple(range(200000));  b = tuple(range(200000)); a is b"
100 loops, best of 3: 7.03 msec per loop
$ python3 -m timeit "a = tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 10.2 msec per loop
$ python3 -m timeit "tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 10.2 msec per loop
$ python3 -m timeit "a = b = tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 9.99 msec per loop
$ python3 -m timeit "a = b = tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 10.2 msec per loop
$ python3 -m timeit "tuple(range(200000)) is tuple(range(200000))"
100 loops, best of 3: 10.1 msec per loop
$ python3 -m timeit "a = tuple(range(200000));  b = tuple(range(200000)); a is b"
100 loops, best of 3: 7 msec per loop
$ python3 -m timeit "a = tuple(range(200000));  b = tuple(range(200000)); a is b"
100 loops, best of 3: 7.02 msec per loop

I note that variations between runs, and the order in which the expressions are run, make very little difference to the result.

Adding assignments to a and b in the slow version doesn't speed it up. In fact, as we might expect, assigning to local variables has negligible effect. The only thing that does speed it up is splitting the expression entirely in two. The only difference this should make is that it reduces the maximum stack depth used by Python while evaluating the expression (from 4 to 3).
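The stack depths can be checked via the co_stacksize attribute of the compiled code objects (a sketch; the 4-vs-3 figures correspond to Python 3.4-era bytecode, and newer versions differ in absolute numbers while keeping the same ordering):

```python
# Compile both variants and compare the stack space each needs.
one_liner = compile(
    "tuple(range(200000)) is tuple(range(200000))", "<test>", "exec")
split = compile(
    "a = tuple(range(200000))\nb = tuple(range(200000))\na is b",
    "<test>", "exec")

print(one_liner.co_stacksize, split.co_stacksize)
```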

That gives us the clue that the effect is related to stack depth; perhaps the extra level pushes the stack across into another memory page. If so, we should see that other changes affecting the stack alter (and most likely kill) the effect, and in fact that is what we see:

$ python3 -m timeit -s "def foo():
   tuple(range(200000)) is tuple(range(200000))" "foo()"
100 loops, best of 3: 10 msec per loop
$ python3 -m timeit -s "def foo():
   tuple(range(200000)) is tuple(range(200000))" "foo()"
100 loops, best of 3: 10 msec per loop
$ python3 -m timeit -s "def foo():
   a = tuple(range(200000));  b = tuple(range(200000)); a is b" "foo()"
100 loops, best of 3: 9.97 msec per loop
$ python3 -m timeit -s "def foo():
   a = tuple(range(200000));  b = tuple(range(200000)); a is b" "foo()"
100 loops, best of 3: 10 msec per loop

So, I think the effect is entirely due to how much of the Python stack is consumed during the timing process. It is still weird, though.

Duncan
  • 2 machines with the same memory sticks and same OS cause different results though. Stack depth sounds like a good theory but it does not explain the difference between machines. – Bharel Apr 11 '16 at 13:30