
I create two long lists that repeat two different values. In the first list the values alternate; in the second, all occurrences of one value come before all occurrences of the other:

a1 = [object(), object()] * 10**6
a2 = a1[::2] + a1[1::2]
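
To make the two orderings concrete at a tiny scale (the names `x`, `y`, `b1`, `b2` are just for this illustration):

```python
x, y = object(), object()
b1 = [x, y] * 3          # alternating: x, y, x, y, x, y
b2 = b1[::2] + b1[1::2]  # grouped:     x, x, x, y, y, y

assert b1 == [x, y, x, y, x, y]
assert b2 == [x, x, x, y, y, y]
```

Both lists hold references to the same two objects; they differ only in the order of those references.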

Then I iterate them, doing nothing with them:

for _ in a1: pass
for _ in a2: pass

Which of the two gets iterated faster? That depends on how I measure! I ran 50 races with each timing method:

timing method: timeit.timeit
  list a1 won 50 times
  list a2 won 0 times

timing method: timeit.default_timer
  list a1 won 0 times
  list a2 won 50 times

Why do the two timing methods give me completely opposite results? And why are there speed differences between the two lists at all? I'd expect something closer to 25-25 for both methods, not 50-0 and 0-50.

Reasons from previous similar questions and why I don't think they're responsible here:

  • branch prediction: I don't have any branching that could make a difference.
  • cache misses: Shouldn't happen, as I only have two values (and I don't even do anything with them).
  • garbage collection: Not involved here.
  • None of these would explain the opposite speed differences between the timing methods anyway.
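
One way to double-check the garbage-collection point (a sketch, assuming only that switching the cyclic collector off removes any GC pauses from the measurement):

```python
import gc
from timeit import default_timer

x, y = object(), object()
a = [x, y] * 10**5  # smaller than the question's lists, for speed

gc.disable()  # rule out cyclic-GC pauses during the timed loop
try:
    start = default_timer()
    for _ in a: pass
    elapsed = default_timer() - start
finally:
    gc.enable()

print(elapsed)
```

If the 50-0 / 0-50 pattern survives with the collector off, GC can be ruled out.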

I'm using Python 3.10, same results on a weak Windows laptop and a strong Linux Google Compute Engine instance.

Full code:

from timeit import timeit, default_timer

a1 = [object(), object()] * 10**6
a2 = a1[::2] + a1[1::2]

for method in 'timeit', 'default_timer':
    wins1 = wins2 = 0

    for _ in range(50):

        time1 = time2 = float('inf')
        for _ in range(5):

            if method == 'timeit':
                t1 = timeit('for _ in a1: pass', 'from __main__ import a1', number=1)
                t2 = timeit('for _ in a2: pass', 'from __main__ import a2', number=1)
            else:
                start = default_timer()
                for _ in a1: pass
                end = default_timer()
                t1 = end - start

                start = default_timer()
                for _ in a2: pass
                end = default_timer()
                t2 = end - start

            time1 = min(time1, t1)
            time2 = min(time2, t2)

        wins1 += time1 < time2
        wins2 += time2 < time1

    print(f'timing method: timeit.{method}')
    print(f'  list a1 won {wins1} times')
    print(f'  list a2 won {wins2} times')
    print()
    Okay, after reading your question fully, it might be in order to clarify that one was not faster than the other by a *factor* of 50, but that after racing them against each other 50 times, one won 50 times and the other won 0 times. – mkrieger1 Nov 16 '21 at 16:43
    You could reduce systematic effects by choosing the order in which the measurements are done randomly. Then the question might not be "why is x faster than y" but "why is the first measurement faster than the second measurement" – mkrieger1 Nov 16 '21 at 16:46
    @mkrieger1 Thanks, changed. Is it clear now? – no comment Nov 16 '21 at 16:47
  • BTW I get a different result: 50-0 and 50-0 in both timing methods. – mkrieger1 Nov 16 '21 at 16:50
  • @mkrieger1 Random sounds more complicated, but I did `a1, a2 = a2, a1` now right after their creation, and the results flipped accordingly (`50-0 + 0-50` became `0-50 + 50-0`). – no comment Nov 16 '21 at 16:50
    One thing I want to point out right away in case this isn't obvious - your lists aren't alternating values, but instead references to two objects. You have two objects in memory, and both of your lists simply contain references which point to those same objects. Why the ordering of those references causes different timer methods to measure differently, I'm not sure, I would need to investigate both methods deeply. – Peter White Nov 16 '21 at 16:51
  • After choosing the measurement order between `t1` and `t2` randomly each time, `a1` still consistently wins with 50-0 and 47-3 for me. – mkrieger1 Nov 16 '21 at 16:54
  • @diggusbickus How mixed? – no comment Nov 16 '21 at 16:55
  • @mkrieger1 At [tio.run](https://tio.run/##rVLdTsMgFL7nKc5dy1bNwGhMtU@yLA1zdGJa2gDLYpY9e@X0Z1tl0wvlhkPP93OgX/Pp3mv98NyYti1MXYFTlVQOVNXUxg2nBDayELvS5Xg2hAgGGSzr9Yd8czFNYKxWMAO2mM2eiOAeIdgyTfkK5lgxLAkpagOV9KYbUBqi3iBKIJpYRCkBv/ZKW7TCHQUXpPuMGjnSjdBbGT8uaNo3cCEfObgjpyhr4eJI6SKiJ9B3hUsBXKoYh8yy05DpBNJ5jUbKO5w0BUuhEdbirbo3zfNKKJ3n46sK5lt6V62lyRgNVflVVf6bKr@pKksrw@mtE56WTX9uTF8CYHCzACH1JhS6/lwIveu9yX@MxP82Ev9ppDFLldJxVyf@DjSA8AsI9xBOzzJ9hOfZoPXaEyZtPrb50GY9vTFKu7jA@Cm9HQKZDsm4P/Tn45DqEQxQKosRg32t4dDZHzuOvYHkZyS/hqRt@wU) I usually get "similar" results as mine, last run 36-14 and 0-50. – no comment Nov 16 '21 at 16:58
  • like never 50-0 (most in the range 15~35) – diggusbickus Nov 16 '21 at 17:00
  • @diggusbickus What kind of computer and what Python version are you using? – no comment Nov 16 '21 at 17:06
  • @mkrieger1 Can you share your randomized code so I could try it on my computers? – no comment Nov 16 '21 at 17:06
  • I used [this](https://dpaste.org/OF1v) which also shows mean values of the timings and the random order can be weighted by a command line argument. – mkrieger1 Nov 16 '21 at 17:18
  • @mkrieger1 I ran your code on the GCE instance, got 50-0 and 0-50 again. – no comment Nov 16 '21 at 17:28
  • @mkrieger1 I don't have numpy (or even pip) installed there, so I used `from statistics import mean, stdev as std` instead and got [these results](https://dpaste.org/0UbQ). What means/stds do you get? – no comment Nov 16 '21 at 17:35
  • I get similar numbers, but they vary from run to run. – mkrieger1 Nov 16 '21 at 17:59
  • I think you are measuring the impact of low-level effect of the CPython internal implementation. I think it is pointless to perform such micro-benchmarking with CPython as the results will probably be very dependent of the target version, the target hardware, and in the end, not very useful, because you measure the *overhead of the interpreter* end not really the code itself. – Jérôme Richard Nov 16 '21 at 19:09
  • @JérômeRichard Maybe / maybe for you, but I'm still interested in finding out the reason(s) and hope to learn something from this. Already did, actually. – no comment Nov 16 '21 at 19:24
    @JérômeRichard (continued) See [this answer](https://stackoverflow.com/a/42110976/16759116) to yet another similar question. That also observed a single repeated value being slower than two values repeated alternatingly, and theorizes that "pipeline stalls" might be the culprit. If that's one of the reasons, then I'd learn something about *that* here, which would not just be knowledge about CPython. – no comment Nov 16 '21 at 19:36
    @JérômeRichard (continued) And if you run mkrieger1's version, you'll likely see that the first timing method leads to times about 3 times (!) faster, and the reason isn't a CPython implementation detail likely to change or a hardware issue but a real difference between the timing methods that I find good to know and believe can have a real impact on *normal* code. – no comment Nov 16 '21 at 19:57
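
The randomized measurement order suggested in the comments can be sketched like this (the loops stay at module level so the loop target `_` remains a global, as in the original):

```python
import random
from timeit import default_timer

x, y = object(), object()
a1 = [x, y] * 10**5            # smaller than the question's 10**6, for speed
a2 = a1[::2] + a1[1::2]

wins1 = wins2 = 0
for _trial in range(50):
    # A coin flip decides which list is timed first this trial.
    order = (a1, a2) if random.random() < 0.5 else (a2, a1)
    for lst in order:
        start = default_timer()
        for _ in lst: pass
        elapsed = default_timer() - start
        if lst is a1:
            t1 = elapsed
        else:
            t2 = elapsed
    wins1 += t1 < t2
    wins2 += t2 < t1

print(wins1, wins2)
```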

1 Answer


Not "an answer", but a clue: add

def f(a):
    for _ in a: pass

and then in the default_timer branch replace

for _ in a1: pass

with f(a1), and similarly for the iteration over a2.
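
Put together, the modification looks roughly like this (a self-contained sketch, with smaller lists for brevity):

```python
from timeit import default_timer

def f(a):
    # The loop now runs inside a function, so its target _ is a
    # local variable (STORE_FAST) rather than a module global.
    for _ in a: pass

x, y = object(), object()
a1 = [x, y] * 10**5
a2 = a1[::2] + a1[1::2]

start = default_timer()
f(a1)
t1 = default_timer() - start

start = default_timer()
f(a2)
t2 = default_timer() - start

print(t1, t2)
```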

Two things I see then:

  1. The output changes to more uniform results:

     timing method: timeit.timeit
       list a1 won 50 times
       list a2 won 0 times

     timing method: timeit.default_timer
       list a1 won 50 times
       list a2 won 0 times

  2. It's no longer the case that timeit runs faster than default_timer.

#2 is so obvious it may be hard to notice why ;-) In the original, the for-loop target _ is a module global, so updating it on each iteration requires a dict lookup and store. But timeit builds a function out of the stuff passed to it, so in the code timeit runs, _ is a local variable instead, and a faster STORE_FAST indexed store suffices.
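
The bytecode difference can be seen directly with the `dis` module (a small demonstration; `module_ops` and `func_ops` are names for this demo only):

```python
import dis

# At module level, the loop target _ is stored with STORE_NAME (dict store).
module_code = compile("for _ in a: pass", "<module-level>", "exec")
module_ops = {ins.opname for ins in dis.get_instructions(module_code)}

def f(a):
    # Inside a function, _ is stored with the cheaper STORE_FAST
    # (indexed store into the frame's local-variable array).
    for _ in a: pass

func_ops = {ins.opname for ins in dis.get_instructions(f)}

print("STORE_NAME" in module_ops)  # True
print("STORE_FAST" in func_ops)    # True
```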

Function f() was introduced so that the "heavy lifting" is done in both cases on local variables.

The code being timed here does so little that the difference between a dict lookup and an index-a-vector-with-an-int operation can be highly significant.

Tim Peters
  • I think this is a complete answer to the question. Good job. See [this code](https://pastebin.com/pCTJudJG) to test the impact of using a global variable instead of a local one on performance. The assembly difference can be seen [here](https://godbolt.org/z/EE1vjT37n). For more information, please read [Performance with global variables vs local](https://stackoverflow.com/questions/12590058/performance-with-global-variables-vs-local). One solution to fix the issue should be just to call the code from a `main` function. – Jérôme Richard Nov 17 '21 at 09:38
  • @JérômeRichard Not complete, only covers the "easy" part. Still unclear why the alternating list is faster with a local variable but slower with a global variable. – no comment Nov 17 '21 at 12:58
  • @nocomment The local version uses a `STORE_FAST` instruction which is faster than the `STORE_NAME` global alternative. The `xxx_FAST` version are faster because of the internals of CPython: the local variables are stored in a vector that can be directly indexed by an integer while globals are fetched from a slow dictionary which is typically a hashmap with string keys (indicated shortly in the answer). If you want to now more about that you can read [this](https://tenthousandmeters.com/blog/python-behind-the-scenes-5-how-variables-are-implemented-in-cpython/) (about the CPython code). – Jérôme Richard Nov 17 '21 at 15:44
  • @JérômeRichard I know all that. You're missing the issue. Let me rephrase: Why is `a2` *slower* than `a1` when using *local* `_` with both, and why is `a2` *faster* than `a1` when using *global* `_` with both? – no comment Nov 17 '21 at 16:05
  • @nocomment, not everyone sees the same outcomes we saw. That means we're probably seeing effects peculiar to the CPU in use. In which case, Python is one of the worst languages on the planet in which to pursue it :-) Assembler would be ideal, or at worst very simple C code. My interest in pursuing this ended after determining that "the first timing method leads to times about 3 times (!) faster" was due to that the two timing methods actually ran code that differs - _at_ the Python level - in a supremely relevant way (as this answer spelled out). – Tim Peters Nov 17 '21 at 18:40