
From this page, we know that:

Chained comparisons are faster than using the and operator. Write x < y < z instead of x < y and y < z.

However, I got a different result when testing the following code snippets:

$ python -m timeit "x = 1.2" "y = 1.3" "z = 1.8" "x < y < z"
1000000 loops, best of 3: 0.322 usec per loop
$ python -m timeit "x = 1.2" "y = 1.3" "z = 1.8" "x < y and y < z"
1000000 loops, best of 3: 0.22 usec per loop
$ python -m timeit "x = 1.2" "y = 1.3" "z = 1.1" "x < y < z"
1000000 loops, best of 3: 0.279 usec per loop
$ python -m timeit "x = 1.2" "y = 1.3" "z = 1.1" "x < y and y < z"
1000000 loops, best of 3: 0.215 usec per loop

It seems that x < y and y < z is faster than x < y < z. Why?

After searching some posts on this site (like this one), I know that "evaluated only once" is the key for x < y < z; however, I'm still confused. To investigate further, I disassembled these two functions using dis.dis:

import dis

def chained_compare():
        x = 1.2
        y = 1.3
        z = 1.1
        x < y < z

def and_compare():
        x = 1.2
        y = 1.3
        z = 1.1
        x < y and y < z

print("## chained_compare ##")
dis.dis(chained_compare)
print("## and_compare ##")
dis.dis(and_compare)

And the output is:

## chained_compare ##

  4           0 LOAD_CONST               1 (1.2)
              3 STORE_FAST               0 (x)

  5           6 LOAD_CONST               2 (1.3)
              9 STORE_FAST               1 (y)

  6          12 LOAD_CONST               3 (1.1)
             15 STORE_FAST               2 (z)

  7          18 LOAD_FAST                0 (x)
             21 LOAD_FAST                1 (y)
             24 DUP_TOP
             25 ROT_THREE
             26 COMPARE_OP               0 (<)
             29 JUMP_IF_FALSE_OR_POP    41
             32 LOAD_FAST                2 (z)
             35 COMPARE_OP               0 (<)
             38 JUMP_FORWARD             2 (to 43)
        >>   41 ROT_TWO
             42 POP_TOP
        >>   43 POP_TOP
             44 LOAD_CONST               0 (None)
             47 RETURN_VALUE

## and_compare ##

 10           0 LOAD_CONST               1 (1.2)
              3 STORE_FAST               0 (x)

 11           6 LOAD_CONST               2 (1.3)
              9 STORE_FAST               1 (y)

 12          12 LOAD_CONST               3 (1.1)
             15 STORE_FAST               2 (z)

 13          18 LOAD_FAST                0 (x)
             21 LOAD_FAST                1 (y)
             24 COMPARE_OP               0 (<)
             27 JUMP_IF_FALSE_OR_POP    39
             30 LOAD_FAST                1 (y)
             33 LOAD_FAST                2 (z)
             36 COMPARE_OP               0 (<)
        >>   39 POP_TOP
             40 LOAD_CONST               0 (None)
             43 RETURN_VALUE

It seems that x < y and y < z compiles to fewer disassembled instructions than x < y < z. Should I consider x < y and y < z faster than x < y < z?

Tested with Python 2.7.6 on an Intel(R) Xeon(R) CPU E5640 @ 2.67GHz.

zangw
  • More disassembled commands doesn't mean more complexity **or** slower code. However, seeing your `timeit` tests I got interested in this. – Marco Bonelli Dec 01 '15 at 07:58
  • I assumed the speed difference from "evaluated once" comes when `y` is not just a variable lookup, but a more expensive process like a function call? I.e. `10 < max(range(100)) < 15` is faster than `10 < max(range(100)) and max(range(100)) < 15` because `max(range(100))` is called once for both comparisons. – zehnpaard Dec 01 '15 at 08:05
  • @MarcoBonelli It **does** when the disassembled code 1) doesn't contain loops and 2) every single bytecode is very very fast, because at that point the overhead of the mainloop becomes significant. – Bakuriu Dec 01 '15 at 08:23
  • @Bakuriu yes, but I just pointed out that, **in general**, shorter doesn't mean faster. – Marco Bonelli Dec 01 '15 at 08:24
  • It's sad that CPython *still* doesn't do any real optimizations on its bytecode. A straightforward dead-code elimination pass would squish both functions to [LOAD_CONST(None); RETURN_VALUE]. A straightforward CSE/constprop pass would convert both versions of a non-vacuous variant (suppose `y` was an argument and the result of the comparison was returned) to the same sequence of bytecodes and we wouldn't be having this conversation. – zwol Dec 01 '15 at 14:10
  • Just to mention something that I got wrong on first guess: `x < y < z` still does short-circuiting if `x >= y`, so `x < y and y < z` (which more obviously avoids evaluating `z` if possible) doesn't win there. – Ben Millwood Dec 01 '15 at 16:38
  • Branch prediction might mess up your tests. – Corey Ogburn Dec 01 '15 at 18:03
  • @zehnpaard I agree with you. When "y" is more than a simple value (e.g., a function call or calculation) I'd expect the fact that "y" is evaluated only once in `x < y < z` to make the bigger difference. – MartyMacGyver Dec 03 '15 at 21:21

4 Answers


The difference is that in x < y < z, y is only evaluated once. This does not make a large difference if y is a variable, but it does when y is a function call that takes some time to compute.

from time import sleep
def y():
    sleep(.2)
    return 1.3
%timeit 1.2 < y() < 1.8
10 loops, best of 3: 203 ms per loop
%timeit 1.2 < y() and y() < 1.8
1 loops, best of 3: 405 ms per loop
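
To make the "evaluated once" behaviour visible without timing, here is a small sketch of my own (not part of the original answer) that simply counts how often y() gets called:

calls = []

def y():
    calls.append(1)              # record every evaluation of y()
    return 1.3

1.2 < y() < 1.8                  # chained comparison
print(len(calls))                # 1 -> y() was evaluated once

del calls[:]                     # reset the counter
1.2 < y() and y() < 1.8          # explicit and
print(len(calls))                # 2 -> y() was evaluated twice
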
Rob
  • Of course, there could also be a semantic difference. Not only could y() return two different values, but with a variable the evaluation of the less-than operator in x < y could change the value of y. This is why it's often not optimized away in the byte code (if y is a non-local variable or one participating in a closure, for example) – Random832 Dec 01 '15 at 20:51
  • Just curious, why did you need a `sleep()` inside the function? – Prof Dec 11 '15 at 22:15
  • @Prof That is to simulate a function which takes some time to compute the result. If the function returns right away, there will not be much difference between the two timeit results. – Rob Dec 14 '15 at 07:52
  • @Rob Why wouldn't there be much of a difference? It would be 3ms vs 205ms, that demonstrates it well enough doesn't it? – Prof Dec 14 '15 at 16:40
  • @Prof You are missing the point that y() is calculated twice, so that is 2x200ms instead of 1x200ms. The rest (3/5 ms) is irrelevant noise in the timing measurement. – Rob Dec 17 '15 at 14:00

Optimal bytecode for both of the functions you defined would be

          0 LOAD_CONST               0 (None)
          3 RETURN_VALUE

because the result of the comparison is not used. Let's make the situation more interesting by returning the result of the comparison. Let's also have the result not be knowable at compile time.

def interesting_compare(y):
    x = 1.1
    z = 1.3
    return x < y < z  # or: x < y and y < z

Again, the two versions of the comparison are semantically identical, so the optimal bytecode is the same for both constructs. As best I can work it out, it would look like this. I've annotated each line with the stack contents before and after each opcode, in Forth notation (top of stack at right, -- divides before and after, trailing ? indicates something that might or might not be there). Note that RETURN_VALUE discards everything that happens to be left on the stack underneath the value returned.

          0 LOAD_FAST                0 (y)    ;          -- y
          3 DUP_TOP                           ; y        -- y y
          4 LOAD_CONST               0 (1.1)  ; y y      -- y y 1.1
          7 COMPARE_OP               4 (>)    ; y y 1.1  -- y pred
         10 JUMP_IF_FALSE_OR_POP     19       ; y pred   -- y
         13 LOAD_CONST               1 (1.3)  ; y        -- y 1.3
         16 COMPARE_OP               0 (<)    ; y 1.3    -- pred
     >>  19 RETURN_VALUE                      ; y? pred  --

If an implementation of the language, CPython, PyPy, whatever, does not generate this bytecode (or its own equivalent sequence of operations) for both variations, that demonstrates the poor quality of that bytecode compiler. Getting from the bytecode sequences you posted to the above is a solved problem (I think all you need for this case is constant folding, dead code elimination, and better modeling of the contents of the stack; common subexpression elimination would also be cheap and valuable), and there's really no excuse for not doing it in a modern language implementation.
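
If you want to see what your own interpreter actually emits for the two spellings, a quick harness along these lines (my addition, not part of the answer; the exact opcodes depend on the CPython or PyPy version you run it under) prints both disassemblies for comparison with the hand-written bytecode above:

import dis

def chained(y):
    x = 1.1
    z = 1.3
    return x < y < z            # chained spelling of interesting_compare

def with_and(y):
    x = 1.1
    z = 1.3
    return x < y and y < z      # `and` spelling of interesting_compare

print("## chained ##")
dis.dis(chained)
print("## with_and ##")
dis.dis(with_and)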

Now, it happens that all current implementations of the language have poor-quality bytecode compilers. But you should ignore that while coding! Pretend the bytecode compiler is good, and write the most readable code. It will probably be plenty fast enough anyway. If it isn't, look for algorithmic improvements first, and give Cython a try second -- that will provide far more improvement for the same effort than any expression-level tweaks you might apply.

zwol
  • Well, the most important of all optimisations would have to be possible in the first place: inlining. And that's far from a "solved problem" for dynamic languages that allow changing the implementation of a function dynamically (doable though - HotSpot can do similar things, and things like Graal are working on making these kinds of optimisations available to Python and other dynamic languages). And since the function itself might be called from different modules (or a call might get dynamically generated!) you really can't do these optimisations there. – Voo Dec 01 '15 at 20:49
  • @Voo My hand-optimized bytecode has exactly the same semantics as the original even in the presence of arbitrary dynamism (one exception: a < b ≡ b > a is assumed). Also, inlining is overrated. If you do too much of it -- and it's far too easy to do too much of it -- you blow the I-cache and lose everything you gained and then some. – zwol Dec 01 '15 at 20:59
  • You're right, I thought you meant you wanted to optimize your `interesting_compare` to the simple bytecode at the top (which would only work with inlining). Completely offtopic but: inlining is one of the most essential optimisations for any compiler. You can try to run some benchmarks with say HotSpot on real programs (not some math tests that spend 99% of their time in one hot loop that's hand optimized [and hence doesn't have any more function calls anyhow]) when you disable its ability to inline anything - you'll see large regressions. – Voo Dec 01 '15 at 21:05
  • @Voo Yeah, the simple bytecode at the top was supposed to be the "optimal version of" the OP's original code, not `interesting_compare`. – zwol Dec 01 '15 at 21:08
  • "one exception: a < b ≡ b > a is assumed" → which is simply not true in Python. Plus, I think CPython can't even truly assume operations on `y` don't change the stack, since it has a lot of debug tools. – Veedrac Dec 08 '15 at 02:27
  • @Veedrac You're *supposed* to define comparison consistently; the bytecode could avoid the assumption with an OVER opcode (`a b -- a b a`) but there isn't one. The API used by `pdb` can, I think, muck with the operand stack, but things a debugger might do are generally considered ignorable for optimization purposes (and you'll note that that API is very vaguely documented and plastered over with 'implementation specific, may change at any time' warnings). – zwol Dec 08 '15 at 15:42
  • @zwol But not everyone does implement comparison consistently. The compiler certainly can't assume it. Further, CPython explicitly doesn't go the "ignore X for optimization purposes" route, since if you wanted a language that optimized at your expense you shouldn't be using CPython. If debug functionality didn't work consistently it would be a bug in CPython. – Veedrac Dec 08 '15 at 16:19
  • pypy's bytecode compiler may not produce optimal bytecode, but if you're using pypy, that's probably not the thing you care about, anyway. – lahwran Dec 21 '15 at 17:58

Since the difference in the output seems to be due to a lack of optimization, I think you should ignore that difference for most cases - it could be that the difference will go away. The difference arises because y should only be evaluated once, and that is solved by duplicating it on the stack, which requires an extra POP_TOP - a solution using LOAD_FAST might be possible, though.

The important difference, though, is that in x<y and y<z, y has to be evaluated a second time if x<y evaluates to true; this has implications if evaluating y takes considerable time or has side effects.

In most scenarios you should use x<y<z, despite the fact that it is somewhat slower here.

skyking

First of all, your comparison is pretty much meaningless because the two different constructs were not introduced to provide a performance improvement, so you shouldn't decide whether to use one in place of the other based on that.

The x < y < z construct:

  1. Is clearer and more direct in its meaning.
  2. Has the semantics you'd expect from the "mathematical meaning" of the comparison: evaluate x, y and z once and check whether the whole condition holds. Using and changes the semantics by evaluating y multiple times, which can change the result.
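
To make point 2 concrete, here is a small sketch of my own (not from the answer): when y is an expression with side effects, evaluating it a second time can flip the result.

from itertools import count

counter = count()              # yields 0, 1, 2, ... on successive calls

def y():
    return next(counter)       # a y with a side effect: each call returns a new value

print(-1 < y() < 1)            # chained: y() evaluated once (0), so True

counter = count()              # reset the side effect
print(-1 < y() and y() < 1)    # and: y() evaluated twice (0, then 1); 1 < 1 is False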

So choose one in place of the other depending on the semantics you want and, if they are equivalent, whether one is more readable than the other.

This said: more disassembled code does not imply slower code. However, executing more bytecode operations means that each operation is simpler, yet each one still requires an iteration of the interpreter's main loop. This means that if the operations you are performing are extremely fast (e.g. local variable lookups, as you are doing there), then the overhead of executing more bytecode operations can matter.

But note that this result does not hold in the more general situation, only in the "worst case" that you happen to profile. As others have noted, if you change y to something that takes even a bit more time, you'll see that the results change, because the chained notation evaluates it only once.

Summarizing:

  • Consider semantics before performance.
  • Take into account readability.
  • Don't trust micro-benchmarks. Always profile with different kinds of parameters to see how a function's or expression's timing behaves in relation to those parameters, and consider how you plan to use it.
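
As a rough starting point for such profiling, here is a timeit harness sketch of my own (not from the answer; it uses only the standard library, and expensive_y is a made-up stand-in) that times both spellings with a cheap y and with a y that costs something to compute:

import timeit

def expensive_y():
    return sum(range(100)) / 100.0   # stand-in for a y that takes real work

# The setup string imports expensive_y from __main__, so run this as a script
# (or paste it into an interactive session, whose namespace is also __main__).
setup = "from __main__ import expensive_y; y = 1.3"

cases = [
    ("cheap y, chained",  "1.2 < y < 1.8"),
    ("cheap y, and",      "1.2 < y and y < 1.8"),
    ("costly y, chained", "1.2 < expensive_y() < 1.8"),
    ("costly y, and",     "1.2 < expensive_y() and expensive_y() < 1.8"),
]

for label, stmt in cases:
    t = timeit.timeit(stmt, setup=setup, number=100000)
    print("%-18s %.4f s" % (label, t))
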
Bakuriu
  • I think your answer doesn't include the straightforward and relevant fact that the quoted page, in the particular case of the question - comparing floats - is simply wrong. The chained comparison is not faster, as seen in both measurements and in the generated bytecode. – pvg Dec 01 '15 at 08:32
  • Answering a question tagged performance with "maybe you shouldn't be thinking about performance so much" doesn't seem useful to me. You're making potentially patronising assumptions about the questioner's grasp of general programming principles and then mostly talking about them instead of the issue at hand. – Ben Millwood Dec 01 '15 at 16:33
  • @Veedrac you're misreading the comment. The proposed optimization in the original document the OP was relying on is wrong, in the case of floats at a minimum. It is not faster. – pvg Dec 08 '15 at 02:37