39

I have a switch case program:

Ascending order switch cases:

int main()
{
        int a, sc = 1;
        switch (sc)
        {
                case 1:
                        a = 1;
                        break;
                case 2:
                        a = 2;
                        break;
        }
}

Assembly of code:

main:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], 1
        mov     eax, DWORD PTR [rbp-4]
        cmp     eax, 1
        je      .L3
        cmp     eax, 2
        je      .L4
        jmp     .L2
.L3:
        mov     DWORD PTR [rbp-8], 1
        jmp     .L2
.L4:
        mov     DWORD PTR [rbp-8], 2
        nop
.L2:
        mov     eax, 0
        pop     rbp
        ret

Descending order switch cases:

int main()
{
        int a, sc = 1;
        switch (sc)
        {
                case 2:
                        a = 1;
                        break;
                case 1:
                        a = 2;
                        break;
        }
}

Assembly of code:

main:
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-4], 1
        mov     eax, DWORD PTR [rbp-4]
        cmp     eax, 1
        je      .L3
        cmp     eax, 2
        jne     .L2
        mov     DWORD PTR [rbp-8], 1
        jmp     .L2
.L3:
        mov     DWORD PTR [rbp-8], 2
        nop
.L2:
        mov     eax, 0
        pop     rbp
        ret

Here, the ascending-order cases generated more assembly than the descending-order cases.

So, if I have a larger number of switch cases, does the order of the cases affect performance?

msc
  • 33,420
  • 29
  • 119
  • 214
  • 35
    Discussing the number of generated instructions in non-optimized code is completely meaningless. And note that the examples aren't even equivalent, since you assign different numbers between them. Also, what does this have to do with C99 and C11? – Lundin Nov 03 '17 at 07:51
  • 7
    You forgot to tell us about your platform, compiler and optimization settings, how critical performance is to you, and what kind and size of code you are developing. If you are willing to spend 200k€ to gain a few percent of performance you should say that too. – Basile Starynkevitch Nov 03 '17 at 09:14
  • 2
    Have a look at a very similar question regarding ordering of if statements: https://stackoverflow.com/questions/46833310/what-is-the-effect-of-ordering-if-else-if-statements-by-probability – Andriy Berestovskyy Nov 03 '17 at 10:34
  • To simulate a real case with an input that isn't a compile-time constant, use `int foo(int sc){ ...; return a; }`. Here it is on Godbolt with gcc and clang `-O3 -mtune=intel`: https://godbolt.org/g/anj5Tu. Interestingly, clang5.0 uses a cmov chain (because I added an `a=-1` default instead of returning an uninitialized `a` when sc isn't 0 or 1). Of course, in a real use-case the code that uses `a` will get mixed into this, e.g. `if (a == 1)` later will probably actually just branch on `sc`. – Peter Cordes Nov 04 '17 at 13:14
  • Although kinda obvious, it should be noted that generally, due to fallthrough, the ordering of `case` statements _also_ changes program behavior unless every `case` is ended with `break` (as happens to be in the example). – Damon Nov 04 '17 at 17:04

7 Answers

58

You're looking at unoptimized code, so studying it for performance isn't very meaningful. If you look at optimized code for your examples, you'll find that it doesn't do the comparisons at all! The optimizer notices that the switch variable sc always has the value 1, so it removes the unreachable case 2.

The optimizer also sees that the variable a isn't used after it's assigned, so it removes the code in case 1 as well, leaving main() an empty function. And it removes the function prolog/epilog that manipulates rbp since that register is unused.

So the optimized code ends up the same for either version of your main() function:

main:
    xor eax, eax
    ret

In short, for the code in the question, it doesn't matter which order you put the case statements, because none of that code will be generated at all.

Would the case order matter in a more real-life example where the code actually is generated and used? Probably not. Note that even in your unoptimized generated code, both versions test for the two case values in numeric order, checking first for 1 and then for 2, regardless of the order in the source code. Clearly the compiler is doing some sorting even in the unoptimized code.

Be sure to note Glenn and Lundin's comments: the order of the case sections is not the only change between your two examples, the actual code is different too. In one of them, the case values match the values set into a, but not so in the other.

Compilers use various strategies for switch/case statements depending on the actual values used. They may use a series of comparisons as in these examples, or perhaps a jump table. It can be interesting to study the generated code, but as always, if performance matters, watch your optimization settings and test it in a real-life situation.

Michael Geary
  • 28,450
  • 9
  • 65
  • 75
18

Compiler optimization of switch statements is tricky. Of course, you need to enable optimizations (e.g. compile your code with gcc -O2 -fverbose-asm -S and look inside the generated .s assembler file). BTW, on both of your examples my GCC 7 on Debian/Sid/x86-64 gives simply:

        .type   main, @function
main:
.LFB0:
        .cfi_startproc
# rsp.c:13: }
        xorl    %eax, %eax      #
        ret
        .cfi_endproc

(so there is no trace of the switch in that generated code)

If you need to understand how a compiler could optimize switch, there are some papers on that subject, such as this one.

If I have more switch cases, does the order of the cases affect performance?

Not in general, if you are using some optimizing compiler and asking it to optimize. See also this.

If that matters to you so much (but it should not, leave micro-optimizations to your compiler!), you need to benchmark, to profile and perhaps to study the generated assembler code. BTW, cache misses and register allocation could matter much more than order of case-s so I think you should not bother at all. Keep in mind the approximate timing estimates of recent computers. Put the cases in the most readable order (for the next developer working on that same source code). Read also about threaded code. If you have objective (performance related) reasons to re-order the case-s (which is very unlikely and should happen at most once in your lifetime), write some good comment explaining those reasons.

If you care that much about performance, be sure to benchmark and profile, and choose a good compiler and use it with relevant optimization options. Perhaps experiment with several different optimization settings (and maybe several compilers). You may want to add -march=native (in addition to -O2 or -O3). You could consider compiling and linking with -flto -O2 to enable link-time optimizations, etc. You might also want profile-based optimizations.

BTW, many compilers are huge free software projects (in particular GCC and Clang). If you care that much about performance, you might patch the compiler, extend it by adding some additional optimization pass (by forking the source code, by adding some plugin to GCC or some GCC MELT extensions). That requires months or years of work (notably to understand the internal representations and organization of that compiler).

(Don't forget to take development costs into account; in most cases, they cost much more)

Basile Starynkevitch
  • 223,805
  • 18
  • 296
  • 547
  • I'd expect that Profile Guided Optimizations would be very helpful if there are complex `switch`es in hot code. IDK if gcc always just orders the cases in ascending order in the asm, but if one is much more common than the others I expect it would check for that one first. – Peter Cordes Nov 03 '17 at 17:26
  • @PeterCordes: On the other hand, dense integer switches can turn into indirect jmps which beat even the profile guided case first. – Joshua Nov 03 '17 at 19:41
  • 1
    @PeterCordes: it'd be interesting if GCC, after failing to see any optimization through reordering the cases itself, allowed something like `case __builtin_likely(expr, your_case_statement)` as a further hint. However, they're not treated as constant-expressions (weirdly) and thus won't allow it (clang [seems to](https://godbolt.org/g/iLFbHx), but yielding no difference [it's still a hint, after all]). – edmz Nov 04 '17 at 12:38
7

In cases where most case labels are consecutive, compilers will often compile switch statements into jump tables rather than sequences of comparisons. The exact means by which compilers decide what form of computed jump to use (if any) varies among implementations. Sometimes adding extra cases to a switch statement may improve performance by simplifying the compiler's generated code. For example, if code uses cases 4-11 while cases 0-3 are handled in default fashion, adding explicit case 0:; case 1:; case 2:; case 3:; labels before the default: may result in the compiler comparing the operand against 12 and, if it's less, using a 12-item jump table. Omitting those cases might instead cause the compiler to subtract 4 before comparing the difference against 8, and then use an 8-item table.

One difficulty in trying to optimize switch statements is that compilers generally know better than programmers how the performance of different approaches would vary when given certain inputs, but programmers may know better than compilers what distribution of inputs a program would receive. Given something like:

if (x==0)
  y++;
else switch(x)
{
  ...
}

a "smart" compiler might recognize that changing the code to:

switch(x)
{
  case 0:
    y++;
    break;
  ...
}

could eliminate a comparison in all cases where x is non-zero, at the cost of a computed jump when x is zero. If x is non-zero most of the time, that would be a good trade. If x is zero 99.9% of the time, however, that might be a bad trade. Different compiler writers differ as to the extent to which they will try to optimize constructs like the former into the latter.

supercat
  • 77,689
  • 9
  • 166
  • 211
6

Performance would depend mostly on the number of branch misses for a given dataset, not so much on the total number of cases. And that in turn depends highly on the actual data and on how the compiler chose to implement the switch (dispatch table, chained conditionals, tree of conditionals -- not sure if you can even control this from C).

Thilo
  • 257,207
  • 101
  • 511
  • 656
5

The switch statement is usually compiled via jump tables, not via simple comparisons.

So, there is no loss in performance if you permute the case-statements.

However, sometimes it is useful to keep more cases in consecutive order and not to use break/return in some entries, in order for the flow of execution to go to the next case and avoid duplicating the code.

When the gaps between case values are large, as in case 10: and case 200000:, the compiler will surely not generate a jump table, since it would have to fill about 200K entries, almost all of them pointing at the default: case; in that situation it will use comparisons instead.

alinsoar
  • 15,386
  • 4
  • 57
  • 74
  • 1
    When the number of cases is small (like here), compilers use a chain of branches. Presumably gcc will order the cases based on profile-guided optimization if enabled. Another case where order doesn't matter, but using a non-jump strategy: if there are many cases that all fall through to a couple blocks of code, compilers can turn that into checking a bitmap. Even putting the bitmap in a register with an immediate: https://stackoverflow.com/questions/97987/advantage-of-switch-over-if-else-statement#comment52559281_129515 – Peter Cordes Nov 03 '17 at 17:20
  • @PeterCordes yes, of course, there are cases when it makes no sense to use dynamic jump. I updated the answer. – alinsoar Nov 03 '17 at 17:26
  • @PeterCordes one can generate a logarithmic number of comparisons, as in a balanced binary search tree, if the jump tables are too memory-demanding. – alinsoar Nov 03 '17 at 17:29
  • 1
    Yeah, IIRC gcc will do something like a binary search instead of linear when there are many cases and it doesn't use a table. – Peter Cordes Nov 03 '17 at 17:32
4

You should probably enable optimisations for your compiler before comparing assembly code. However, the problem is that your variable's value is known at compile time, so the compiler can remove everything from your function, because it doesn't have any side effects.

This example shows that even if you change the order of the cases in a switch statement in your example, GCC and most other compilers will reorder them if the optimisations are enabled. I used extern functions to make sure the values are only known at run-time but I could also have used rand for example.

Also, when you add more cases, the compiler may replace the conditional jumps by a table which contains the addresses of the functions and it will still get reordered by GCC as can be seen here.

4

The answer to your question is simple: your code isn't the same, so it won't produce the same assembly! Optimised code doesn't just depend on the individual statements, but also on everything around them. And in this case, it's easy to explain the optimisation.

In your first example, case 1 results in a=1, and case 2 results in a=2. The compiler can optimise this to set a=sc for those two cases, which is a single statement.

In your second example, case 1 results in a=2, and case 2 results in a=1. The compiler can no longer take that shortcut, so it has to explicitly set a=1 or a=2 for both cases. Of course this needs more code.

If you simply took your first example and swapped the order of the cases and conditional code then you should get the same assembler.

You can test this optimisation by using the code

int main()
{
    int a, sc = 1;

    switch (sc)
    {
        case 1:
        case 2:
            a = sc;
            break;
    }
}

which should also give exactly the same assembler.

Incidentally, your test code assumes that sc is actually read. Most modern optimising compilers are able to spot that sc does not change between assignment and the switch statement, and replace reading sc with a constant value 1. Further optimisation will then remove the redundant branch(es) of the switch statement, and then even the assignment could be optimised away because a does not actually change. And from the point of view of variable a, the compiler may also discover that a is not read elsewhere and so remove that variable from the code completely.

If you really want sc to be read and a to be set, you need to declare them both volatile. Fortunately the compiler seems to have implemented it the way you expected - but you absolutely cannot expect this when you have optimisation turned on.

Graham
  • 1,655
  • 9
  • 19
  • You don't need `volatile`, you can just make `sc` a function arg and `a` a return value. (Preferably for a function not called `main`). – Peter Cordes Nov 03 '17 at 17:22
  • @PeterCordes Even that may fail, because one available optimisation step is to inline functions (compiling for speed over size). I don't know personally whether compilers are smart enough to optimise constant parameters and constant return values once they've done the inline, but in principle it should be possible - which opens up the unwelcome prospect of code which changes its runtime behaviour when you upgrade your compiler. – Graham Nov 03 '17 at 18:11
  • 1
    It can't fail on Godbolt when the only function in the translation unit is the one whose code you want to see! The idea is to simulate what you might get in real code when the input value isn't a compile-time constant. (Constant propagation after inlining is a valuable optimization. That's one of the major benefits of inlining! Why would you be unhappy when a new version of your compiler does a better job optimizing? Unless you've made "brittle" optimizations that happen to get good asm out of a particular compiler version I guess.) – Peter Cordes Nov 03 '17 at 18:22
  • @PeterCordes That was my point. The OP is making assumptions about the expected behaviour of his compiled code, when a good optimiser could make a bit of a mess of his expectations! In his case it's not particularly "good asm" - just "asm which does anything like he was expecting" is nothing like guaranteed. – Graham Nov 03 '17 at 19:14
  • OP's disassembly samples are unoptimized code, not optimized code. Any optimizer worth its name would eliminate the entire function body as dead code. – Michael Geary Nov 05 '17 at 07:15