Why is the method, I think of to be the quicker one, the slower one?

Question

I've currently got two methods for checking if a number is prime or not and another method to calculate the time both need.

IsPrime1:

bool IsPrime1(int i)
{
    if (i == 2 || i == 3 || i == 5 || i == 7) return true;
    return i % 2 != 0 && i % 3 != 0 && i % 5 != 0 && i % 7 != 0;
}

IsPrime2:

bool IsPrime2(int i)
{
    if (i == 2 || i == 3 || i == 5 || i == 7) return true;
    if (i % 2 == 0) return false;
    if (i % 3 == 0) return false;
    if (i % 5 == 0) return false;
    return i % 7 != 0;
}

CheckForTicks:

string CheckForTicks(int ticks)
{
    var sw1 = Stopwatch.StartNew();
    for (var g = 0; g < ticks; g++)
    {
        var b = IsPrime1(g);
    }
    sw1.Stop();

    var sw2 = Stopwatch.StartNew();
    for (var g = 0; g < ticks; g++)
    {
        var b = IsPrime2(g);
    }
    sw2.Stop();

    return $"{ticks} ticks: IsPrime1: {sw1.ElapsedMilliseconds} ms / IsPrime2: {sw2.ElapsedMilliseconds} ms";
    //equal to the following:
    //return string.Format("{0} ticks: IsPrime1: {1} ms / IsPrime2: {2} ms", ticks, sw1.ElapsedMilliseconds, sw2.ElapsedMilliseconds);
}

Results:

| CheckForTicks | IsPrime1 (in ms) | IsPrime2 (in ms) |
|---------------|------------------|------------------|
|        100000 |                3 |                4 |
|        500000 |               18 |               21 |
|       1000000 |               37 |               45 |
|       5000000 |              221 |              242 |
|      10000000 |              402 |              499 |
|      50000000 |             2212 |             2320 |
|     100000000 |             4377 |             4676 |
|     500000000 |            22125 |            23786 |

What I wonder is, why IsPrime2 is even slightly slower than IsPrime1.
From my point of view IsPrime2 should be much quicker as IsPrime1 because it only has to check once before the first possible return and IsPrime1 checks all possibilities.
Is there something I don't know about or is this related to .NET?

I'd be very appreciated if somebody can explain the cause of this to me.

Thanks in advance!

PS: I'm using Visual Studio 2015 RC and .NET 4.6 and ran it in Debug mode.

They're exactly equivalent, due to short-circuit evaluation of `&&`. — Oliver Charlesworth, Jun 28 '15 at 11:34
Why are you even checking for `i % 9`? That's covered by `i % 3`... — Jon Skeet, Jun 28 '15 at 11:35
Is this Release mode without debugger attached? You're lucky the benchmark loop isn't deleted entirely. — usr, Jun 28 '15 at 11:37
_IsPrime1 checks all possibilities_ No it doesn't. It is done after the 1st fail. — TaW, Jun 28 '15 at 11:44
@OliverCharlesworth gives the correct answer. But what about primes like 11, 13, ...? — James, Jun 28 '15 at 11:48
such minor differences are in the margin of error. And that margin is even bigger when you don't do a proper benchmark. — Jeroen Vannevel, Jun 28 '15 at 11:56
You could compile both variations and check if the executables are identical. I would expect them to be identical. — Jonathan, Jun 28 '15 at 12:01
@Jonathan unfortunately the compiler isn't smart enough for that :( — Lucas Trzesniewski, Jun 28 '15 at 12:39
@JeroenVannevel 5% at 22sec test time are not noise. That is highly significant. — usr, Jun 28 '15 at 13:24

score 9 · Accepted Answer · edited May 23 '17 at 12:30

Let's compare the IL code:

IsPrime1

.method private hidebysig instance bool  IsPrime1(int32 i) cil managed
{
  // Code size       45 (0x2d)
  .maxstack  8
  IL_0000:  ldarg.1
  IL_0001:  ldc.i4.2
  IL_0002:  beq.s      IL_0010
  IL_0004:  ldarg.1
  IL_0005:  ldc.i4.3
  IL_0006:  beq.s      IL_0010
  IL_0008:  ldarg.1
  IL_0009:  ldc.i4.5
  IL_000a:  beq.s      IL_0010
  IL_000c:  ldarg.1
  IL_000d:  ldc.i4.7
  IL_000e:  bne.un.s   IL_0012
  IL_0010:  ldc.i4.1
  IL_0011:  ret
  IL_0012:  ldarg.1
  IL_0013:  ldc.i4.2
  IL_0014:  rem
  IL_0015:  brfalse.s  IL_002b
  IL_0017:  ldarg.1
  IL_0018:  ldc.i4.3
  IL_0019:  rem
  IL_001a:  brfalse.s  IL_002b
  IL_001c:  ldarg.1
  IL_001d:  ldc.i4.5
  IL_001e:  rem
  IL_001f:  brfalse.s  IL_002b
  IL_0021:  ldarg.1
  IL_0022:  ldc.i4.7
  IL_0023:  rem
  IL_0024:  ldc.i4.0
  IL_0025:  ceq
  IL_0027:  ldc.i4.0
  IL_0028:  ceq
  IL_002a:  ret
  IL_002b:  ldc.i4.0
  IL_002c:  ret
} // end of method Program::IsPrime1

IsPrime2

.method private hidebysig instance bool  IsPrime2(int32 i) cil managed
{
  // Code size       49 (0x31)
  .maxstack  8
  IL_0000:  ldarg.1
  IL_0001:  ldc.i4.2
  IL_0002:  beq.s      IL_0010
  IL_0004:  ldarg.1
  IL_0005:  ldc.i4.3
  IL_0006:  beq.s      IL_0010
  IL_0008:  ldarg.1
  IL_0009:  ldc.i4.5
  IL_000a:  beq.s      IL_0010
  IL_000c:  ldarg.1
  IL_000d:  ldc.i4.7
  IL_000e:  bne.un.s   IL_0012
  IL_0010:  ldc.i4.1
  IL_0011:  ret
  IL_0012:  ldarg.1
  IL_0013:  ldc.i4.2
  IL_0014:  rem
  IL_0015:  brtrue.s   IL_0019
  IL_0017:  ldc.i4.0
  IL_0018:  ret
  IL_0019:  ldarg.1
  IL_001a:  ldc.i4.3
  IL_001b:  rem
  IL_001c:  brtrue.s   IL_0020
  IL_001e:  ldc.i4.0
  IL_001f:  ret
  IL_0020:  ldarg.1
  IL_0021:  ldc.i4.5
  IL_0022:  rem
  IL_0023:  brtrue.s   IL_0027
  IL_0025:  ldc.i4.0
  IL_0026:  ret
  IL_0027:  ldarg.1
  IL_0028:  ldc.i4.7
  IL_0029:  rem
  IL_002a:  ldc.i4.0
  IL_002b:  ceq
  IL_002d:  ldc.i4.0
  IL_002e:  ceq
  IL_0030:  ret
} // end of method Program::IsPrime2

The first part is the same for both:

  .maxstack  8
  IL_0000:  ldarg.1
  IL_0001:  ldc.i4.2
  IL_0002:  beq.s      IL_0010
  IL_0004:  ldarg.1
  IL_0005:  ldc.i4.3
  IL_0006:  beq.s      IL_0010
  IL_0008:  ldarg.1
  IL_0009:  ldc.i4.5
  IL_000a:  beq.s      IL_0010
  IL_000c:  ldarg.1
  IL_000d:  ldc.i4.7
  IL_000e:  bne.un.s   IL_0012
  IL_0010:  ldc.i4.1
  IL_0011:  ret

Without surprise, this matches:

if (i == 2 || i == 3 || i == 5 || i == 7) return true;

The rest of the code is equivalent, but the compiler generated shorter code for IsPrime1.

IL_0012:  ldarg.1              // Push i
IL_0013:  ldc.i4.2             // Push 2
IL_0014:  rem                  // Pop these and push i % 2
IL_0015:  brfalse.s  IL_002b   // Go to IL_002b if the result is 0
...                            // Repeat the same pattern for 3, 5 and 7
IL_002b:  ldc.i4.0             // Push 0 (false)
IL_002c:  ret                  // Return

Here's the same part in IsPrime2:

IL_0012:  ldarg.1              // Push i
IL_0013:  ldc.i4.2             // Push 2
IL_0014:  rem                  // Pop these and push i % 2
IL_0015:  brtrue.s   IL_0019   // Go to IL_0019 if the result is not 0
IL_0017:  ldc.i4.0             // Else load 0 (false)
IL_0018:  ret                  // ... and return
IL_0019:  ...                  // Here's the next condition
...

As you can see, the return false code is repeated in IsPrime2 several times, but is factored in the case of IsPrime1. Shorter code means less instructions to load and process, which in turn means less overhead and less processing time.

Now, what about the JIT? Does it optimize any of this?

IsPrime1 x86

            return i % 2 != 0 && i % 3 != 0 && i % 5 != 0 && i % 7 != 0;
00000022  mov         eax,esi 
00000024  and         eax,80000001h 
00000029  jns         00000030 
0000002b  dec         eax 
0000002c  or          eax,0FFFFFFFEh 
0000002f  inc         eax 
00000030  test        eax,eax 
00000032  je          00000061 
00000034  mov         eax,esi 
00000036  mov         ecx,3 
0000003b  cdq 
0000003c  idiv        eax,ecx 
0000003e  test        edx,edx 
00000040  je          00000061 
00000042  mov         eax,esi 
00000044  lea         ecx,[ecx+2] 
00000047  cdq 
00000048  idiv        eax,ecx 
0000004a  test        edx,edx 
0000004c  je          00000061 
0000004e  lea         ecx,[ecx+2] 
00000051  mov         eax,esi 
00000053  cdq 
00000054  idiv        eax,ecx 
00000056  test        edx,edx 
00000058  setne       al 
0000005b  movzx       eax,al 
0000005e  pop         esi 
0000005f  pop         ebp 
00000060  ret 
00000061  xor         eax,eax 
00000063  pop         esi 
00000064  pop         ebp 
00000065  ret

IsPrime2 x86

            if (i % 2 == 0) return false;
00000021  mov         eax,esi 
00000023  and         eax,80000001h 
00000028  jns         0000002F 
0000002a  dec         eax 
0000002b  or          eax,0FFFFFFFEh 
0000002e  inc         eax 
0000002f  test        eax,eax 
00000031  jne         00000037 
00000033  xor         eax,eax 
00000035  jmp         0000006D 
            if (i % 3 == 0) return false;
00000037  mov         eax,esi 
00000039  mov         ecx,3 
0000003e  cdq 
0000003f  idiv        eax,ecx 
00000041  test        edx,edx 
00000043  jne         00000049 
00000045  xor         eax,eax 
00000047  jmp         0000006D 
            if (i % 5 == 0) return false;
00000049  mov         eax,esi 
0000004b  mov         ecx,5 
00000050  cdq 
00000051  idiv        eax,ecx 
00000053  test        edx,edx 
00000055  jne         0000005B 
00000057  xor         eax,eax 
00000059  jmp         0000006D 
            return i % 7 != 0;
0000005b  mov         ecx,7 
00000060  mov         eax,esi 
00000062  cdq 
00000063  idiv        eax,ecx 
00000065  test        edx,edx 
00000067  setne       al 
0000006a  movzx       eax,al 
0000006d  and         eax,0FFh 
00000072  pop         esi 
00000073  pop         ebp 
00000074  ret

The answer is... the native code is still longer in the case of IsPrime2. For instance, jne 00000037 jumps to the second test, jne 00000049 jumps to the third one etc. In the case of IsPrime1, every branch points to 00000061 which is basically a return false;.

Here's the x64 code for reference:

IsPrime1 x64

            return i % 2 != 0 && i % 3 != 0 && i % 5 != 0 && i % 7 != 0;
0000001f  mov         eax,r8d 
00000022  cdq 
00000023  and         eax,1 
00000026  xor         eax,edx 
00000028  sub         eax,edx 
0000002a  test        eax,eax 
0000002c  je          000000000000008B 
0000002e  mov         eax,55555556h 
00000033  imul        r8d 
00000036  mov         eax,edx 
00000038  shr         eax,1Fh 
0000003b  add         edx,eax 
0000003d  lea         eax,[rdx+rdx*2] 
00000040  mov         ecx,r8d 
00000043  sub         ecx,eax 
00000045  test        ecx,ecx 
00000047  je          000000000000008B 
00000049  mov         eax,66666667h 
0000004e  imul        r8d 
00000051  sar         edx,1 
00000053  mov         eax,edx 
00000055  shr         eax,1Fh 
00000058  add         edx,eax 
0000005a  lea         eax,[rdx+rdx*4] 
0000005d  mov         ecx,r8d 
00000060  sub         ecx,eax 
00000062  test        ecx,ecx 
00000064  je          000000000000008B 
00000066  mov         eax,92492493h 
0000006b  imul        r8d 
0000006e  add         edx,r8d 
00000071  sar         edx,2 
00000074  mov         eax,edx 
00000076  shr         eax,1Fh 
00000079  add         edx,eax 
0000007b  imul        edx,edx,7 
0000007e  sub         r8d,edx 
00000081  xor         eax,eax 
00000083  test        r8d,r8d 
00000086  setne       al 
00000089  jmp         0000000000000092 
0000008b  xor         eax,eax 
0000008d  jmp         0000000000000092 
0000008f  nop 
            if (i == 2 || i == 3 || i == 5 || i == 7) return true;
00000090  mov         al,1 
00000092  rep ret

IsPrime2 x64

            if (i % 2 == 0) return false;
00000027  mov         eax,r8d 
0000002a  cdq 
0000002b  and         eax,1 
0000002e  xor         eax,edx 
00000030  sub         eax,edx 
00000032  test        eax,eax 
00000034  jne         000000000000003A 
00000036  xor         eax,eax 
00000038  jmp         00000000000000A2 
            if (i % 3 == 0) return false;
0000003a  mov         eax,55555556h 
0000003f  imul        r8d 
00000042  mov         eax,edx 
00000044  shr         eax,1Fh 
00000047  add         edx,eax 
00000049  lea         eax,[rdx+rdx*2] 
0000004c  mov         ecx,r8d 
0000004f  sub         ecx,eax 
00000051  test        ecx,ecx 
00000053  jne         0000000000000059 
00000055  xor         al,al 
00000057  jmp         00000000000000A2 
            if (i % 5 == 0) return false;
00000059  mov         eax,66666667h 
0000005e  imul        r8d 
00000061  sar         edx,1 
00000063  mov         eax,edx 
00000065  shr         eax,1Fh 
00000068  add         edx,eax 
0000006a  lea         eax,[rdx+rdx*4] 
0000006d  mov         ecx,r8d 
00000070  sub         ecx,eax 
00000072  test        ecx,ecx 
00000074  jne         000000000000007A 
00000076  xor         al,al 
00000078  jmp         00000000000000A2 
            return i % 7 != 0;
0000007a  mov         eax,92492493h 
0000007f  imul        r8d 
00000082  add         edx,r8d 
00000085  sar         edx,2 
00000088  mov         eax,edx 
0000008a  shr         eax,1Fh 
0000008d  add         edx,eax 
0000008f  imul        edx,edx,7 
00000092  sub         r8d,edx 
00000095  xor         eax,eax 
00000097  test        r8d,r8d 
0000009a  setne       al 
0000009d  jmp         00000000000000A2 
0000009f  nop 
            if (i == 2 || i == 3 || i == 5 || i == 7) return true;
000000a0  mov         al,1 
000000a2  rep ret

Same conclusion here. jne 0000000000000059 jumps to the second test, jne 000000000000007A jumps to the third one etc, whereas in IsPrime1 all branches point to 000000000000008B which is a return false;. Note the instruction count difference between the two versions is lower on x64 though.

Oh, and you should additionnally be aware of how branch prediction works, and how the CPU estimates if an upcoming branch is likely or unlikely to be taken.

If you rely on the .NET JIT to optimize something for you you're usually going to be disappointed. — usr, Jun 28 '15 at 13:25
@usr indeed. The JIT has too little time to attempt heavy optimizations (and control flow optimization is a heavy one). I wonder if NGen does better but something tells me I'm going to be disappointed too. — Lucas Trzesniewski, Jun 28 '15 at 13:31
And you're going to be disappointed by RyuJIT as well. I tested it thoroughly and with the exception of loop cloning (which is awesome!) it generally does worse than the old JIT. It's a step down. I would like to have a "/O2" for the .NET JIT. For web apps that's usually totally acceptable. — usr, Jun 28 '15 at 13:32
@usr argh, I didn't test it yet but they've advertised it to be better... Let's give it some time to mature though. Anyway, some optimizations just aren't feasible under JIT time constraints, and this is to be expected. — Lucas Trzesniewski, Jun 28 '15 at 13:34
I have submitted a lot of test cases to them. For example RyuJIT does not optimize away (a.x + a.x). It loads it twice from memory. Since you seem to be interested in jitted code throughput I can only encourage you to send test cases and wishes to the JIT team. In the corefx repo I have seen some optimizations that I requested land! Apparently, they take feedback seriously. But the changes are not in the latest VS preview release. I requested them 5 month ago. — usr, Jun 28 '15 at 13:39
@usr that's great news! Even if the changes aren't in the PR, you can still build your own CoreCLR. I suppose they just didn't have time to pass it though all the regression testing. We have some *very* performance/latency-sensitive code at our company, I'll make sure we take a closer look at RyuJIT. Thanks for your input! :-) — Lucas Trzesniewski, Jun 28 '15 at 13:53
@cramopy look for "IL Disassembler" in your start menu, it's a tool you get along with Visual Studio. It's not pretty but it does its job. You can also get more advanced tools like ILSpy (this one can disassemble a .NET assembly to several .NET languages (C#, VB, and IL). Make sure you compile in Release mode. — Lucas Trzesniewski, Jun 28 '15 at 17:27

score 1 · Answer 2 · answered Jun 28 '15 at 12:04

It gets a bit deep, but they do not compile to the same MSIL. Functionally they are equivalent, but at the core Prime1 only has one branch when it gets compile down, so it is either A or B. Thanks to short-circuting as soon as it his a false evaluation it stops.

Prime2 still only tests until it hits a false, but it compiles down into 4 branches instead of 1.

While there is a measurable difference in performance, in most cases I believe you would want to design that method for the more readable approach (whichever you feel that is).

I do agree with others that doing the measurements in debug mode is a bad idea, but the results are consistent. I would suggest trying it again with release mode and see if the difference stays consistent. — Jon Helms, Jun 28 '15 at 12:16

score 0 · Answer 3 · answered Jun 28 '15 at 12:16

They are not producing the same code: == is not != (that should be the same performance-wise though) and if is not using &&. Every if operation is performing cleanup (loads 0 to a register after an operation) after it's performed and && is not performing any cleanup between its evaluations.

If version:

ldarg.0
ldc.i4.2
rem
brtrue.s   IL_0020 // jumps to ret
ldc.i4.0
ret
// continues for 3, 5 and 7

&& version:

ldarg.0
ldc.i4.2
rem
brfalse.s  IL_0020 // jumps to ret
// continues for 3, 5 and 7

Why is the method, I think of to be the quicker one, the slower one?

3 Answers3