Conditional tests in primality by trial division

Question

My question is about the conditional test in trial division. There seems to be some debate on what conditional test to employ. Let's look at the code for this from RosettaCode.

int is_prime(unsigned int n)
{
    unsigned int p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    for (p = 3; p <= n/p; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

Wheel factorization or using a predetermined list of primes will not change the essence of my question.

There are three cases I can think of to do the conditional test:

p<=n/p
p*p<=n
int cut = sqrt(n); for (p = 3; p <= cut; p += 2)

Case 1: works for all n but it has to do an extra division every iteration (edit: actually it does not need an extra division but it's still slower. I'm not sure why. See the assembly output below). I have found it to be twice as slow as case 2 for large values of n which are prime (on my Sandy Bridge system).

Case 2: is signficantly faster than case 1 but it has a problem that it overflows for large n and goes into an infinitive loop. The max value it can handle is

(sqrt(n) + c)^2 = INT_MAX  //solve
n = INT_MAX -2*c*sqrt(INT_MAX) + c^2
//INT_MAX = 2^32 -> n = 2^32 - c*s^17 + c^2; in our case c = 2

For example for uint64_t case 2 will go into an infinite loop for x =-1L-58 (x^64-59) which is a prime.

Case 3: no division or multiplication has to be done every iteration and it does not overflow like case 2. It's slightly faster than case 2 as well. The only question is if sqrt(n) is accurate enough.

Can someone explain to me why case 2 is so much faster than case 1? Case 1 does NOT use an extra division as I though but despite that it's still a lot slower.

Here are the times for the prime 2^56-5;

case 1 9.0s
case 2 4.6s
case 3 4.5s

Here is the code I used to test this http://coliru.stacked-crooked.com/a/69497863a97d8953. I also added the functions to the end of this question.

Here is the assembly output form GCC 4.8 with -O3 for case 1 and case 2. They both only have one division. Case 2 has a multiplication as well so my first guess is that case 2 would be slower but it's about twice as fast on both GCC and MSVC. I don't know why.

Case 1:

.L5:
  testl %edx, %edx
  je  .L8
.L4:
  addl  $2, %ecx
  xorl  %edx, %edx
  movl  %edi, %eax
  divl  %ecx
  cmpl  %ecx, %eax
  jae .L5

Case 2:

.L20:
  xorl  %edx, %edx
  movl  %edi, %eax
  divl  %ecx
  testl %edx, %edx
  je  .L23
.L19:
  addl  $2, %ecx
  movl  %ecx, %eax
  imull %ecx, %eax
  cmpl  %eax, %edi
  jae .L20

Here are the functions I'm using to test the time:

int is_prime(uint64_t n)
{
    uint64_t p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    for (p = 3; p <= n/p; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

int is_prime2(uint64_t n)
{
    uint64_t p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    for (p = 3; p*p <= n; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

int is_prime3(uint64_t n)
{
    uint64_t p;
    if (!(n & 1) || n < 2 ) return n == 2;

    /* comparing p*p <= n can overflow */
    uint32_t cut = sqrt(n);
    for (p = 3; p <= cut; p += 2)
        if (!(n % p)) return 0;
    return 1;
}

Added content after the bounty.

Aean discovered that in case 1 saving the quotient as well as the remainder is just as fast (or slightly faster) than case 2. Let's call this case 4. The following following code is twice as fast as case 1.

int is_prime4(uint64_t n)
{
    uint64_t p, q, r;
    if (!(n & 1) || n < 2 ) return n == 2;

    for (p = 3, q=n/p, r=n%p; p <= q; p += 2, q = n/p, r=n%p)
        if (!r) return 0;
    return 1;
}

I'm not sure why this helps. In any case there is no need to use case 2 anymore. For case 3, most versions of the sqrt function in hardware or software get the perfect squares right so it's safe to use in general. Case 3 is the only case that will work with OpenMP.

Case 1 shouldn't really involve an extra division: your processor is probably capable of computing both the quotient and remainder in a single instruction, and that's exposed to C (at least in C99) in the form of the `div` function. Have you tried using that? — Mark Dickinson, Mar 21 '14 at 10:58
Reference for `div` and `ldiv`: see section 7.20.6.2 of http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1256.pdf — Mark Dickinson, Mar 21 '14 at 10:59
@MarkDickinson, no I have not tried that. I assumed the compiler would do this optimization for me (MSVC2012 and GCC 4.8+)... — Z boson, Mar 21 '14 at 11:49
Hmm; yes, I guess it's possible it would, depending on optimisation levels, etc. You could take a look at the assembly output (compile with `-S` flag) to check. — Mark Dickinson, Mar 21 '14 at 11:51
@MarkDickinson, I looked into those functions. They call external functions and don't inline a single instruction. However, I looked at the assembly output of case 1 and 2 and they both only do division once. I don't know why case 1 is so much slower. I updated my answer with the assembly output. — Z boson, Mar 21 '14 at 12:50
@MarkDickinson, I think it might have to do with out of order execution. Case one has to wait for the division to finish and case 2 can start the next block of code before the division finishes. — Z boson, Mar 21 '14 at 12:56
Darn. That's not much of a win for `div` and `ldiv`, then. I guess the separation between `gcc` and `glibc` probably isn't helping here. Your suggestion about execution order sounds promising. I'm getting out of my depth here, though (and out of time to spend thinking about it), so I'm going to bow out at this point. — Mark Dickinson, Mar 21 '14 at 13:47
(and sadly `div` and `ldiv` don't appear here: http://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html) — Mark Dickinson, Mar 21 '14 at 13:58
@MarkDickinson, my guess is that they are something for the past not recommended anymore. Best to let the compiler take care of this. — Z boson, Mar 21 '14 at 15:00
"Case 3 is the superior case in general" does oblige a floating point library (often OK) and a `sqrt()` of sufficient precision (usual) and properly constructed to return the _best_ answer (there are some spotty libraries). If looking for a high quality solution, simple code an `uisqrt()` function. It does not need to be blazing fast. See http://stackoverflow.com/questions/18499492/how-can-you-easily-calculate-the-square-root-of-an-unsigned-long-long-in-c — chux - Reinstate Monica, Mar 21 '14 at 21:12
@chux, I'm interested in the SQRTSD (or SQRTPD) SSE2 instruction. I wish I could find a source on it's accuracy. But I might just write `uisqrt()` for the learning. With Newton it has quadratic convergence so it will be more or less blazingly fast anyway. — Z boson, Mar 21 '14 at 21:37
Even a simple test and bit shift is only 64*n simple instructions. Confident that the `SSE2` is good to the last bit. The tricky part is that it might give differing results depending on rounding mode IDNK. Rolling your own code can avoid rounding mode settings. — chux - Reinstate Monica, Mar 21 '14 at 21:41

Aean · Accepted Answer · 2014-03-25T12:05:43.513

4

UPD: This is a compiler optimization issue, obviously. While MinGW used only one div instruction in loop body, both GCC on Linux and MSVC failed to reuse the quotient from previous iteration.

I think the best we could do is explicitly define quo and rem and calculate them in the same basic instruction block, to show the compiler we want both quotient and remainder.

int is_prime(uint64_t n)
{
    uint64_t p = 3, quo, rem;
    if (!(n & 1) || n < 2) return n == 2;

    quo = n / p;
    for (; p <= quo; p += 2){
        quo = n / p; rem = n % p;
        if (!(rem)) return 0;
    }
    return 1;
}

I tried your code from http://coliru.stacked-crooked.com/a/69497863a97d8953 on a MinGW-w64 compiler, case 1 is faster than case 2.

enter image description here

So I guess you are compiling targeted to a 32-bit architecture and used uint64_t type. Your assembly shows it doesn't use any 64-bit register.

If I got it right, there is the reason.

On 32-bit architecture, 64-bit numbers is represented in two 32-bit registers, your compiler will do all concatenation works. It's simple to do 64-bit addition, subtraction and multiplication. But modulo and division is done by a small function call which named as ___umoddi3 and ___udivdi3 in GCC, aullrem and aulldiv in MSVC.

So actually you need one ___umoddi3 and one ___udivdi3 for each iteration in case 1, one ___udivdi3 and one concatenation of 64-bit multiplication in case 2. That's why case 1 seems twice slower than case 2 in your test.

What you really get in case 1:

L5:
    addl    $2, %esi
    adcl    $0, %edi
    movl    %esi, 8(%esp)
    movl    %edi, 12(%esp)
    movl    %ebx, (%esp)
    movl    %ebp, 4(%esp)
    call    ___udivdi3         // A call for div
    cmpl    %edi, %edx
    ja  L6
    jae L21
L6:
    movl    %esi, 8(%esp)
    movl    %edi, 12(%esp)
    movl    %ebx, (%esp)
    movl    %ebp, 4(%esp)
    call    ___umoddi3        // A call for modulo.
    orl %eax, %edx
    jne L5

What you really get in case 2:

L26:
    addl    $2, %esi
    adcl    $0, %edi
    movl    %esi, %eax
    movl    %edi, %ecx
    imull   %esi, %ecx
    mull    %esi
    addl    %ecx, %ecx
    addl    %ecx, %edx
    cmpl    %edx, %ebx
    ja  L27
    jae L41
L27:
    movl    %esi, 8(%esp)
    movl    %edi, 12(%esp)
    movl    %ebp, (%esp)
    movl    %ebx, 4(%esp)
    call    ___umoddi3         // Just one call for modulo
    orl %eax, %edx
    jne L26

MSVC failed to reuse the result of div. The optimization is broken by return. Try these code:

__declspec(noinline) int is_prime_A(unsigned int n)
{
    unsigned int p;
    int ret = -1;
    if (!(n & 1) || n < 2) return n == 2;

    /* comparing p*p <= n can overflow */
    p = 1;
    do {
        p += 2;
        if (p >= n / p) ret = 1; /* Let's return latter outside the loop. */
        if (!(n % p)) ret = 0;
    } while (ret < 0);
    return ret;
}

__declspec(noinline) int is_prime_B(unsigned int n)
{
    unsigned int p;
    if (!(n & 1) || n < 2) return n == 2;

    /* comparing p*p <= n can overflow */
    p = 1;
    do {
        p += 2;
        if (p > n / p) return 1; /* The common routine. */
        if (!(n % p)) return 0;
    } while (1);
}

The is_prime_B will be twice slower than is_prime_A on MSVC / ICC for windows.

edited Mar 25 '14 at 12:05

answered Mar 25 '14 at 08:28

Aean

759
1
6
16

Interesting. I ran it in MSVC201 64-bit mode release and got: case 1 1.8s, case 2 1.2 s, case 3 1.1 seconds. So case two is still about twice as slow with MSVC2012 in 64-bit mode. Let me try it with GCC on Linux. – Z boson Mar 25 '14 at 08:36
@Z boson MSVC failed to reuse the result of `div`, I will explain more. Please wait a moment. – Aean Mar 25 '14 at 08:38
Nope, I just booted into Linux and compiled with `g++ -O3 -fopenmp prime.C and I get the same results: case 1 2.0s, case 2 1.0 s, case 3 1.1 s. Case 1 is still twice as slow. I have a E5-1620@3.60GHz (Xeon Sandy Bridge cores). – Z boson Mar 25 '14 at 08:44
Have you checked the assembly output? – Aean Mar 25 '14 at 08:48
No. I'll check later. Just to be clear. I dual boot so I tested MSVC and GCC on the same machine just different OS. – Z boson Mar 25 '14 at 08:57
I modified your `is_prime_A` (and posted it to the end of my question) to use int64_t and I can confirm that's it's the same speed as case 2 and 3 now. I think you almost got it (+1)! But It's not 32-bit vs 64-bit or MSVC vs GCC since I'm only using 64-bit and MSVC and GCC both gave the same result. – Z boson Mar 25 '14 at 10:29
Be careful, that function may have bugs since I did't write it seriously. I just added an `else` before the second `if`. Please double check. – Aean Mar 25 '14 at 10:36
I think you should remove the first half of your answer about 32-bit since it does not apply. Also remove MSVC since both GCC and MSVC have the same problem. This result is strange to me. I expect the compiler to make the best optimization but in this case a simple change to the code makes a big difference. Can you explain why this is? – Z boson Mar 25 '14 at 10:45
The assembly is generated for your test code on my compiler, with `-m32 -O3 -S` flag. If I remove `-m32` flag `case 1` will be the fastest one. So I guessed that's the problem. My MinGW generated exactly the same assembly as you gave, with only one `div`, but MSVC failed with the same code. So I assumed that's an issue of MSVC. I use VS2013 and TDM-GCC 4.8. – Aean Mar 25 '14 at 10:55
Test the code at http://coliru.stacked-crooked.com/a/69497863a97d8953 with MinGW and see what you get. It makes no sense that MinGW on Windows would be so different than GCC on Linux. – Z boson Mar 25 '14 at 11:04
I found the `else` I just added has made the compiler use double `div` again ... So surprise ... I have no idea why the compiler refused to reuse the result from previous iteration. – Aean Mar 25 '14 at 11:27
It still works for me. See the results here http://coliru.stacked-crooked.com/a/476b1b5beec81016. You are on to something! This is interesting. – Z boson Mar 25 '14 at 11:47
This is what I'm wondering. If the `n/p` and `n%p` is in one basic instruction block, without any branches or entries, compiler will generate only one `div` instruction. – Aean Mar 25 '14 at 11:59
In either case it only generates one div (see the original assembly output I posted). I think it has to do with a dependency. Case 1 has a dependency, case 2 does not. Your new code breaks the dependency. I'm just having trouble describing precisely the details. – Z boson Mar 25 '14 at 12:11
I don't think so. It's not about dependency. See the profile result here: http://aean.net/eximg/doublediv.png – Aean Mar 25 '14 at 12:22
I tried your new function you just posted for update. It is as fast as case 2 as well. – Z boson Mar 25 '14 at 12:30
Multiplication is approx. 30 times faster than division on x86. So there is almost no difference between case 1 and case 2. I checked the assembly of case 2, there is still the same data dependency chain. If you are just looking for more faster approach I would give a try. – Aean Mar 25 '14 at 12:50
I just tried `for (p = 3, quo=n/p, rem=n%p; p <= quo; p += 2, quo = n/p, rem=n%p)` and it's as fast as case 2 as well. This is interesting. – Z boson Mar 25 '14 at 12:56
You have explained how to fix case 1 but not exactly why the fix works. Nevertheless, you have the best answer so far. – Z boson Mar 27 '14 at 07:37
The language only require one term about division and modulo operator. `If the quotient a/b is representable, the expression (a/b)*b + a%b shall equal a` (C99 §6.5.5-6). C std doesn't restrict anything about compiler optimization, so one impl can calculate the quotient in any approach as long as the result is correct. As you see, this issue is compiler specific. MSVC / GCC are not making any mistakes here, they are just doing things not as well as MinGW. – Aean Mar 27 '14 at 12:04
You get the bounty. I don't think you really answered the question fully. The division is only done once so I don't think it fails to reuse the result. I highly doubt that MinGW optimizes better than GCC. But you gave a better answer than anyone else and put more effort into it than anyone else. – Z boson Apr 01 '14 at 07:47

score 2 · Answer 2 · answered Mar 21 '14 at 17:25

2

sqrt(n) is accurate enough as long as your sqrt is monotone increasing, it gets perfect squares right, and every unsigned int can be represented exactly as a double. All three of these are the case on every platform I know of.

You can get around these issues (if you consider them to be issues) by implementing a function unsigned int sqrti(unsigned int n) that returns the floor of the square root of an unsigned int using Newton's method. (This is an interesting exercise if you've never done it before!)

answered Mar 21 '14 at 17:25

tmyklebu

13,915
3
28
57

Do you have a source about sqrt getting the perfect squares right? I know unsigned in can be represented exactly as double but uint64_t can't be exactly represented by double therefore I don't see why it should be guaranteed to get the perfect squares right. Good point about doing this myself with Newton's method. I have been thinking of doing that. – Z boson Mar 21 '14 at 17:32
1

`sqrt` on most (all?) major platforms is "correctly rounded"---in round-to-nearest mode, it's always within 1/2 ulp of being right, and in the directed rounding modes, it's still within 1 ulp of being right. That implies that it gets the exact cases right (since getting them wrong would imply a difference of at least 1 ulp). IEEE 754 requires that `sqrt` be provided and correctly-rounded. However, IEEE 754 is not the C standard. – tmyklebu Mar 21 '14 at 22:38

score 2 · Answer 3 · answered Mar 21 '14 at 21:00

An answer to only a small portion of this post.

Case 2 fix to deal with overflow.

#include <limits.h>

int is_prime(unsigned n) {
  unsigned p;
  if (!(n & 1) || n < 2)
    return n == 2;

  #define UINT_MAX_SQRT (UINT_MAX >> (sizeof(unsigned)*CHAR_BIT/2))
  unsigned limit = n;
  if (n >= UINT_MAX_SQRT * UINT_MAX_SQRT)
    limit = UINT_MAX_SQRT * UINT_MAX_SQRT - 1;

  for (p = 3; p * p < limit; p += 2)
    if (!(n % p))
      return 0;

  if (n != limit)
    if (!(n % p))
      return 0;
  return 1;
}

The limit calculation fails if both sizeof(unsigned) and CHAR_BIT are odd - a rare situation.

score 1 · Answer 4 · answered Mar 25 '14 at 23:55

About your first question: why (2) is faster that (1)?
Well, this depends on the compiler, maybe.
However, in general one could expect that a division is a more expensive operation than a multiplication.

About your 2nd question: is sqrt() an accurate function?

In general, it is accurate.
The only case that could give you problems is the one that sqrt(n) is an integer.
For example, if n == 9 and sqrt(n) == 2.9999999999999 in your system, then you are in trouble there, because the integer part is 2, but the exact value is 3.
However, this rare cases can easily handled by adding a not so small double constant like 0.1, say.
Thus, you can write:

  double stop = sqrt(n) + 0.1;  
  for (unsigned int d = 2; d <= stop; d += 2)
       if (n % d == 0)
           break;  /*  not prime!! */

The added term 0.1 could add one iteration to your algorithm, which is not a big issue at all.

Finally, the obvious choice for your algorithm is (3), that is, the sqrt() approach, because there is not any calculation (multiplications or divisions), and the value stop is calculated just once.

Another improvement that you can have is the following:

Note that every prime p >= 5 has the form 6n - 1 or well 6n + 1.

So, you can alternate the increments of the variable d being 2, 4, 2, 4, and so on.

Multiplication being faster than division is NOT the reason case 1 is slower than case 2. If I save the quotient and remainder case 1 is slightly faster than case 2. The sqrt function obey's IEEE 754 and therefore will get perfect squares (such as sqrt(9)) so no correction to the sqrt is needed. I agree case 3 is the best in general. I already use a 30 spoke wheel to exclude multiples of 2, 3, and 5 in my own code. — Z boson, Mar 26 '14 at 07:07
For proper rounding, you should use: `unsigned int stop = (unsigned int)(sqrt(n) + 0.5);` Also you don't have to do a conversion every iteration loop test. — Jesse Chisholm, Apr 09 '19 at 20:10

Conditional tests in primality by trial division

4 Answers4

Linked