
Thanks to some very helpful Stack Overflow users at Bit twiddling: which bit is set?, I have constructed my function (posted at the end of the question).

Any suggestions -- even small suggestions -- would be appreciated. Hopefully it will make my code better, but at the least it should teach me something. :)

Overview

This function will be called at least 10^13 times, and possibly as often as 10^15. That is, this code will run for months in all likelihood, so any performance tips would be helpful.

This function accounts for 72-77% of the program's time, based on profiling and about a dozen runs in different configurations (optimizing certain parameters not relevant here).

At the moment the function runs in an average of 50 clocks. I'm not sure how much this can be improved, but I'd be thrilled to see it run in 30.

Key Observation

If at some point in the calculation you can tell that the value that will be returned will be small (exact value negotiable -- say, below a million) you can abort early. I'm only interested in large values.

This is how I hope to save the most time, rather than by further micro-optimizations (though these are of course welcome as well!).

Performance Information

  • smallprimes is a bit array (64 bits); on average about 8 bits will be set, but it could be as few as 0 or as many as 12.
  • q will usually be nonzero. (Notice that the function exits early if q and smallprimes are zero.)
  • r and s will often be 0. If q is zero, r and s will be too; if r is zero, s will be too.
  • As the comment at the end says, nu is usually 1 by the end, so I have an efficient special case for it.
  • The calculations below the special case may appear to risk overflow, but through appropriate modeling I have proved that, for my input, this will not occur -- so don't worry about that case.
  • Functions not defined here (ugcd, minuu, star, etc.) have already been optimized; none take long to run. pr is a small array (all in L1). Also, all functions called here are pure functions.
  • But if you really care... ugcd is the gcd, minuu is the minimum, vals is the number of trailing binary 0s, __builtin_ffs is the (1-based) position of the lowest set bit, star is (n-1) >> vals(n-1), and pr is an array of the primes from 2 to 313.
  • The calculations are currently being done on a Phenom II 920 x4, though optimizations for i7 or Woodcrest are still of interest (if I get compute time on other nodes).
  • I would be happy to answer any questions you have about the function or its constituents.
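
For reference, the helpers might look roughly like this (a sketch reconstructed from the descriptions above; the signatures and exact implementations are my assumptions, and the real versions are already optimized):

```c
#include <stdint.h>

typedef unsigned long ulong;

/* minimum of two unsigned values */
static inline ulong minuu(ulong a, ulong b) { return a < b ? a : b; }

/* number of trailing binary zeros (undefined for n == 0, like __builtin_ctzll) */
static inline ulong vals(ulong n) { return (ulong)__builtin_ctzll(n); }

/* odd part of n - 1, i.e. (n - 1) >> vals(n - 1), for odd n > 1 */
static inline ulong star(ulong n) { return (n - 1) >> vals(n - 1); }

/* greatest common divisor (plain Euclid here; the real ugcd may differ) */
static inline ulong ugcd(ulong a, ulong b)
{
    while (b) { ulong t = a % b; a = b; b = t; }
    return a;
}
```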

What it actually does

Added in response to a request. You don't need to read this part.

The input is an odd number n with 1 < n < 4282250400097. The other inputs provide the factorization of the number in this particular sense:

smallprimes&1 is set if the number is divisible by 3, smallprimes&2 is set if the number is divisible by 5, smallprimes&4 is set if the number is divisible by 7, smallprimes&8 is set if the number is divisible by 11, etc., up to the most significant bit, which represents 313. A number divisible by the square of a prime is not represented differently from a number divisible by just the prime itself. (In fact, multiples of squares can be discarded; in the preprocessing stage in another function, multiples of squares of primes <= lim have smallprimes and q set to 0 so they will be dropped, where the optimal value of lim is determined by experimentation.)
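
Concretely, decoding the bitmap might look like this (pr_demo is a hypothetical stand-in for the first few entries of pr[], assumed to be indexed by bit position):

```c
/* Hypothetical stand-in for the start of pr[]: primes from 3 up, one per bit. */
static const unsigned pr_demo[] = { 3, 5, 7, 11, 13, 17, 19, 23 };

/* Extract the primes encoded in the low bits of smallprimes; returns count. */
static int decode_smallprimes(unsigned long long smallprimes,
                              unsigned out[], int max)
{
    int k = 0;
    while (smallprimes && k < max) {
        int ix = __builtin_ctzll(smallprimes); /* index of lowest set bit */
        out[k++] = pr_demo[ix];
        smallprimes &= smallprimes - 1;        /* clear that bit */
    }
    return k;
}
```

For example, smallprimes = 0xB (bits 0, 1 and 3 set) decodes to the primes 3, 5 and 11.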

q, r, and s represent larger factors of the number. Any remaining factor (which may be greater than the square root of the number, or if s is nonzero may even be less) can be found by dividing factors out from n.

Once all the factors are recovered in this way, the number of bases, 1 <= b < n, to which n is a strong pseudoprime are counted using a mathematical formula best explained by the code.

Improvements so far

  • Pushed the early exit test up. This clearly saves work so I made the change.
  • The appropriate functions are already inline, so __attribute__ ((inline)) does nothing. Oddly, marking the main function bases and some of the helpers with __attribute__ ((hot)) hurt performance by almost 2%, and I can't figure out why (but it's reproducible across over 20 tests). So I didn't make that change. Likewise, __attribute__ ((const)), at best, did not help. I was more than slightly surprised by this.

Code

ulong bases(ulong smallprimes, ulong n, ulong q, ulong r, ulong s)
{
    if (!smallprimes & !q)
        return 0;

    ulong f = __builtin_popcountll(smallprimes) + (q > 1) + (r > 1) + (s > 1);
    ulong nu = 0xFFFF;  // "Infinity" for the purpose of minimum
    ulong nn = star(n);
    ulong prod = 1;

    while (smallprimes) {
        ulong bit = smallprimes & (-smallprimes);
        ulong p = pr[__builtin_ffsll(bit)];
        nu = minuu(nu, vals(p - 1));
        prod *= ugcd(nn, star(p));
        n /= p;
        while (n % p == 0)
            n /= p;
        smallprimes ^= bit;
    }
    if (q) {
        nu = minuu(nu, vals(q - 1));
        prod *= ugcd(nn, star(q));
        n /= q;
        while (n % q == 0)
            n /= q;
    } else {
        goto BASES_END;
    }
    if (r) {
        nu = minuu(nu, vals(r - 1));
        prod *= ugcd(nn, star(r));
        n /= r;
        while (n % r == 0)
            n /= r;
    } else {
        goto BASES_END;
    }
    if (s) {
        nu = minuu(nu, vals(s - 1));
        prod *= ugcd(nn, star(s));
        n /= s;
        while (n % s == 0)
            n /= s;
    }

    BASES_END:
    if (n > 1) {
        nu = minuu(nu, vals(n - 1));
        prod *= ugcd(nn, star(n));
        f++;
    }

    // This happens ~88% of the time in my tests, so special-case it.
    if (nu == 1)
        return prod << 1;

    ulong tmp = f * nu;
    long fac = 1 << tmp;
    fac = (fac - 1) / ((1 << f) - 1) + 1;
    return fac * prod;
}
Charles
    Can you explain what the code is actually trying to compute? I'm terrible at number theory, but someone who's better at it might be able to see a big algorithmic improvement that you could make. – celion Aug 17 '10 at 07:25
  • I've added an explanation. See also http://oeis.org/classic/A141768 for some background. It's hard to imagine that someone could optimize it on that basis at this point, but I'd be grateful if someone proved me wrong on that point! – Charles Aug 17 '10 at 14:15
    Someone with the rep really needs to create a mature-optimization tag reserved for situations such as this... – DonaldRay Aug 17 '10 at 17:29
  • So... `vals()` is equivalent to `__builtin_ctz()`? Or is it different by being defined for a zero argument? – slacker Aug 18 '10 at 15:57
  • At this point it's a wrapper for __builtin_ctz() for systems that have it, yes. – Charles Aug 19 '10 at 02:31
  • Isn't the *tmp* variable useless? It is used only once. The code would be slightly more readable without it :). – slacker Aug 19 '10 at 14:04
    @slacker: Yes... and no. I mentioned that the calculation `1 << tmp` is guaranteed to not overflow in this implementation, but I wrote code to handle the case when it did. Because right now that doesn't apply that code is commented out, but I left it in place in case I later generalize it. Good catch, though! – Charles Aug 19 '10 at 17:54
  • Surely the `goto BASES_END` is unnecessary, just put the code after the else into the preceding if block. Maybe the compiler did this, but you might want to check that. – Skizz Aug 24 '10 at 15:46
  • No, it's there to prevent testing that isn't needed. If q is 0, r and s should not be tested. I'm assuming gcc is smart enough to change the double `jmp` to a single `jmp`, of course. I suppose I could have written it as a switch statement with fallthrough, though. – Charles Aug 24 '10 at 17:55
  • @Charles: I meant this instead: `if (q) { ... if (r) { ... if (s) { ... } } }` so the `if (r)` is only tested if the `if (q)` is true and the same goes for the `if (s)`. – Skizz Sep 01 '10 at 14:31
    Thanks all! I've now run the program to the limit of the built-in format. I'm in the process of a partial rewrite to allow it to go higher. – Charles Sep 13 '10 at 19:30
    Just a side node, there is currently a beta stack exchange site for code review that you may like to check out http://codereview.stackexchange.com – Manatherin Feb 21 '11 at 14:22
  • Charles, I'd be highly interested in how much the suggestions given in the answers to your question actually improved the runtime (did you get anywhere close to 30 clocks)? Maybe you could update your question with some (possibly preliminary) results? – Frerich Raabe Apr 14 '11 at 07:27
  • @Frerich Raabe: I got it to about 40 clocks, which was fast enough for me to finish the calculation (!) to the design limits of my data structure. To calculate further I'd need to rewrite that, but I haven't felt the need. – Charles Apr 14 '11 at 13:10

16 Answers


You seem to be wasting a lot of time doing divisions by the factors. It is much faster to replace a division with a multiplication by the reciprocal of the divisor (division: ~15-80(!) cycles, depending on the divisor; multiplication: ~4 cycles), IF of course you can precompute the reciprocals.

While this seems unlikely to be possible with q, r, s - due to the range of those vars, it is very easy to do with p, which always comes from the small, static pr[] array. Precompute the reciprocals of those primes and store them in another array. Then, instead of dividing by p, multiply by the reciprocal taken from the second array. (Or make a single array of structs.)

Now, obtaining exact division result by this method requires some trickery to compensate for rounding errors. You will find the gory details of this technique in this document, on page 138.

EDIT:

After consulting Hacker's Delight (an excellent book, BTW) on the subject, it seems that you can make it even faster by exploiting the fact that all divisions in your code are exact (i.e. remainder is zero).

It seems that for every divisor d which is odd, and base B = 2^word_size, there exists a unique multiplicative inverse d* which satisfies the conditions: d* < B and d·d* ≡ 1 (mod B). For every x which is an exact multiple of d, this implies x/d ≡ x·d* (mod B). Which means you can simply replace a division with a multiplication, with no added corrections, checks, rounding problems, whatever. (The proofs of these theorems can be found in the book.) Note that this multiplicative inverse need not be equal to the reciprocal as defined by the previous method!

How to check whether a given x is an exact multiple of d, i.e. x mod d = 0? Easy! x mod d = 0 iff x·d* mod B ≤ ⌊(B-1)/d⌋. Note that this upper limit can be precomputed.

So, in code:

unsigned x, d;
unsigned inv_d = mulinv(d);          //precompute this!
unsigned limit = (unsigned)-1 / d;   //precompute this!

unsigned q = x*inv_d;
if(q <= limit)
{
   //x % d == 0
   //q == x/d
} else {
   //x % d != 0
   //q is garbage
}

Assuming the pr[] array becomes an array of struct prime:

struct prime {
   ulong p;
   ulong inv_p;  //equal to mulinv(p)
   ulong limit;  //equal to (ulong)-1 / p
}

the while(smallprimes) loop in your code becomes:

while (smallprimes) {
    ulong bit = smallprimes & (-smallprimes);
    int bit_ix = __builtin_ffsll(bit);
    ulong p = pr[bit_ix].p;
    ulong inv_p = pr[bit_ix].inv_p;
    ulong limit = pr[bit_ix].limit;
    nu = minuu(nu, vals(p - 1));
    prod *= ugcd(nn, star(p));
    n *= inv_p;
    for(;;) {
        ulong q = n * inv_p;
        if (q > limit)
            break;
        n = q;
    }
    smallprimes ^= bit;
}

And for the mulinv() function:

ulong mulinv(ulong d) //d needs to be odd
{
   ulong x = d;
   for(;;)
   {
      ulong tmp = d * x;
      if(tmp == 1)
         return x;
      x *= 2 - tmp;
   }
}

Note you can replace ulong with any other unsigned type - just use the same type consistently.

The proofs, whys and hows are all available in the book. A heartily recommended read :-).

slacker

If your compiler supports GCC function attributes, you can mark your pure functions with this attribute:

ulong star(ulong n) __attribute__ ((const));

This attribute indicates to the compiler that the result of the function depends only on its argument(s). This information can be used by the optimiser.

Is there a reason why you've opencoded vals() instead of using __builtin_ctz() ?

caf
  • @caf: isn't there also a `pure` attribute? – Jens Gustedt Aug 17 '10 at 07:49
    @Jens Gustedt: Yes, but `pure` is less strict than `const` (`pure` functions are allowed to access global variable, `const` ones are not). – caf Aug 17 '10 at 07:54
  • Good point. I suppose I'll put `__attribute__ ((hot))` in too. – Charles Aug 17 '10 at 15:08
  • I think all of those attributes only matter for out-of-line functions. What you really want is `always_inline` – celion Aug 17 '10 at 17:47
  • Yes -- sorry, I meant hot for the main function above. The helper functions already have inline directives. – Charles Aug 18 '10 at 01:00
    Careful with `always_inline` more often than not it is counterproductive. The optimizer often knows more about the tradeoff between I$ misses and call overhead than the programmer. I have a collegue who is using all times, I tried his code with the option `-fno-inline` which overrides even `always_inline` and the resulting program was 10% faster (and that on SPARC with it expensive windowed registers calls). – Patrick Schlüter Aug 25 '10 at 14:41

It is still somewhat unclear what you are searching for. Quite frequently, number-theoretic problems allow huge speedups by deriving mathematical properties that the solutions must satisfy.

If you are indeed searching for the integers that maximize the number of non-witnesses for the MR test (i.e. oeis.org/classic/A141768 that you mention), then it might be possible to use the facts that the number of non-witnesses cannot be larger than phi(n)/4, and that the integers which have this many non-witnesses are either the product of two primes of the form

(k+1)*(2k+1)

or they are Carmichael numbers with 3 prime factors. I'd think that above some limit all integers in the sequence have this form, and that it is possible to verify this by proving an upper bound for the witnesses of all other integers. E.g. integers with 4 or more factors always have at most phi(n)/8 non-witnesses. Similar results can be derived from your formula for the number of bases for other integers.

As for micro-optimizations: whenever you know that an integer is divisible by some divisor q, it is possible to replace the division by a multiplication with the inverse of q modulo 2^64. And the test n % q == 0 can be replaced by the test

n * inverse_q < max_q,

where inverse_q = q^(-1) mod 2^64 and max_q = 2^64 / q. Obviously inverse_q and max_q need to be precomputed, to be efficient, but since you are using a sieve, I assume this should not be an obstacle.
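
A sketch of that test (the Newton-iteration inverse is the standard construction, also covered in slacker's answer; the function names here are mine):

```c
#include <stdint.h>

/* q^(-1) mod 2^64 for odd q, by Newton iteration: x = q is already
   correct mod 8, and each step doubles the number of correct low bits. */
static uint64_t inv64(uint64_t q)
{
    uint64_t x = q;
    for (int i = 0; i < 5; i++)
        x *= 2 - q * x;
    return x;
}

/* n % q == 0  iff  n * inverse_q <= max_q, where max_q = UINT64_MAX / q.
   Both inverse_q and max_q should be precomputed per q. */
static int divisible(uint64_t n, uint64_t inverse_q, uint64_t max_q)
{
    return n * inverse_q <= max_q;
}
```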

abc
  • I will look very seriously into this. I have not been well these past few days; apologies for appearing to ignore your thoughtful comment. – Charles Aug 20 '10 at 22:54
  • @Charles I tend to think that running your program to find the values of the sequence A141768 is a waste of time. But I also think that your program could be used to find the average number of bases that pass the Miller-Rabin test for a random k-bit integer. Comparing the results with the paper by Damgård, Landrock, Pomerance, "Average case error estimates for the strong probable prime test" would be interesting. But, of course that is only my own opinion. – abc Aug 24 '10 at 15:24
  • Of course the search is partially inspired by DLP. But I don't think this method will be much use for checking their figures, since the area of interest is perhaps 300 to 3000 bits, where factoring is difficult and sieving impossible. – Charles Aug 24 '10 at 17:51

Small optimization but:

ulong f;
ulong nn;
ulong nu = 0xFFFF;  // "Infinity" for the purpose of minimum
ulong prod = 1;

if (!smallprimes & !q)
    return 0;

// no need to do these operations before, because of the early return above
f = __builtin_popcountll(smallprimes) + (q > 1) + (r > 1) + (s > 1);
nn = star(n);

BTW: you should edit your post to add the definitions of star() and the other functions you use

  • I edited my question as you were typing to describe the functions. The exact code shouldn't be too important -- they've been optimized already. But maybe I'll post them anyway; I just didn't want to drive anyone off with too long a code block. – Charles Aug 17 '10 at 06:36
  • Oh... this might seem like a small optimization (and it was certainly stupid for me to miss it!), but over the life of my calculation I estimate this will save 200 trillion cycles. – Charles Aug 17 '10 at 06:40

Try replacing this pattern (for r and q too):

n /= p; 
while (n % p == 0) 
  n /= p; 

With this:

ulong m;
...
m = n / p;
do {
    n = m;
    m = n / p;
} while (m * p == n);

In my limited tests, I got a small speedup (10%) from eliminating the modulo.

Also, if p, q or r were constant, the compiler would replace the divisions by multiplications. If there are few choices for p, q or r, or if certain values are more frequent, you might gain something by specializing the function for those values.

ergosys
  • I'll try that. Unfortunately none of p, q, or r are constant and the values can vary widely (from 317 to several million). – Charles Aug 17 '10 at 13:51
    Having the one assembly DIV instruction for the division (quotient) and modulo (remainder) at the same time would be faster that a division and a multiplication. – Toad Aug 17 '10 at 17:22
  • I was also going to suggest avoiding using % as well, it's just as expensive as dividion, where as multiplication is usually quite fast. – Axman6 Apr 14 '11 at 07:10

Have you tried using profile-guided optimisation?

Compile and link the program with the -fprofile-generate option, then run the program over a representative data set (say, a day's worth of computation).

Then re-compile and link it with the -fprofile-use option instead.
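
With GCC, the workflow is roughly the following (a sketch; the file and program names are placeholders):

```shell
# 1. Build with instrumentation.
gcc -O3 -fprofile-generate -o bases_prof main.c

# 2. Run on a representative data set; this writes .gcda profile files.
./bases_prof representative_input

# 3. Rebuild, letting the optimizer use the collected profile.
gcc -O3 -fprofile-use -o bases main.c
```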

caf

1) I would make the compiler spit out the assembly it generates and try to deduce whether what it does is the best it can do... and if you spot problems, change the code so the assembly looks better. This way you can also make sure that the functions you hope it'll inline (like star and vals) are really inlined. (You might need to add pragmas, or even turn them into macros.)

2) It's great that you try this on a multicore machine, but this loop is singlethreaded. I'm guessing that there is an umbrella function which splits the load across a few threads so that more cores are used?

3) It's difficult to suggest speedups if what the function actually tries to calculate is unclear. Typically the most impressive speedups are not achieved with bit twiddling, but with a change in the algorithm. So a few comments might help ;^)

4) If you really want a speedup of 10× or more, check out CUDA or OpenCL, which allow you to run C programs on your graphics hardware. It shines with functions like these!

5) You are doing loads of modulo and divide operations right after each other. In C these are 2 separate operations (first '/' and then '%'). However, in assembly this is 1 instruction, 'DIV' or 'IDIV', which returns both the remainder and the quotient in one go:

B.4.75 IDIV: Signed Integer Divide

IDIV r/m8                     ; F6 /7                [8086]
IDIV r/m16                    ; o16 F7 /7            [8086]
IDIV r/m32                    ; o32 F7 /7            [386]

IDIV performs signed integer division. The explicit operand provided is the divisor; the dividend and destination operands are implicit, in the following way:

For IDIV r/m8, AX is divided by the given operand; the quotient is stored in AL and the remainder in AH.

For IDIV r/m16, DX:AX is divided by the given operand; the quotient is stored in AX and the remainder in DX.

For IDIV r/m32, EDX:EAX is divided by the given operand; the quotient is stored in EAX and the remainder in EDX.

So it will require some inline assembly, but I'm guessing there'll be a significant speedup as there are a few places in your code which can benefit from this.
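
Short of inline assembly, one hedge is to compute quotient and remainder side by side from the same operands, so the compiler can see it needs only one division (a sketch; whether GCC actually emits a single DIV should be verified in the generated assembly, and the standard library's div()/ldiv() are another option):

```c
typedef unsigned long ulong;

/* Quotient and remainder from the same operands; compilers can usually
   compute both with a single division instruction. */
static inline void moddiv(ulong n, ulong d, ulong *quot, ulong *rem)
{
    *quot = n / d;
    *rem  = n % d;
}
```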

Toad
  • These are great suggestions (+1), but I don't really think they help me much here. For #4, CUDA would be pretty good at this task, but rewriting the program for it would be non-trivial, since information going to the GPU travels through a small pipe (and so to do it efficiently I'd have to move over a lot of my program, not seen here, to CUDA). Also, that takes a lot of skill I don't have! For #3, I've spent a long time designing the algorithm for this particular task; I'm more mathematician than programmer. Your guess in #2 is accurate -- I'll run four copies of the program. – Charles Aug 17 '10 at 06:54
  • #1 is a good thought, and certainly worthwhile. My assembly is more than a little rusty, but I have a nice manual handy if I get stuck. I'll let you know if I have any progress with this. – Charles Aug 17 '10 at 06:55
    @charles Don't underestimate yourself. Cuda runs C programs, so a lot of the skills you already possess. It's true that it would need some re-architecting for dataflow. It's certainly non-trivial but the rewards would be very sweet. (10* speed up is typical... twice as fast is not uncommon) – Toad Aug 17 '10 at 06:59
  • I will look at the generated assembly. I thought that gcc was smart enough to emit only one IDIV in cases like these, but perhaps not. – Charles Aug 17 '10 at 14:36
  • @Charles: If you create a small function to perform mod-and-div: `void moddiv(int n, int m, int *d, int *r) { *d = n / m; *r = n % m; }` then that seems to help gcc see that it can use just one `IDIV` (that function gets inlined, and the indirection gets optimised out). – caf Aug 17 '10 at 23:53
    Um boyz, `div` is in the standard library of C, no need to reemplement it. I suppose GNU-C is able to compile it as intrinsic that way. http://www.gnu.org/software/libtool/manual/libc/Integer-Division.html – Patrick Schlüter Aug 25 '10 at 14:53
  1. Make sure your functions get inlined. If they're out-of-line, the overhead might add up, especially in the first while loop. The best way to be sure is to examine the assembly.

  2. Have you tried pre-computing star( pr[__builtin_ffsll(bit)] ) and vals( pr[__builtin_ffsll(bit)] - 1) ? That would trade some simple work for an array lookup, but it might be worth it if the tables are small enough.

  3. Don't compute f until you actually need it (near the end, after your early-out). You can replace the code around BASES_END with something like


BASES_END:
ulong addToF = 0;
if (n > 1) {
    nu = minuu(nu, vals(n - 1));
    prod *= ugcd(nn, star(n));
    addToF = 1;
}
// ... early out if nu == 1...
// ... compute f ...
f += addToF;

Hope that helps.

celion
  • Very helpful suggestions! Yes, those will surely be worthwhile to precompute. – Charles Aug 17 '10 at 13:53
  • The smallprimes loop destroys the values in smallprimes, so I'll still need to do popcountll(smallprimes) at the top. But moving the rest down should save me 50 trillion cycles. – Charles Aug 17 '10 at 17:46
  • Make sure you test it well. Trading a few bit shifts for a cache miss isn't a good trade. – celion Aug 17 '10 at 17:47
  • Or you could just copy the original value of smallprimes. I'm not sure what popcountll actually compiles down to. – celion Aug 17 '10 at 17:49
    If you tell the compiler to use the latest and greatest instruction set extensions (which I assume OP is doing, given his need for performance), then `popcountll()` will compile to a single POPCNT instruction, which takes 2 clock cycles on AMD, 3 cycles on Intel. – slacker Aug 19 '10 at 06:46

First, some nitpicking ;-) You should be more careful about the types you are using. In some places you seem to assume that ulong is 64 bits wide; use uint64_t there. And for all the other types too, rethink carefully what you expect of them and use the appropriate type.

The optimization that I can see is integer division. Your code does that a lot; it is probably the most expensive thing you are doing. Division of small integers (uint32_t) may be much more efficient than division of big ones. In particular, for uint32_t there is an assembler instruction, divl, that does division and modulo in one go.

If you use the appropriate types your compiler might do that all for you. But you'd better check the assembler (option -S to gcc) as somebody already said. Otherwise it is easy to include some little assembler fragments here and there. I found something like that in some code of mine:

register uint32_t a asm("eax") = 0;
register uint32_t ret asm("edx") = 0;
asm("divl %4"
    : "=a" (a), "=d" (ret)
    : "0" (a), "1" (ret), "rm" (divisor));

As you can see this uses special registers eax and edx and stuff like that...

Jens Gustedt
  • In this application, there are #ifdef statements that guarantee that this code is never seen except in the 64-bit case (and the ulong is a typedef, I think). – Charles Aug 17 '10 at 15:04
  • @Charles: ok, but this is only one part of the problem. Use smaller types where you don't need the full width. First, this might release pressure from your optimizer, since registers are still a scarce resource. And then, as for the `ldiv` in my example there might be faster assembler operations available for smaller types. And when doing so, it is much easier to read `uint32_t` mixed with `uint64_t` then `ulong` ;-) – Jens Gustedt Aug 17 '10 at 15:34
  • I may be able to guarantee that q, r, and s can fit in a uint32_t, so in that case this may save some serious time. (I'll check the actual performance difference and get back to you.) On ulong, I'm just following the coding style of the other 200,000 LOC... – Charles Aug 17 '10 at 15:48

Did you try a table-lookup version of the first while loop? You could split smallprimes into four 16-bit values, look up their contributions and merge them. But maybe you need the side effects.

  • No side-effects _per se_, all functions called from here are pure. So maybe lookup is possible with multiple tables... not sure. Also: I edited this clarification into my question since I wasn't clear originally. – Charles Aug 17 '10 at 06:46

Did you try passing in an array of primes instead of splitting them into smallprimes, q, r and s? Since I don't know what the outer code does, I am probably wrong, but there is a chance that you also have a function to convert some primes to a smallprimes bitmap, and inside this function you effectively convert the bitmap back to an array of primes. In addition, you seem to do identical processing for the elements of smallprimes, q, r, and s. It should save you a tiny amount of processing per call.

Also, you seem to know that the passed in primes divide n. Do you have enough knowledge outside about the power of each prime that divides n? You could save a lot of time if you can eliminate the modulo operation by passing in that information to this function. In other words, if n is pow(p_0,e_0)*pow(p_1,e_1)*...*pow(p_k,e_k)*n_leftover, and if you know more about these e_is and n_leftover, passing them in would mean a lot of things you don't have to do in this function.


There may be a way to discover n_leftover (the unfactored part of n) with fewer modulo operations, but it is only a hunch, so you may need to experiment with it a bit. The idea is to use gcd to remove known factors from n repeatedly until you get rid of all known prime factors. Let me give some almost-C code:

factors=p_0*p_1*...*p_k*q*r*s;
n_leftover=n/factors;
do {
    factors=gcd(n_leftover, factors);
    n_leftover = n_leftover/factors;
} while (factors != 1);

I am not at all certain this will be better than the code you have, let alone the combined mod/div suggestions you can find in other answers, but I think it is worth a try. I feel that it will be a win, especially for numbers with high numbers of small prime factors.
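
A runnable version of that sketch (slightly simplified so that factors stays fixed; gcd is plain Euclid; all names here are mine):

```c
typedef unsigned long long u64;

static u64 gcd64(u64 a, u64 b)
{
    while (b) { u64 t = a % b; a = b; b = t; }
    return a;
}

/* Divide out every occurrence of the known prime factors of n.
   factors must be the squarefree product of known primes dividing n. */
static u64 n_leftover(u64 n, u64 factors)
{
    u64 rest = n / factors;
    for (;;) {
        u64 g = gcd64(rest, factors);
        if (g == 1)
            return rest;
        rest /= g;  /* strip another copy of each shared prime */
    }
}
```

For example, n = 588 = 2^2 * 3 * 7^2 with known primes 2 and 3 (factors = 6) yields 49, the unfactored part.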

vhallac
  • This sounds like it could be a win to me, too - rather than passing in a bitmap, pass in a pointer to a an array, where each element of the array is the power of the corresponding prime in the factorisation of `n`. – caf Aug 17 '10 at 23:12
  • Yeah, I can't really do that. I'm already spending 2 GB on storing the numbers as bitmaps, if I stored them as arrays I couldn't store as many numbers in memory so I'd have to spend more time sieving. This would be a significant cost. – Charles Aug 18 '10 at 01:03
  • Fair enough. If memory is an issue, then using the bitmap is definitely a necessity. – vhallac Aug 18 '10 at 07:31
  • The sieve in my answer actually produces a linked list of primes that divide n, but not the exponents. It only consumes a 4MB so it should fit in cache :-) – phkahler Aug 18 '10 at 19:30
  • You seem to have sensed what I was thinking - about caches, but I thought I refrained from writing it down. :) In any case, given that you don't know the exponents, I've added an algorithm that can discover the unfactored part using gcd instead of modulo (edited the answer). You may want to give it a try. – vhallac Aug 18 '10 at 20:48

You're passing in the complete factorization of n, so you're factoring consecutive integers and then using the results of that factorization here. It seems to me that you might benefit from doing some of this at the time of finding the factors.

BTW, I've got some really fast code for finding the factors you're using without doing any division. It's a little like a sieve but produces factors of consecutive numbers very quickly. Can find it and post if you think it may help.

edit had to recreate the code here:

#include <stdio.h>
#define SIZE (1024*1024) //must be 2^n
#define MASK (SIZE-1)
typedef struct {
    int p;
    int next;
} p_type;

p_type primes[SIZE];
int sieve[SIZE];

void init_sieve()
{
    int i,n;
    int count = 1;
    primes[1].p = 3;
    sieve[1] = 1;
    for (n=5;SIZE>n;n+=2)
    {
        int flag = 0;
        for (i=1;count>=i;i++)
        {
            if ((n%primes[i].p) == 0)
            {
                flag = 1;
                break;
            }
        }
        if (flag==0)
        {
            count++;
            primes[count].p = n;
            sieve[n>>1] = count;
        }
    }
}

int main()
{
    int ptr,n;
    init_sieve();
    printf("init_done\n");

    // factor odd numbers starting with 3
    for (n=1;1000000000>n;n++)
    {
        ptr = sieve[n&MASK];
        if (ptr == 0) //prime
        {
//            printf("%d is prime",n*2+1);
        }
        else //composite
        {
//            printf ("%d has divisors:",n*2+1);
            while(ptr!=0)
            {
//                printf ("%d ",primes[ptr].p);
                sieve[n&MASK]=primes[ptr].next;
                //move the prime to the next number it divides
                primes[ptr].next = sieve[(n+primes[ptr].p)&MASK];
                sieve[(n+primes[ptr].p)&MASK] = ptr;
                ptr = sieve[n&MASK];
            }
        }
//        printf("\n");
    }
    return 0;
}

The init function creates a factor base and initializes the sieve. This takes about 13 seconds on my laptop. Then all numbers up to 1 billion are factored or determined to be prime in another 25 seconds. Numbers less than SIZE are never reported as prime because they have 1 factor in the factor base, but that could be changed.

The idea is to maintain a linked list for every entry in the sieve. Numbers are factored by simply pulling their factors out of the linked list. As they are pulled out, they are inserted into the list for the next number that will be divisible by that prime. This is very cache friendly too. The sieve size must be larger than the largest prime in the factor base. As is, this sieve could run up to 2**40 in about 7 hours which seems to be your target (except for n needing to be 64 bits).

Your algorithm could be merged into this to make use of the factors as they are identified, rather than packing bits and large primes into variables to pass to your function. Or your function could be changed to take the linked list (you could create a dummy link to pass in for primes outside the factor base).

Hope it helps.

BTW, this is the first time I've posted this algorithm publicly.

phkahler
  • 5,687
  • 1
  • 23
  • 31
  • I am sieving to generate the factors, yes. (It's actually consecutive odd numbers -- even numbers are of no interest in this problem.) The program only ends up spending about a quarter of its time in the sieve and the rest here. But if your code is clever, feel free to post it. – Charles Aug 17 '10 at 17:38
1

Just a thought, but maybe using your compiler's optimization options would help, if you haven't already. Another thought: if money isn't an issue, you could use the Intel C/C++ compiler, assuming you're using an Intel processor. I'd also assume that other processor manufacturers (AMD, etc.) have similar compilers.

romejoe
  • 423
  • 1
  • 5
  • 16
  • The key optimizations I'm using at the moment are `-O3 -fno-strict-aliasing -fomit-frame-pointer`. So as not to muck with the rest of the program, and further compiler optimization will probably have to be function attributes. icc isn't possible as I'm using a Phenom II 920 (you probably missed this in my wall of text above...). – Charles Aug 19 '10 at 06:44
  • @Charles: Did you tell the compiler to optimize specifically for this CPU model? A small `-march=amdfam10` can make wonders. – slacker Aug 19 '10 at 10:34
  • @Charles: Note that if you don't specify the minimal architecture with `-march`, the compiler won't emit any instructions which are not supported by older CPUs - you're down to SSE2 on x86-64. This means no POPCNT or LZCNT, among others. – slacker Aug 19 '10 at 10:42
  • If you don't want to mess with the rest of the program, couldn't you put the function in a library by itself, max out the optimizations on that library, and compile the rest of the program normally? Also, I found this from AMD, though I don't know how it works because I don't use AMD chips: http://developer.amd.com/cpu/open64/pages/default.aspx – romejoe Aug 19 '10 at 14:56
  • I believe I'm specifying the architecture, but I'll check. If not this could save me some time! – Charles Aug 19 '10 at 17:55
1

If you are going to exit immediately on `(!smallprimes & !q)`, why not do that test before even calling the function, and save the function call overhead?

Also, it seems like you effectively have 3 different functions which are linear except for the smallprimes loop: `bases1(s,n,q)`, `bases2(s,n,q,r)`, and `bases3(s,n,q,r,s)`.

It might be a win to actually create those as 3 separate functions without the branches and gotos, and call the appropriate one:

if (!(smallprimes|q)) { r = 0; }
else if (s) { r = bases3(s,n,q,r,s); }
else if (r) { r = bases2(s,n,q,r); }
else        { r = bases1(s,n,q); }

This would be most effective if previous processing has already given the calling code some 'knowledge' of which function to execute and you don't have to test for it.

AShelly
  • 34,686
  • 15
  • 91
  • 152
  • There's no function overhead; the function will be inlined (-finline-functions-called-once is included in -O1 and above). The other recommendation would be good, except that the calling function won't have any knowledge. – Charles Sep 05 '10 at 03:02
1

If the divisions you're using involve numbers that aren't known at compile time but are used frequently at runtime (dividing by the same number many times), then I would suggest the libdivide library, which basically implements at runtime the optimisations that compilers perform for compile-time constants (shifts, masks, etc.). This can provide a huge benefit. Avoiding `x % y == 0` in favour of `z = x/y; z * y == x`, as ergosys suggested above, should also give a measurable improvement.

Axman6
  • 909
  • 5
  • 16
  • +1 for the idea, but it doesn't help -- it's many numbers over a wide range, each used once. – Charles Apr 14 '11 at 13:11
  • Ah that's a shame. Oh well, libdivide is still a useful tool for you to know about. The author has an excellent blog too: [ridiculousfish.com](http://ridiculousfish.com/). – Axman6 Apr 19 '11 at 00:55
0

Is the code in your top post the optimized version? If so, there are still many division operations, which eat a lot of CPU cycles.

This code does slightly more work than necessary:

if (!smallprimes & !q)
    return 0;

Changing it to the logical AND `&&`:

if (!smallprimes && !q)
    return 0;

lets it short-circuit: `q` is never evaluated when `smallprimes` is nonzero.

And the following code

ulong bit = smallprimes & (-smallprimes);
ulong p = pr[__builtin_ffsll(bit)];

which is used to find the lowest set bit of smallprimes. Why don't you use the simpler way (a 64-bit operand needs the `ll` variant, and `__builtin_ffsll` returns a 1-based position, hence the `+ 1`):

ulong p = pr[__builtin_ctzll(smallprimes) + 1];

Another culprit for decreased performance may be excessive branching. You could consider replacing some of the branches with branch-reduced or branch-free equivalents.

phuclv
  • 37,963
  • 15
  • 156
  • 475
  • The additional branch in `!smallprimes && !q` actually makes it more expensive than `!smallprimes & !q`. As for the other, I need the value of `bit` later in the code, and if I made the query directly on `smallprimes` the array would have length 2^64 and thus not fit in memory. – Charles Aug 04 '13 at 16:15