
At the risk of this being a duplicate, maybe I just can't find a similar post right now:

I am writing C++ (C++20, to be specific). I have a loop with a counter that increments on every iteration; let's call it counter. When this counter reaches a page limit (let's call it page_limit), the program should continue on the next page. So it looks something like this:

const size_t page_limit = 4942;
size_t counter = 0;
while (counter < foo) {
    if (counter % page_limit == 0) {
        // start new page
    }
    
    // some other code
    counter += 1;
}

Now I am wondering: since the counter goes pretty high, would the program run faster if I didn't have it calculate counter % page_limit every iteration, but instead maintained a second counter? It could look something like this:

const size_t page_limit = 4942;
size_t counter = 0;
size_t page_counter = page_limit;
while (counter < foo) {
    if (page_counter == page_limit) {
        // start new page
        page_counter = 0;
    }

    // some other code
    counter += 1;
    page_counter += 1;
}
Jere
  • Use a benchmark utility like `googlebenchmark` and find out. – NathanOliver Dec 28 '20 at 16:49
  • I would expect the counter is faster however in either case will you even notice the difference in performance. – drescherjm Dec 28 '20 at 16:50
  • This would be a micro-optimization: modern compilers optimize integer modulo operations using some crazy CPU instructions I've never even heard of, so I think you're wasting your time by asking this question. You should also check GodBolt.org before posting any compiler-optimization questions. – Dai Dec 28 '20 at 16:50
  • `if (counter % page_limit)` would start a new page when `counter && counter != page_limit` which is not what the second code snippet does. Be sure that the two versions have the same result before benchmarking at https://quick-bench.com/ – Ted Lyngmo Dec 28 '20 at 16:55
  • General rule of thumb when optimizing code: Are you calling this in excess of a billion times? Does it cause a *measurable* performance drag if you make it deliberately slower, such as `if (x % y || x % y || x % y ...)` repeated 20 times? If not, move on; it's not a problem. – tadman Dec 28 '20 at 16:55
  • IMO it depends on what architecture you're running on, and how aggressively you optimize. I think modern compilers will use the CMOV instruction to avoid a branch with the second option. (Which is great because it avoids the cost of branch prediction). In general I'd guess that equality is faster than % (because the % instruction needs to do more work) but it really depends on your architecture. – Henry Dec 28 '20 at 16:58
  • @Henry Depends on what work is involved in starting the new page. Assuming something else needs to be done, its probably going to need a branch anyway. And since the compilier should be using multiplicative inverse, its only a couple of cycles slower, not order of magnitude of an idiv. – user1937198 Dec 28 '20 at 17:04
  • You should put correctness before premature optimization. `if (counter % page_limit)` is probably not what you want. Your two snippets do different things, hence comparing their performance is not very meaningful. – 463035818_is_not_an_ai Dec 28 '20 at 17:13
  • Probably the "start a new page" code is a lot more costly than % or mod. In that case, this optimization is insignificant. – Rick James Dec 28 '20 at 17:29
  • @tadman: that's a good suggestion in general (to make something deliberately slower and see if it's measurable), but `if (x % y == 0 || x % y == 0 || x % y == 0 | ...)` isn't going to make it slower. Any decent compiler will [CSE](https://en.wikipedia.org/wiki/Common_subexpression_elimination) away those exact repeats of the same expression. And benchmarking with optimization disabled would be meaningless because un-optimized / anti-optimized machine code has *different* bottlenecks than normal. (Also, GCC might still CSE within one expression at `-O0`.) – Peter Cordes Dec 28 '20 at 23:00
  • @Jere: instead of counting up, you'd actually want to hand-hold the compiler into using a down-counter: `if (--pgcount == 0) { /*new page*/; pgcount = page_limit; }`. That's more efficient in asm and equally readable in C, so if you're micro-optimizing you should write it that way. Related: [using that technique in hand-written asm FizzBuzz](https://stackoverflow.com/questions/28334034/fizzbuzz-in-assembly-segmentation-fault/37494090#37494090). Maybe also a [code review](https://codereview.stackexchange.com/a/253716/50567) of asm sum of multiples of 3 and 5, but it does nothing for no-match. – Peter Cordes Dec 28 '20 at 23:08
  • @PeterCordes Good point. Was just suggesting to do like the dumbest thing possible, and if it has no net effect, it's not worth caring about. Like convert to string, extract the last digit, whatever's ridiculous and slow. – tadman Dec 29 '20 at 00:13
  • @tadman: yes, string conversion could work. You have to be careful to pick something that actually *is* slow after the optimizer is done with it, though; otherwise it could run exactly the same speed and you'd draw the possibly false conclusion that it doesn't matter. – Peter Cordes Dec 29 '20 at 01:11

3 Answers


(I assume you meant to write if(x%y==0) not if(x%y), to be equivalent to the counter.)

I don't think compilers will do this optimization for you, so it could be worth it. It will give smaller code size, even if you can't measure a speed difference. The x % y == 0 way still branches (so it's still subject to a branch misprediction on the rare occasions when it's true). Its only advantage is that it doesn't need a separate counter variable, just some temporary registers at one point in the loop. But it does need the divisor every iteration.

Overall this should be better for code size, and isn't less readable if you're used to the idiom. (Especially if you use `if (--page_counter == 0) { page_counter = page_limit; ... }` so all the pieces of the logic are on two adjacent lines.)

If your page_limit were not a compile-time constant, this is even more likely to help. dec/jz that's only taken once per many decrements is a lot cheaper than div/test edx,edx/jz, including for front-end throughput. (div is micro-coded on Intel CPUs as about 10 uops, so even though it's one instruction it still takes up the front-end for multiple cycles, taking away throughput resources from getting surrounding code into the out-of-order back-end).

(With a constant divisor, it's still a multiply, a right shift, and a sub to get the quotient, then a multiply and subtract to get the remainder from that. So still several single-uop instructions. Although there are some tricks for divisibility testing by small constants; see @Cassio Neri's answer on Fast divisibility tests (by 2,3,4,5,.., 16)?, which cites his journal articles; recent GCC may have started using these.)


But if your loop body doesn't bottleneck on front-end instruction/uop throughput (on x86), or on the divider execution unit, then out-of-order exec can probably hide most of the cost of even a div instruction. It's not on the critical path, so it could be mostly free if its latency overlaps with other computation and there are spare throughput resources. (Branch prediction + speculative execution allow execution to continue without waiting for the branch condition to be known, and since this work is independent of other work, the CPU can "run ahead" into future iterations.)

Still, making that work even cheaper can help the CPU see and handle a branch mispredict sooner. But modern CPUs with fast recovery can keep working on old instructions from before the branch while recovering. (What exactly happens when a skylake CPU mispredicts a branch? / Avoid stalling pipeline by calculating conditional early )

And of course a few loops do fully keep the CPU's throughput resources busy, not bottlenecking on cache misses or a latency chain. And fewer uops executed per iteration is more friendly to the other hyperthread (or SMT in general).

Or if you care about your code running on in-order CPUs (common for ARM and other non-x86 ISAs that target low-power implementations), the real work has to wait for the branch-condition logic. (Only hardware prefetch or cache-miss loads and things like that can be doing useful work while running extra code to test the branch condition.)

The divide work does take up more space in the ROB (ReOrder Buffer) so reduces how far ahead the CPU can see for out-of-order execution to find independent work between code before / after it.


Use a down-counter

Instead of counting up, you'd actually want to hand-hold the compiler into using a down-counter that can compile to dec reg / jz .new_page or similar; all normal ISAs can do that quite cheaply because it's the same kind of thing you'd find at the bottom of normal loops. (dec/jnz to keep looping while non-zero)

    if(--page_counter == 0) {
        /*new page*/;
        page_counter = page_limit;
    }

A down-counter is more efficient in asm and equally readable in C (compared to an up-counter), so if you're micro-optimizing you should write it that way. Related: using that technique in hand-written asm FizzBuzz. Maybe also a code review of asm sum of multiples of 3 and 5, but it does nothing for no-match so optimizing it is different.

Notice that page_limit is only accessed inside the if body, so if the compiler is low on registers it can easily spill that and only read it as needed, not tying up a register with it or with multiplier constants.

Or if it's a known constant, just a move-immediate instruction. (Most ISAs also have compare-immediate, but not all. e.g. MIPS and RISC-V only have compare-and-branch instructions that use the space in the instruction word for a target address, not for an immediate.) Many RISC ISAs have special support for efficiently setting a register to a wider constant than most instructions that take an immediate (like ARM movw with a 16-bit immediate, so 4942 can be done in one instruction for mov but not cmp: it doesn't fit ARM's 8-bit-rotated-by-an-even-count immediate encoding).

Compared to dividing (or using a multiplicative inverse), most RISC ISAs don't have multiply-immediate, and a multiplicative inverse is usually wider than one immediate can hold. (x86 does have multiply-immediate, but not for the form that gives you a high half.) Divide-immediate is even rarer; not even x86 has it, and even if it existed no compiler would use it unless optimizing for size instead of speed.

CISC ISAs like x86 can typically multiply or divide with a memory source operand, so if low on registers the compiler could keep the divisor in memory (especially if it's a runtime variable). Loading once per iteration (hitting in cache) is not expensive. But spilling and reloading an actual variable that changes inside the loop (like page_count) could introduce a store/reload latency bottleneck if the loop is short enough and there aren't enough registers. (Although that might not be plausible: if your loop body is big enough to need all the registers, it probably has enough latency to hide a store/reload.)

Peter Cordes

Most optimizing compilers will convert a divide or modulo operation into a multiply by a pre-computed multiplicative-inverse constant plus shift instructions if the divisor is a constant, and possibly also if the same runtime divisor is used repeatedly in a loop (computing the inverse once before the loop).
For modulo, the code multiplies by the inverse to get the quotient, multiplies the quotient by the divisor to get a product, and subtracts that product from the original number to get the remainder.
Multiply and shift are fast instructions on reasonably recent x86 processors, but branch prediction can also reduce the cost of a conditional branch, so as suggested, a benchmark may be needed to determine which is best.

rcgldr

If somebody put it in front of me, I would rather it were:

const size_t page_limit = 4942;
size_t npages = 0, nitems = 0;
size_t pagelim = foo / page_limit;
size_t resid = foo % page_limit;

while (npages < pagelim || nitems < resid) {
    // process one item
    if (++nitems == page_limit) {
        /* start new page */
        nitems = 0;
        npages++;
    }
}

Because the program now expresses the intent of the processing (a bunch of things in page_limit-sized chunks) rather than an attempt to optimize away an operation.

That the compiler might generate nicer code is just a blessing.

mevets
  • This won't compile quite as efficiently (two conditions in the loop), but it's a good tradeoff between human readability and efficiency. (Assuming you don't need the overall item number, e.g. to number the list items or something.) – Peter Cordes Jan 04 '21 at 09:48