
I have a simple function that sums elements from an array and returns the total:

```c
// Called with jump == 0
int performance(int jump, int *array, int size) {
  int currentIndex = 0;
  int total = 0;
  // 500 million iterations
  for (int i = 0; i < 500000000; i++) {
    currentIndex = (currentIndex + jump) % size;
    total += array[currentIndex];
  }
  return total;
}
```

I noticed a weird behavior: the presence of `% size` has a very large performance impact (~10x slower) even though `jump` is 0, so the loop constantly accesses the same array element (index 0). Just removing `% size` improves performance a lot.

I would have thought this was just the cost of the modulo computation itself, but if I instead replace my sum line with `total += array[currentIndex] % size;` (thus also computing a modulo), the performance difference is almost unnoticeable.
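
For clarity, the full variant I'm comparing against looks like this (reconstructed from the description above; the function name is just a placeholder, and with `jump == 0` the index stays in bounds even without the `%`):

```c
// Same as above, but the % has moved from the index update to the sum.
int performance2(int jump, int *array, int size) {
  int currentIndex = 0;
  int total = 0;
  for (int i = 0; i < 500000000; i++) {
    currentIndex = currentIndex + jump;   // plain add, no remainder
    total += array[currentIndex] % size;  // the modulo now happens here
  }
  return total;
}
```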

I am compiling this with clang at `-O3` on an arm64 machine.

What could be causing this?

Pop Flamingo
  • what does your disassembly look like? – old_timer Oct 22 '21 at 02:10
  • @old_timer You'll find both files here: https://gist.github.com/PopFlamingo/abe364eabcadc78576ea9c1b2d642b1e – Pop Flamingo Oct 22 '21 at 02:30
  • @old_timer I compiled them with -Os here but the performance difference is the same – Pop Flamingo Oct 22 '21 at 02:30
  • External links are pretty much useless here; if it is not inline with the question (on the stackoverflow site/server), then it basically doesn't exist. Divide is a pretty costly operation and modulo may or may not add even more, be it a library or an instruction, so this is not really that surprising, but the disassembly may show more cost than just the division/modulo. – old_timer Oct 22 '21 at 03:52
  • You should calculate `jump %= size`, then `i=min(i+jump,i+jump-size)` with all variables unsigned for way better throughput for the modulo. – Aki Suihkonen Oct 23 '21 at 06:13
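
A sketch of what Aki Suihkonen's comment describes (a reconstruction, not tested code; assumes `jump >= 0` and `size > 0`, and clang can turn the `if` into a branchless `csel` on arm64):

```c
// Hoist a single % out of the loop, then keep the index in range with a
// compare/subtract instead of a per-iteration remainder.
int performance_nomod(int jump, int *array, int size) {
  unsigned step = (unsigned)jump % (unsigned)size;  // one modulo, up front
  unsigned idx = 0;
  int total = 0;
  for (int i = 0; i < 500000000; i++) {
    idx += step;                    // idx < 2*size after this add
    if (idx >= (unsigned)size)      // same effect as Aki's unsigned min()
      idx -= (unsigned)size;        // conditional subtract instead of %
    total += array[(int)idx];
  }
  return total;
}
```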

1 Answer


Sounds normal for `sdiv` + `msub` latency to be about 10x `add` latency.
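
As a back-of-envelope check (with illustrative latencies, not measured numbers for any specific core): if `sdiv` is ~9 cycles of latency and `msub` ~3 cycles, the loop-carried chain costs roughly add(1) + sdiv(9) + msub(3) ≈ 13 cycles per iteration, versus ~1 cycle when the carried chain is just an add. That's the right order of magnitude for a ~10x slowdown.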

Even if this inlined into a caller with a compile-time-constant `size` that wasn't a power of 2, that's still a multiplicative inverse and an `msub` (multiply-subtract) to get the remainder, so a dependency chain of at least two multiplies and a shift.

Maybe an extra few instructions on the critical path for a signed remainder with a constant `size` (even if it's known positive), since everything here is signed `int`: e.g. `-4 % 3` has to produce `-1` in C.
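
To make that concrete, here is roughly the kind of sequence a compiler uses for a constant signed divisor (a sketch for divisor 3, using the standard magic constant from Hacker's Delight; actual clang output for your `size` would differ in detail):

```c
#include <stdint.h>

// Remainder of x / 3 without a divide: multiply by a fixed-point inverse,
// take the high half, fix up the quotient for negative x, then
// multiply-subtract (the msub) to recover the remainder.
int rem3(int x) {
  int q = (int)(((int64_t)x * 0x55555556LL) >> 32);  // high half of the multiply
  q += (uint32_t)x >> 31;                            // +1 if x is negative
  return x - q * 3;                                  // remainder = x - q*divisor
}
// rem3(-4) == -1, matching C's truncation-toward-zero semantics.
```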

See [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) for more about analyzing this kind of bottleneck.

> say I replace my sum line with `total += array[currentIndex] % size;` (thus also computing a modulo)

That remainder isn't part of a loop-carried dependency chain. (https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/)

Multiple remainder calculations can be in flight in parallel, since the next `array[idx]` load address only depends on a `+= jump` add instruction.

If you don't bottleneck on throughput limits, those remainder results could potentially be ready at 1/clock, with OoO exec overlapping dep chains between iterations. The only latency bottlenecks are the loop counter/index and `total += ...`, both of which are just integer adds with 1-cycle latency.

So really, the bottleneck is likely the throughput of the whole loop body, not those latency chains, unless you're testing on an extremely wide CPU that can get a lot done every cycle. (Surprised you don't get more slowdown from introducing the `%` at all. Unless `total` is getting optimized away after inlining, if you're not using the result.)
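
If there's any doubt about the result being optimized away, a minimal harness that consumes the return value rules it out (a sketch; `SIZE` and the zero-filled array are arbitrary choices for illustration):

```c
#include <stdio.h>
#include <stdlib.h>

int performance(int jump, int *array, int size);  // the function under test

int main(void) {
  enum { SIZE = 1024 };                  // arbitrary size for the sketch
  int *array = calloc(SIZE, sizeof *array);
  if (!array) return 1;
  int result = performance(0, array, SIZE);
  printf("total = %d\n", result);        // consuming the result keeps the loop alive
  free(array);
  return 0;
}
```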

Peter Cordes
  • (I'm assuming the ARM64 you tested on does out-of-order exec, so independent multiplies in different iterations can pipeline with each other, without the compiler having to unroll and software-pipeline, although that would be possible for a high-end in-order CPU that doesn't stall if you don't try to use high-latency results right away, i.e. allowing out-of-order completion as long as execution starts in order.) Anyway, it would be a good idea to say which specific ARM64 microarchitecture you tested on, like Cortex-A57 or Apple M1, and also compiler version. – Peter Cordes Oct 22 '21 at 02:46
  • Sorry, I made a mistake indeed, I meant `total += array[currentIndex] % size`. Thank you for this detailed answer; I am not familiar with some of the elements you cited, I'll search about them. – Pop Flamingo Oct 22 '21 at 03:19
  • @PopFlamingo: [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) has detailed discussion, and more links, about analyzing code for this kind of bottleneck vs. other kinds. That's why I linked it. For more of a general intro, see https://fgiesen.wordpress.com/2018/03/05/a-whirlwind-introduction-to-dataflow-graphs/ and [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899). – Peter Cordes Oct 22 '21 at 03:24
  • (OoO exec works similarly on ARM64 as on x86-64, the main difference being that x86-64 decodes some single instructions to multiple uops. ARM64 like all RISCs is more like 1:1 with instructions -> operations going through the pipeline.) – Peter Cordes Oct 22 '21 at 03:26
  • Of course you need an `msub` on top of your `sdiv` to get the remainder, which tacks on a few more cycles of latency. But signed `%` isn't any harder on ARM64 than unsigned; `sdiv / msub` gives the correct C result for all combinations of signs. – Nate Eldredge Oct 22 '21 at 03:51
  • @NateEldredge: I was talking about the case where it inlines into a caller with a compile-time-constant `size`, so it can use a multiplicative inverse. *That* still costs more instructions for signed, IIRC. But thanks, reworded that to only apply to the constant `size` case. – Peter Cordes Oct 22 '21 at 03:55
  • @PeterCordes: Got it, though glancing at compiler outputs for `x % const` they look pretty similar for signed and unsigned. I'll have to think through them more carefully. Also, I wonder what kind of CPU OP has; for my Cortex-A72, the Software Optimization Manual says that integer divides "block any subsequent divide operations until complete", so one might not see much improvement in the case where the `sdiv`s don't have the loop-carried dependency. – Nate Eldredge Oct 22 '21 at 14:34
  • @NateEldredge: Yeah, it's common for division execution units not to be pipelined at all. It was only somewhat recently that Intel / AMD started doing that in their high-end x86 chips. The original 10x difference being asked about was division vs no division, which didn't rule out `sdiv`. But seeing that % off the critical path was nearly as fast does rule out sdiv for most ARM uarches I'd assume, leaving the conclusion it must inline for a constant size. (Or that some optimization could be transforming `total += x % invariant`? Like wider accumulator and only modulo once at the end?) – Peter Cordes Oct 22 '21 at 14:45
  • For signed vs. unsigned `x%5`, note the extra `asr`/`sub`: https://godbolt.org/z/Tz4ojsqET. (Oh, and instead of msub, it's using `add w1, w1, w1, lsl 2` to multiply, then a `sub`, since `5` is constant and simple, and apparently that's lower latency.) – Peter Cordes Oct 22 '21 at 14:46
  • @NateEldredge @PeterCordes My original question wasn't really clear indeed. I'm seeing a 5x speed difference with `currentIndex = (currentIndex + jump) % size;` vs `total += array[currentIndex] % size;` – Pop Flamingo Oct 22 '21 at 18:58
  • I'm using an Apple M1 processor for my tests – Pop Flamingo Oct 22 '21 at 18:58
  • I also set `__attribute__((noinline))` on the function, though I don't know how much of an impact it would have here. – Pop Flamingo Oct 22 '21 at 19:02
  • Thank you again for all of these details, I'll take time to read all of this in more detail. – Pop Flamingo Oct 22 '21 at 19:03
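
To reproduce the signed-vs-unsigned comparison from the Godbolt link in the comments above, it's enough to compile something like this for arm64 at `-O3` and compare the asm (function names are placeholders):

```c
// The signed version needs an extra asr/sub to fix up the quotient for
// negative inputs; the unsigned version is just multiply, shift, subtract.
int srem5(int x) { return x % 5; }
unsigned urem5(unsigned x) { return x % 5; }
```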