
Do any of the commonly used consumer devices have a power/frequency ramp-up period before the SIMD subsystem either starts working at all or reaches full frequency? And is the stall measured in clock cycles or in microseconds?

Conversely, how many non-SIMD instructions can one typically execute before the SIMD performance is lost, or is such a condition detected by some other means?

I'm mostly interested in modern arm64 (Cortex-A53/A55/A75/A77 implementations, Apple M1).

EDIT

The Intel case seems to be reasonably covered in [SIMD instructions lowering CPU frequency](https://stackoverflow.com/q/56852812), which leads to further links stating a maximum 8.5 μs period for the "hard transition", during which the execution units are halted (if I understood correctly). It also contradicts my intuition: using AVX-512 instructions apparently requires the frequency to be ramped *down*.
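
To make the question concrete, here is the kind of probe I have in mind for the x86 case (a minimal sketch; the dependent-FMA chain, iteration counts, and rdtsc-based timing are my own choices, and rdtsc counts reference cycles rather than core cycles):

```c++
// Time consecutive blocks of dependent 256-bit FMAs; on a CPU with a
// warm-up period, the first blocks should be noticeably slower.
// Compile with e.g.: g++ -O2 -mavx2 -mfma
#include <immintrin.h>
#include <x86intrin.h>   // __rdtsc (GCC/Clang)
#include <cstdio>

int main() {
    __m256 acc = _mm256_set1_ps(1.0f);
    const __m256 mul = _mm256_set1_ps(1.000001f);
    const __m256 add = _mm256_set1_ps(1e-6f);

    for (int block = 0; block < 64; ++block) {
        unsigned long long t0 = __rdtsc();
        for (int i = 0; i < 1000; ++i)
            acc = _mm256_fmadd_ps(acc, mul, add);  // latency-bound chain
        unsigned long long t1 = __rdtsc();
        std::printf("block %2d: %6llu ref cycles\n", block, t1 - t0);
    }
    volatile float sink = _mm256_cvtss_f32(acc);   // keep acc live
    (void)sink;
    return 0;
}
```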

Aki Suihkonen
  • On modern x86, especially Intel, yes: there can be throughput penalties if the current frequency is too high for the type of SIMD instructions you're running (FP math vs. integer; 256-bit or 512-bit). [SIMD instructions lowering CPU frequency](https://stackoverflow.com/q/56852812) says a bit about the quarter-throughput period until the CPU decides to change frequency. (It was previously thought that this was "powering up the upper halves of the AVX units", but it's probably more like a mechanism to limit peak power / current.) – Peter Cordes Dec 12 '22 at 07:51
  • Thanks. That covers the Intel case (a hard transition can apparently take up to 8.5 μs), and it's exactly the opposite of what I thought: one needs to actually ramp the frequency *down*. – Aki Suihkonen Dec 12 '22 at 08:55
  • I don't have a definite answer for AArch64, but since most vector units are still just 128-bit, I wouldn't expect a significant ramp-up. Seeing how glibc now uses [SVE in memcpy](https://sourceware.org/git/?p=glibc.git;a=commit;h=9f298bfe1f183804bb54b54ff9071afc0494906c) starting at only 128 bytes, there doesn't seem to be an expectation that using vectors is expensive. – Homer512 Dec 12 '22 at 12:06
  • ARM publishes Software Optimization Guides for many of its cores, which would be your source for official information. I have the one for Cortex-A72 handy. It's fairly detailed, and it doesn't mention any such penalty, so I assume that means there isn't one. Apple unfortunately does not publish this information, so any results for M1 would have to be empirical (one possible probe is sketched below). – Nate Eldredge Dec 15 '22 at 02:38
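
For the AArch64 side, a minimal empirical probe along the lines Nate suggests might look like this (my own sketch: the dependent-FMA chain, iteration counts, and std::chrono-based timing are all assumptions, not anything ARM or Apple documents):

```c++
// Time consecutive blocks of dependent 128-bit NEON FMAs; a slow first
// block would indicate a warm-up period. Build natively on AArch64,
// e.g.: g++ -O2 warmup.cpp
#include <arm_neon.h>
#include <chrono>
#include <cstdio>

int main() {
    float32x4_t acc = vdupq_n_f32(1.0f);
    const float32x4_t mul = vdupq_n_f32(0.9999f);
    const float32x4_t add = vdupq_n_f32(1e-4f);

    for (int block = 0; block < 64; ++block) {
        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < 10000; ++i)
            acc = vfmaq_f32(add, acc, mul);   // acc = add + acc * mul
        auto t1 = std::chrono::steady_clock::now();
        long long ns = std::chrono::duration_cast<
            std::chrono::nanoseconds>(t1 - t0).count();
        std::printf("block %2d: %6lld ns\n", block, ns);
    }
    volatile float sink = vgetq_lane_f32(acc, 0);  // keep acc live
    (void)sink;
    return 0;
}
```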

1 Answer


This answer applies to x86 PCs, not ARM64.

> Do any of the commonly used consumer devices have a power/frequency ramp-up period before the SIMD subsystem either starts working at all or reaches full frequency?

The answer is “no” for “start at all”: SSE was designed as a replacement for the x87 FPU, and CPUs never power off the SIMD hardware entirely, because most programs occasionally use floating-point math.

However, Intel CPUs do power down some of that hardware. The first time a program uses 32-byte or 64-byte vectors, the instructions run much slower, until the core has transitioned to the proper power state.

- For Intel Sandy Bridge, Ivy Bridge, and Haswell, the penalty applies to 32-byte vectors.
- For Intel Skylake, the penalty applies to 32-byte and 64-byte vectors; the warm-up takes about 56,000 clock cycles (about 14 μs).
- For Intel Ice Lake and Tiger Lake, the penalty applies only to 64-byte vectors; the warm-up takes about 50,000 clock cycles.

During that warm-up period, throughput is halved and instructions have extra latency. Note that the warm-up is agnostic to the instruction set; it depends only on the vector width. AVX1, AVX2, and AVX-512 instructions that operate on 16-byte vectors always run at full speed.
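
A practical consequence for benchmarking (my own sketch, not something from Intel's documentation): issue throwaway 256-bit work before a timed region so the measurement doesn't include the warm-up. The 60,000-iteration count is only a guess sized against the ~56,000-cycle Skylake figure above.

```c++
// Warm-up helper for benchmarking 256-bit code paths. Assumes AVX2;
// the iteration count should be tuned per CPU.
#include <immintrin.h>
#include <cstdint>

alignas(32) static int32_t g_sink[8];

void avx_warmup() {
    __m256i v = _mm256_set1_epi32(1);
    for (int i = 0; i < 60000; ++i)
        v = _mm256_add_epi32(v, v);    // cheap dependent 256-bit op
    // Store the result so the loop isn't optimized away.
    _mm256_store_si256(reinterpret_cast<__m256i*>(g_sink), v);
}
```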

> how many non-SIMD instructions can one typically execute before the SIMD performance is lost

Skylake CPUs revert to the idle state after 2.7 million clock cycles (675 μs at 4 GHz) spent running only instructions of 16-byte or narrower SIMD width.
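
One way to probe that timeout empirically (a sketch under my own assumptions about burst sizes and rdtsc-based timing; rdtsc counts reference cycles, not core cycles):

```c++
// Time a 256-bit burst, busy-wait on scalar-only code for a chosen gap,
// then time the burst again. Once the gap exceeds the idle timeout, the
// second burst should show the warm-up penalty again.
// Compile with e.g.: g++ -O2 -mavx2 -mfma
#include <immintrin.h>
#include <x86intrin.h>
#include <cstdio>

// Duration of a burst of dependent 256-bit FMAs, in reference cycles.
static unsigned long long timed_burst() {
    __m256 acc = _mm256_set1_ps(1.0f);
    const __m256 mul = _mm256_set1_ps(1.000001f);
    unsigned long long t0 = __rdtsc();
    for (int i = 0; i < 10000; ++i)
        acc = _mm256_fmadd_ps(acc, mul, mul);
    unsigned long long t1 = __rdtsc();
    volatile float sink = _mm256_cvtss_f32(acc);   // keep acc live
    (void)sink;
    return t1 - t0;
}

int main() {
    for (unsigned long long gap = 1000; gap <= 64000000; gap *= 2) {
        for (int w = 0; w < 4; ++w) timed_burst(); // start warm
        unsigned long long t0 = __rdtsc();
        while (__rdtsc() - t0 < gap) { }           // scalar-only busy wait
        std::printf("gap %9llu ref cycles -> burst: %6llu\n",
                    gap, timed_burst());
    }
    return 0;
}
```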

For more information, see the microarchitecture guide by Agner Fog.

Soonts
  • As I [mentioned in comments](https://stackoverflow.com/questions/74767981/does-using-simd-have-an-initialisation-cost/74788707#comment131955755_74767981), it's probably not a matter of "powering off" upper halves of SIMD units; instead it's about limiting their throughput for power reasons if current frequency is higher than allowed for that type of SIMD instruction. [SIMD instructions lowering CPU frequency](https://stackoverflow.com/q/56852812) has details from Intel about that. – Peter Cordes Dec 13 '22 at 17:40
  • Agner Fog's hypothesis of powering up would explain the observations, but an experiment like limiting turbo frequency could distinguish it: if CPU frequency is artificially kept below the "L2 license" with software P-state limits or disabling turbo, heavy 512-bit FMA instructions should be able to run full throughput without a warm-up period. – Peter Cordes Dec 13 '22 at 17:42
  • 675 μs seems like an awkward length. For instance, I think Linux has timeslices on the order of 1 ms. So if a core is running two threads, where one uses 256/512-bit SIMD and the other does not, then the SIMD unit will go to sleep every time the non-SIMD thread runs (but first wasting power for half the timeslice), and then take the warm-up penalty every time it switches back. That doesn't sound fun. – Nate Eldredge Dec 15 '22 at 02:43
  • @PeterCordes I don’t disagree, but from the programmer’s perspective these two things (powering off pieces of SIMD units, and these power licenses) are pretty close. BTW, these power licenses are one of the two reasons why I’m using AMD CPUs since 2019, and the company I work for buys AMD processors as well. Another reason is Intel’s big.LITTLE architecture: our CPU-bound code is parallelized, our software requires AVX2 CPUs, and it works best on CPUs with many fast cores, like AMD’s Ryzen 9 5950X, as opposed to 8 fast + 16 slow ones, like Intel’s i9-13900K. – Soonts Dec 15 '22 at 21:00
  • On some SIMD-heavy workloads, like x264 video encoding, the E-cores apparently do pretty well. Their SIMD units are only 128 bits wide, but they have surprisingly high throughput for a lot of instructions. Although a lot of x264 benchmarks are with some insane settings, like extremely high rez and/or superfast settings, so it's more of a bandwidth test, not doing much work per byte. Anyway, sure, AMD CPUs are very good these days, which is great. I'm not at all surprised if they're the best choice for some workloads like you describe, and have good energy efficiency. – Peter Cordes Dec 15 '22 at 21:21
  • @PeterCordes Yeah, but SIMD workloads are different. I mostly work on CAM/CAE software. Our performance-critical CPU code is 90% FP64 math, 10% FP32 math, and little to no integer SIMD. I think x264 code is doing the opposite. I don’t think these video codecs are using floating point at all; their video pixels are 8–16-bit integers per channel. – Soonts Dec 15 '22 at 21:38
  • Yes, that's correct, all integer SIMD. The E-cores have 0.67c throughput for `vpaddb ymm`, (2 uops for any of 3 ports) or 1c throughput for `vfmadd ymm` (2 uops for any of 2 ports.) So either way the theoretical max throughput is half of a P-core. But smaller caches, and shared L2, can hurt some numeric workloads. And shallower out-of-order exec capabilities might also hurt some FP workloads that depend on that to hide latency. IDK. If you haven't benchmarked your application on E-cores, might be interesting to try it if you have one available. – Peter Cordes Dec 15 '22 at 22:02
  • The other consideration is that some algorithms don't scale perfectly with number of cores, like having more threads adds overhead. In that case, more slower cores isn't nearly as helpful, even if you are gaining 2x theoretical max FMA throughput from having 4 E cores instead of 1 P core. – Peter Cordes Dec 15 '22 at 22:04