Is a multi-core CPU required to implement SIMD?
I found the following phrase "multiple processing elements" when reading Wikipedia about SIMD. So what's the difference between this phrase and "multi-core CPU"?
Is a multi-core CPU required to implement SIMD?
I found the following phrase "multiple processing elements" when reading Wikipedia about SIMD. So what's the difference between this phrase and "multi-core CPU"?
Every core has its own independent SIMD execution units. Using SIMD instructions in one core doesn't cost execution resources in other cores. Separate cores even on the same physical chip are independent so they can go to sleep separately to save power, and various other design reasons for keeping them isolated.
One exception that I'm aware of: AMD Bulldozer has two weak integer cores sharing a SIMD / FPU and sharing some cache. They call this a "cluster", and it's basically an alternative to Hyperthreading (SMT). See David Kanter's Bulldozer write-up on RealworldTech.
SIMD and multi-core are orthogonal: you can have multi-core without SIMD (maybe some ARM chips without an FPU / NEON), and you can have SIMD without multi-core.
Many examples of the latter, including most prominently early x86 chips like Pentium-MMX through Pentium III / Pentium 4 that has MMX / SSE1 / SSE2 but were single-core CPUs.
There are at least three different kinds of parallelism in programs:
Instruction-level parallelism: it's possible to overlap some of the work done by different instructions within the same single thread of execution, preserving the illusion of running every instruction one after another. Exploit it by building a pipelined CPU core, or superscalar (multiple instructions per clock), or even out-of-order execution. (See my answer on a question about that for details.)
When creating software: Expose this parallelism to the hardware by avoiding long dependency chains whenever possible. (e.g. replace sum += a[i++]
with sum1+=a[i]; sum2+=a[i+1]; i+=2;
: unroll with multiple accumulators). Or use arrays instead of linked lists, because the next address to load is computed cheaply, instead of being part of the data from memory you have to wait for on a cache miss. But mostly ILP is already there in "normal" code without doing anything special, and you build bigger / fancier hardware to find more of it, and increase the average instructions-per-clock.
Data parallelism: you need to do the same thing to every pixel of an image, or every sample in an audio file. (e.g. blend 2 images, or mix two audio streams). Exploit this by building parallel execution units into each CPU core so a single instruction can do 16 single-byte additions in parallel, giving you increased throughput with no increase in the amount of instructions you need to get through the CPU core per clock. This is SIMD: Single Instruction, Multiple Data.
Audio / video are the most well-known applications of this, where the speedups are massive because you can fit a lot of byte or 16-bit elements into a single fixed-width vector register.
Exploit SIMD by auto-vectorizing loops with smart compilers, or manually. SIMD turns sum += a[i];
into sum[0..3] += a[i+0..3]
(for 4 elements per vector, like with int
or float
with 32-bit vectors).
Thread/task-level parallelism: exploit with multi-core CPUs, expose to the hardware by manually writing multi-threaded code, or using OpenMP or other auto-parallelization tools to multi-thread a loop, or use a library function that starts multiple threads for a big matrix multiply or something.
Or more simply by running multiple separate programs at once. e.g. compile with make -j8
to keep 8 compile processes in flight at once. Coarse-grained task-level parallelism can also be exploited by running your workload on a cluster of multiple computers, or even distributed computing.
But multi-core CPUs make it possible / efficient to exploit fine-grained thread-level parallelism where tasks need to share lots of data (like a large array), or have low latency communication through shared memory. (e.g. with locks to protect different parts of shared data, or lockless programming.)
These three kinds of parallelism are orthogonal.
To sum a very large array of float
on a modern CPU:
You'd start one thread per CPU core, and have each core loop over a chunk of the array in shared memory. (Thread-level parallelism). This gives you a factor of 4 speedup, let's say. (Even that's maybe unrealistic because of memory bottlenecks, but you can imagine some other computationally intensive task that didn't require reading so much memory, running on a 28-core Xeon, or a dual-socket server with two of those chips...)
The code for each thread would use SIMD to do 4 or 8 adds per instruction, on each core separately. (SIMD). This gives you a factor of 4 or 8 speedup. (Or 16 with AVX512)
You'd unroll with let's say 8 vector accumulators to hide the latency of floating-point add. (ILP). Skylake's vaddps
instruction has a latency of 4 cycles and a throughput of 0.5 cycles (i.e. 2 per clock). So 8 accumulators is just barely enough to hide that latency and keep 8 FP add instructions in flight at once.
The total throughput gain over single-threaded scalar sum += a[i++]
is the product of all those speedup factors: 4 * 8 * 8
= 256x the throughput of a non-parallelized, non-vectorized, single-accumulator ILP-bottlenecked naive implementation like you'd get from gcc -O2
for a simple loop. clang -O3 -march=native -ffast-math
would give SIMD, and some ILP (because clang knows how to use multiple accumulators when unrolling, often using 4, unlike gcc.)
You'd need OpenMP or other auto-parallelization to exploit multiple cores.
Related: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for a more in-depth look at multiple accumulators for ILP, and SIMD, for an FMA loop.
No, each core normally can perform most general operations from the instruction set. But the "multiple processing elements" for SIMD operations just perform a single operation on different data (different bytes or words).
For example, each core of ARM Cortex-A53 microarchitecture has capability to run SIMD instructions independently of other cores, while such SIMD instruction sets as MMX, SSE and SSE2 were first introduced on single-core CPUs.
Yes. It does. But only from the marketing point of view. It would be difficult to sell uP or uC with no SIMD instructions.