9

Is a multi-core CPU required to implement SIMD?

I came across the phrase "multiple processing elements" while reading the Wikipedia article about SIMD. What's the difference between those "multiple processing elements" and the cores of a multi-core CPU?


Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Jason Yu
  • 1,886
  • 16
  • 26
  • 9
    No. "SI" = single instruction, "MD" = multiple data. The core needs to have, say, 4 multiplier circuits available so the single instruction can multiply 4 numbers at the same time. The parallelism is in the core itself. – Hans Passant Apr 13 '18 at 06:43
  • 1
    @ShreckYe I think the intent of the question was whether implementing SIMD requires a multi-core CPU, not whether a multi-core CPU requires SIMD. The original was a grammatical mess that needed fixing, but you picked the other interpretation, which doesn't match the answers. (I edited it to ask the question the answers are answering.) – Peter Cordes Aug 09 '20 at 00:56
  • @PeterCordes Got it. Thanks for pointing that out. – Shreck Ye Aug 09 '20 at 12:19

3 Answers

12

Every core has its own independent SIMD execution units; using SIMD instructions in one core doesn't cost execution resources in other cores. Separate cores, even on the same physical chip, are independent so they can go to sleep separately to save power, among various other design reasons for keeping them isolated.

One exception that I'm aware of: AMD Bulldozer has two weak integer cores sharing a SIMD / FPU and sharing some cache. They call this a "cluster", and it's basically an alternative to Hyperthreading (SMT). See David Kanter's Bulldozer write-up on RealworldTech.

SIMD and multi-core are orthogonal: you can have multi-core without SIMD (maybe some ARM chips without an FPU / NEON), and you can have SIMD without multi-core.

There are many examples of the latter, most prominently early x86 chips from the Pentium-MMX through the Pentium III / Pentium 4, which had MMX / SSE1 / SSE2 but were single-core CPUs.


There are at least three different kinds of parallelism in programs:

  • Instruction-level parallelism: it's possible to overlap some of the work done by different instructions within the same single thread of execution, preserving the illusion of running every instruction one after another. Exploit it by building a pipelined CPU core, or superscalar (multiple instructions per clock), or even out-of-order execution. (See my answer on a question about that for details.)

    When creating software: expose this parallelism to the hardware by avoiding long dependency chains whenever possible, e.g. replace sum += a[i++] with sum1 += a[i]; sum2 += a[i+1]; i += 2; (unrolling with multiple accumulators; see the plain-C sketch after this list). Or use arrays instead of linked lists, because the next address to load can be computed cheaply instead of being part of the data from memory that you have to wait for on a cache miss. But mostly ILP is already there in "normal" code without doing anything special; you build bigger / fancier hardware to find more of it and increase the average instructions per clock.

  • Data parallelism: you need to do the same thing to every pixel of an image, or every sample in an audio file (e.g. blend 2 images, or mix two audio streams). Exploit this by building parallel execution units into each CPU core, so a single instruction can do 16 single-byte additions in parallel, giving you increased throughput with no increase in the number of instructions the CPU core has to get through per clock. This is SIMD: Single Instruction, Multiple Data.

    Audio / video are the most well-known applications of this, where the speedups are massive because you can fit a lot of byte or 16-bit elements into a single fixed-width vector register.

    Exploit SIMD by auto-vectorizing loops with smart compilers, or manually. SIMD turns sum += a[i]; into sum[0..3] += a[i+0..3] (for 4 elements per vector, e.g. 32-bit int or float in a 128-bit vector); see the intrinsics sketch after this list.

  • Thread/task-level parallelism: exploit it with multi-core CPUs. Expose it to the hardware by manually writing multi-threaded code, by using OpenMP or other auto-parallelization tools to multi-thread a loop, or by using a library function that starts multiple threads for a big matrix multiply or something.

    Or more simply, run multiple separate programs at once, e.g. compile with make -j8 to keep 8 compile processes in flight at once. Coarse-grained task-level parallelism can also be exploited by running your workload on a cluster of multiple computers, or even with distributed computing.

    But multi-core CPUs make it possible / efficient to exploit fine-grained thread-level parallelism where tasks need to share lots of data (like a large array), or have low latency communication through shared memory. (e.g. with locks to protect different parts of shared data, or lockless programming.)
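Here's a minimal plain-C sketch of the multiple-accumulator unrolling from the first bullet (function names are made up for illustration; real code would use more accumulators, as the worked example below discusses):

```c
#include <stddef.h>

/* One accumulator: each add must wait for the previous add's result,
   so throughput is limited by FP-add latency. */
float sum_naive(const float *a, size_t n) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Two independent dependency chains: the adds into sum1 and sum2 can
   overlap inside a pipelined / superscalar core. */
float sum_unrolled(const float *a, size_t n) {
    float sum1 = 0.0f, sum2 = 0.0f;
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        sum1 += a[i];
        sum2 += a[i + 1];
    }
    if (i < n)                  /* odd n: pick up the last element */
        sum1 += a[i];
    return sum1 + sum2;
}
```

Note that this changes the order of the FP additions, which is why compilers only make this transformation on their own when allowed to reassociate FP math (e.g. with -ffast-math, as mentioned below).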
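And a sketch of the SIMD version from the second bullet, using x86 SSE intrinsics (this assumes SSE3 for _mm_movehdup_ps, and for brevity assumes n is a multiple of 4; real code would need a scalar cleanup loop):

```c
#include <immintrin.h>
#include <stddef.h>

/* sum[0..3] += a[i+0..3]: four float adds per instruction. */
float sum_sse(const float *a, size_t n) {
    __m128 vsum = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
        vsum = _mm_add_ps(vsum, _mm_loadu_ps(a + i));

    /* Horizontal sum of the 4 lanes, needed only once at the end. */
    __m128 shuf = _mm_movehdup_ps(vsum);      /* lanes [1,1,3,3]      */
    __m128 sums = _mm_add_ps(vsum, shuf);     /* [0+1, _, 2+3, _]     */
    shuf = _mm_movehl_ps(shuf, sums);         /* move 2+3 into lane 0 */
    sums = _mm_add_ss(sums, shuf);            /* (0+1) + (2+3)        */
    return _mm_cvtss_f32(sums);
}
```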

These three kinds of parallelism are orthogonal.

To sum a very large array of float on a modern CPU:

You'd start one thread per CPU core, and have each core loop over a chunk of the array in shared memory. (Thread-level parallelism). This gives you a factor of 4 speedup, let's say. (Even that's maybe unrealistic because of memory bottlenecks, but you can imagine some other computationally intensive task that didn't require reading so much memory, running on a 28-core Xeon, or a dual-socket server with two of those chips...)

The code for each thread would use SIMD to do 4 or 8 adds per instruction, on each core separately. (SIMD). This gives you a factor of 4 or 8 speedup. (Or 16 with AVX512)

You'd unroll with, let's say, 8 vector accumulators to hide the latency of floating-point add. (ILP). Skylake's vaddps instruction has a latency of 4 cycles and a throughput of one per 0.5 cycles (i.e. 2 per clock), so 8 accumulators is just barely enough to hide that latency and keep 8 FP add instructions in flight at once.

The total throughput gain over single-threaded scalar sum += a[i++] is the product of all those speedup factors: 4 * 8 * 8 = 256x the throughput of a non-parallelized, non-vectorized, single-accumulator ILP-bottlenecked naive implementation like you'd get from gcc -O2 for a simple loop. clang -O3 -march=native -ffast-math would give SIMD, and some ILP (because clang knows how to use multiple accumulators when unrolling, often using 4, unlike gcc.)

You'd need OpenMP or other auto-parallelization to exploit multiple cores.
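As a rough sketch of how that could look with OpenMP (the combined parallel for simd construct is standard OpenMP 4.0; the flags and achieved speedup are illustrative):

```c
#include <stddef.h>

/* Threads (parallel for) + SIMD (simd) + ILP: the reduction clause
   gives each thread a private accumulator, which the compiler can
   further split across vector lanes and multiple registers when it's
   allowed to reassociate FP math.
   Compile with e.g.: gcc -O3 -march=native -ffast-math -fopenmp */
float sum_all_levels(const float *a, size_t n) {
    float sum = 0.0f;
    #pragma omp parallel for simd reduction(+:sum)
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```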

Related: Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? for a more in-depth look at multiple accumulators for ILP, and SIMD, for an FMA loop.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • Related: [Does AVX/AVX2 "exists" on each core?](https://stackoverflow.com/q/66295099) (yes). – Peter Cordes Aug 30 '22 at 15:08
  • correction: "data parallelism" is usually defined more generally as having the same work to do over lots of elements. As I discussed later in the answer, you can use ILP, SIMD, and threading to exploit that data parallelism. So the middle bullet should be "SIMD parallelism", for getting more work done per slot in the instruction pipeline. – Peter Cordes Jul 23 '23 at 22:01
7

No. Each core can normally perform most general operations from the instruction set on its own. The "multiple processing elements" for SIMD operations just perform a single operation on different pieces of data (different bytes or words) within one core.

For example, each core of the ARM Cortex-A53 microarchitecture can run SIMD instructions independently of the other cores, while SIMD instruction sets such as MMX, SSE and SSE2 were first introduced on single-core CPUs.

Andriy Makukha
  • 7,580
  • 1
  • 38
  • 49
-2

Yes, it does, but only from a marketing point of view. It would be difficult to sell a uP or uC with no SIMD instructions.

0___________
  • 60,014
  • 4
  • 34
  • 74
  • 1
    Your last sentence doesn't make sense. AVR microcontrollers for example are single-core without SIMD. Did you mean it would be hard to sell a single-core CPU *with* SIMD? Sort of true these days, but Intel certainly had fun selling Pentium-MMX (their first SIMD CPU, and one of the earliest SIMD implementations in any mainstream CPU). I still remember their ad campaign with people in colourful clean-room "bunny" suits https://www.intel.com/pressroom/archive/releases/1997/CN12297A.HTM, and google for `intel pentium mmx ads`. SIMD predates multi-core for mainstream CPUs by years. – Peter Cordes Apr 13 '18 at 07:03
  • Are AVRs modern, or multi-core? The 8051 didn't have SIMD either :). No **modern** uP or uC design (single- or multi-core) will be accepted by the bosses without SIMD instructions, as users need them (in uCs, for DSP). – 0___________ Apr 13 '18 at 07:08
  • But even these days, [ARM Cortex-A17 for example](https://en.wikipedia.org/wiki/ARM_Cortex-A17) is available in 1 to 4 core configurations, with NEON for integer / FP SIMD. I think NEON instructions are not totally rare even in low-end chips: NEON lets a chip copy more memory per instruction, and that matters on simple in-order chips. – Peter Cordes Apr 13 '18 at 07:08
  • [AVR dates from 1997, and is still widely used](https://en.wikipedia.org/wiki/Atmel_AVR), for example in https://www.arduino.cc/ boards, and you can find lots of AVR questions on SO. It's a modern 8-bit RISC ucontroller ISA with 32 architectural registers. Existing implementations are pipelined in-order, usually with some on-chip SRAM and optional external memory. Instruction set: https://www.microchip.com/webdoc/avrassembler/avrassembler.wb_ADC.html – Peter Cordes Apr 13 '18 at 07:13
  • The $2.50, 72 MHz, single-core STM32F303 has them. – 0___________ Apr 13 '18 at 07:17
  • Yeah, NEON is common, but not needed everywhere. There are definitely low-end ARM chips without NEON, too. For example, [ARM Cortex-R8](https://en.wikipedia.org/wiki/ARM_Cortex-R) was announced in 2016, and even the FPU is optional. [ARM's main NEON page](https://developer.arm.com/technologies/neon) says it's "an extension to the Armv8-A and Armv8-R profiles." so it's optional even on modern ARMv8-A. – Peter Cordes Apr 13 '18 at 07:23
  • AVR is not modern, it is only widely used. I used to use them as well, but nowadays there is no reason (except maybe the tiny 8-pin ones) to use them in new designs. This is my opinion, but it is hard to find a reason to use uCs that are more expensive, slower, less energy-efficient, and have obsolete peripherals. – 0___________ Apr 13 '18 at 07:24
  • Anyway, you're answering the opposite of what the question asked. The existence of single-core CPUs *with* SIMD is what confirms that SIMD is separate from multi-core. There are plenty of single-core ARM and MIPS CPUs with SIMD, as well as boatloads of x86 chips with MMX/SSE/SSE2 before Intel / AMD started making dual core chips. And obviously the real answer is that each core has its own SIMD execution units. – Peter Cordes Apr 13 '18 at 07:24
  • 1
    I haven't done much embedded stuff, I don't know that much about the power / speed / cost tradeoffs of different ucontrollers. I mostly just see asm questions about them on SO and read out of interest. I found [Why are Atmel AVRs so popular?](//electronics.stackexchange.com/q/2324) which was interesting. It seems they may be better for low-power applications than ARM, but some commenters say they're best for hobbyists and commercial designs more often use PIC or MSP430. – Peter Cordes Apr 13 '18 at 07:45
  • @PeterCordes Do not trust everything you read online. AVRs are popular because they have been on the market for a very long time and offer lots of very small uCs (8 pins, very limited resources) that are good for applications like kettles, drills, coffee machines, washers, etc. The MSP430 was interesting 10 years ago, but it lost the race due to very buggy silicon and many limitations. PICs are buggy as well; the developer should start reading the documentation from the errata to avoid serious disappointments. Another problem is Microchip's policy of limiting the availability of cheap or free tools. – 0___________ Apr 13 '18 at 08:18
  • I personally do lots of uC projects, and for me ARM is the best alternative. Silicon is available from many vendors, with different hardware and peripherals, so it is always possible to choose the best solution for the task. You wrote about the ARM info pages: the FPU is optional, but ARM does not sell any uCs itself. The actual manufacturers have to add those optional features to be competitive, and as I wrote, without them (FPU, SIMD, etc.) the product will not be good enough to survive on the market. – 0___________ Apr 13 '18 at 08:21
  • PS - Love downvoters who do not write why they downvote. Great approach. – 0___________ Apr 13 '18 at 08:22
  • 1
    I downvoted this for now because I think you misread the question. I think the OP is asking whether multiple cores are required for SIMD, not whether SIMD is required on a multi-core CPU. (i.e. they don't understand that thread-level parallelism is different from data-level parallelism). Like I've said multiple times, you're answering the opposite of the question. (Also, clearly there are non-negligible sales of processors without SIMD; apparently people want them for very-low-end hobby projects like AVR, or to run code that didn't auto-vectorize and didn't get manually vectorized.) – Peter Cordes Apr 13 '18 at 08:30