More and more programs use threads for work that can reasonably be parallelized. But you're asking about single-threaded (or per-core) performance, and that's totally fine and interesting.
You're missing instruction-level parallelism (ILP) and increasing IPC (instructions per cycle).
Also SIMD (x86's SSE/AVX/AVX-512, or ARM's NEON/SVE) to get more work done per instruction, exploiting data parallelism in a (potentially small) loop that way instead of, or as well as, with threading. But that isn't a big factor for many applications.
Work per clock = instructions/cycle × work/instruction × threads (where threads is basically how many cores' clocks are ticking on your program at once). Even if threads is 1, the other two factors can still increase.
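For a concrete illustration with made-up but plausible numbers: a core sustaining 2 SIMD integer-add instructions per cycle, each AVX2 instruction operating on 8 int32 elements, does 16 element-adds per clock on a single thread; run the same loop on 8 such cores and the machine peak is 128 element-adds per clock, without the clock frequency changing at all.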
A problem with lots of data parallelism (e.g. summing an array, or adding 1 to every element) can expose that parallelism to the CPU in three ways: SIMD, instruction-level parallelism (e.g. unroll with multiple accumulators if there's a dependency chain like a sum), and thread-level parallelism.
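For instance, a minimal C sketch of the first two (the function names and float element type are just for illustration; error handling and the remainder loop are omitted):

```c
#include <stddef.h>

/* Naive sum: one long dependency chain through `total`,
 * so throughput is limited by FP-add latency. */
float sum_naive(const float *a, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Four independent accumulators expose ILP: four FP adds can be
 * in flight at once instead of one.  A compiler may also pack the
 * four accumulators into one SIMD register, stacking SIMD on top
 * of the ILP.  (Assumes n is a multiple of 4 to keep the sketch short.) */
float sum_unrolled(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Thread-level parallelism would then be a third, orthogonal layer on top: split the array across cores and sum the per-core results.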
These are all orthogonal, and some of them apply to problems that aren't data-parallel, just to different steps of a complicated program. IPC applies all the time. With good enough branch prediction, CPUs can see far ahead in the instruction stream and find parallel work to do (especially memory-level parallelism), as long as the code isn't doing something like traversing a linked list, where the next load address depends on the current load result. Then you bottleneck on load latency, with no memory-level parallelism (except for whatever work you're doing on each node).
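A sketch of that latency-bound pattern, with a made-up node type:

```c
struct node {
    struct node *next;
    int value;
};

/* Each iteration's load address comes from the previous iteration's
 * load result, so the loads cannot overlap: the loop runs at roughly
 * one node per load latency (a cache or DRAM round trip if the nodes
 * aren't hot in cache), no matter how wide the CPU is. */
long sum_list(const struct node *p) {
    long total = 0;
    while (p) {
        total += p->value;   /* cheap work, hidden under the load latency */
        p = p->next;         /* the serial dependency chain */
    }
    return total;
}
```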
Some major factors:
Larger caches improve hit rates and effective bandwidth, leading to fewer stalls. That raises average IPC. (Also smarter cache-replacement algorithms, like L3 adaptive replacement in Ivy Bridge.)
Actual increases in DRAM bandwidth help, too (especially with good HW prefetching), but DRAM bandwidth is shared between cores. L1/L2 caches are private in modern CPUs, and L3 bandwidth scales nicely as well when different cores access different parts of it. Still, DRAM often comes into play, especially in code that isn't carefully tuned for cache-blocking. DRAM latency is near constant in absolute nanoseconds (so getting "worse" when measured in core clock cycles), but memory clocks have been climbing significantly in the past decade.
Larger reorder buffers (ROB) and schedulers (RS) let CPUs find ILP over larger windows. Similarly, larger load and store buffers allow more memory-level parallelism, e.g. tracking more in-flight cache-miss loads in parallel, and buffering more stores before the core has to stall on a store that misses in cache.
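As a rough illustration of why tracking more in-flight misses matters (reusing the made-up struct node from the sketch above): walking two independent lists in one loop gives the out-of-order core two separate dependency chains, so it can have a cache miss from each chain outstanding at the same time instead of finishing one list before starting the other.

```c
/* The loads from `a` and `b` don't depend on each other, so a miss
 * from each chain can be in flight simultaneously (more memory-level
 * parallelism), limited by how many outstanding misses the load
 * buffers / miss-tracking hardware can hold. */
long sum_two_lists(const struct node *a, const struct node *b) {
    long total = 0;
    while (a && b) {
        total += a->value + b->value;
        a = a->next;
        b = b->next;
    }
    while (a) { total += a->value; a = a->next; }   /* drain leftovers */
    while (b) { total += b->value; b = b->next; }
    return total;
}
```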
Better branch prediction reduces how often speculative work has to be discarded when the CPU discovers it guessed the wrong path for an earlier branch.
Wider pipelines allow higher peak IPC. At best, in high-throughput code (not a lot of stalls, and lots of ILP), this can be sustained.
Otherwise it at least helps the core get to the next stall sooner, doing a burst of work, and clear out instructions waiting in the ROB when a cache-miss load does finally arrive, making room for the front-end to bring in new instructions that might contain more independent work. If execution of a loop condition can get far ahead of the actual work in the loop, a mispredict of the loop-exit branch might be resolved before the back-end runs out of work to do. So a max IPC higher than the steady-state bottleneck of a loop is useful for loops that aren't infinite.
See also