More and more programs use threads for work that can reasonably be parallelized. But you're asking about single-threaded (or per-core) performance, and that's totally fine and interesting.
You're missing instruction-level parallelism (ILP) and increasing IPC (instructions per cycle).
Also SIMD (x86's SSE/AVX/AVX-512, or ARM's NEON/SVE) to get more work done per instruction, exploiting data parallelism in a (potentially small) loop that way instead of, or as well as, with threading. But that isn't a big factor for many applications.
Work per clock = instructions/cycle × work/instruction × threads (where threads is basically how many cores' clocks are ticking on your program at once). Even if threads is 1, the other two factors can still increase.
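For a concrete illustration with made-up but plausible numbers: a core sustaining 2 SIMD integer-add instructions per cycle, each AVX2 instruction operating on 8 int32 elements, does 16 element-adds per clock on a single thread; run the same loop on 8 such cores and the machine peak is 128 element-adds per clock, without the clock frequency changing at all.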
A problem with lots of data parallelism (e.g. summing an array, or adding 1 to every element) can expose that parallelism to the CPU in three ways: SIMD, instruction-level parallelism (e.g. unroll with multiple accumulators if there's a dependency chain like a sum), and thread-level parallelism.
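For instance, a minimal C sketch of the first two (the function names and float element type are just for illustration; error handling and the remainder loop are omitted):

```c
#include <stddef.h>

/* Naive sum: one long dependency chain through `total`,
 * so throughput is limited by FP-add latency. */
float sum_naive(const float *a, size_t n) {
    float total = 0.0f;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Four independent accumulators expose ILP: four FP adds can be
 * in flight at once instead of one.  A compiler may also pack the
 * four accumulators into one SIMD register, stacking SIMD on top
 * of the ILP.  (Assumes n is a multiple of 4 to keep the sketch short.) */
float sum_unrolled(const float *a, size_t n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (size_t i = 0; i < n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```

Thread-level parallelism would then be a third, orthogonal layer on top: split the array across cores and sum the per-core results.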
These are all orthogonal, and some of them apply to problems that aren't data-parallel, just to different steps of a complicated program. IPC applies all the time. With good enough branch prediction, CPUs can see far ahead in the instruction stream and find parallel work to do (especially memory-level parallelism), as long as the code isn't doing something like traversing a linked list, where the next load address depends on the current load result. Then you bottleneck on load latency, with no memory-level parallelism (except for whatever work you're doing on each node).
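A sketch of that latency-bound pattern, with a made-up node type:

```c
struct node {
    struct node *next;
    int value;
};

/* Each iteration's load address comes from the previous iteration's
 * load result, so the loads cannot overlap: the loop runs at roughly
 * one node per load latency (a cache or DRAM round trip if the nodes
 * aren't hot in cache), no matter how wide the CPU is. */
long sum_list(const struct node *p) {
    long total = 0;
    while (p) {
        total += p->value;   /* cheap work, hidden under the load latency */
        p = p->next;         /* the serial dependency chain */
    }
    return total;
}
```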
Some major factors:
Larger caches improve hit rates and effective bandwidth, leading to fewer stalls. That raises average IPC. (Also smarter cache-replacement algorithms, like L3 adaptive replacement in Ivy Bridge.)
Actual increases in DRAM bandwidth help, too (especially with good HW prefetching), but DRAM bandwidth is shared between cores. L1/L2 caches are private in modern CPUs, and L3 bandwidth scales nicely as well when different cores access different parts of it. Still, DRAM often comes into play, especially in code that isn't carefully tuned for cache-blocking. DRAM latency is near constant in absolute nanoseconds (so getting "worse" when measured in core clock cycles), but memory clocks have been climbing significantly in the past decade.
Larger reorder buffers (ROB) and schedulers (RS) let CPUs find ILP over larger windows. Similarly, larger load and store buffers allow more memory-level parallelism, e.g. tracking more in-flight cache-miss loads in parallel, and buffering more stores before the core has to stall on a store that misses in cache.
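As a rough illustration of why tracking more in-flight misses matters (reusing the made-up struct node from the sketch above): walking two independent lists in one loop gives the out-of-order core two separate dependency chains, so it can have a cache miss from each chain outstanding at the same time instead of finishing one list before starting the other.

```c
/* The loads from `a` and `b` don't depend on each other, so a miss
 * from each chain can be in flight simultaneously (more memory-level
 * parallelism), limited by how many outstanding misses the load
 * buffers / miss-tracking hardware can hold. */
long sum_two_lists(const struct node *a, const struct node *b) {
    long total = 0;
    while (a && b) {
        total += a->value + b->value;
        a = a->next;
        b = b->next;
    }
    while (a) { total += a->value; a = a->next; }   /* drain leftovers */
    while (b) { total += b->value; b = b->next; }
    return total;
}
```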
Better branch prediction reduces how often speculative work has to be discarded when the CPU discovers it guessed the wrong path for an earlier branch.
Wider pipelines allow higher peak IPC. At best, in high-throughput code (not a lot of stalls, and lots of ILP), this can be sustained.
Otherwise it at least helps the core get to the next stall sooner, doing a burst of work, and clear out instructions waiting in the ROB when a cache-miss load does finally arrive, making room for the front-end to bring in new instructions that might contain more independent work. If execution of a loop condition can get far ahead of the actual work in the loop, a mispredict of the loop-exit branch might be resolved before the back-end runs out of work to do. So a max IPC higher than the steady-state bottleneck of a loop is useful for loops that aren't infinite.
See also