
From the naming and this article I feel the answer is no, but I don't understand why. The bottleneck is how fast you can fetch data from memory. Whether you can fetch instructions at the same time doesn't seem to matter. Don't you still have to wait until the data arrives? Suppose fetching data takes 100 CPU cycles and executing an instruction takes 1; the ability to do that 1 cycle in advance doesn't seem like a huge improvement. What am I missing here?

Context: I came across this article saying the Spectre bug is not going to be fixed because of speculative execution. I think speculative execution, for example branch prediction, makes sense for the Harvard architecture too. Am I right? I understand speculative execution is more beneficial for the von Neumann architecture, but by how much? Can someone give a rough number? To what extent can we say Spectre will stay because of the von Neumann architecture?

  • Yes, the von Neumann bottleneck is that all computation happens in the CPU, not in parallel with the data in memory. It's still a problem even when executing in a tight loop that fits in I-cache, so there's no code-fetch eating up bandwidth. – Peter Cordes Feb 26 '19 at 09:49
  • Thanks for the reply. My confusion comes from reading that data fetching is much slower than instruction fetching, usually taking ~100 CPU cycles. Without data, an instruction still has to hang in the execution phase, right? So the gain seems tiny (1%?) for alleviating the bottleneck. I guess I'm missing something here, but not sure what. – tcya Feb 27 '19 at 11:05
  • Sorry, I missed your point. Actually you agree with me: the Harvard architecture still has the bottleneck. Then it's not quite fair to call it the von Neumann bottleneck. Feel bad for von Neumann :) – tcya Feb 27 '19 at 11:19

2 Answers


The term "von Neumann bottleneck" isn't just talking about Harvard vs. von Neumann architectures. It's talking about the entire idea of stored-program computers, which John von Neumann invented.

(Depending on context, some people may use it to mean the competition between code-fetch and data access; that does exacerbate the overall memory bottleneck without split caches. Or perhaps I'm mixing up terminology, and the more general memory bottleneck for processors that I discuss in the rest of this answer shouldn't be called the von Neumann bottleneck, although it is a real thing. See the memory-wall section in Modern Microprocessors: A 90-Minute Guide!)


The von Neumann bottleneck applies equally to both kinds of stored-program computers. And even to fixed-function (not stored-program) processors that keep data in RAM. (Old GPUs without programmable shaders are basically fixed-function but can still have memory bottlenecks accessing data).

Usually it's most relevant when looping over big arrays or pointer-based data structures like linked lists, so the code fits in an instruction cache and doesn't have to be fetched during data access anyway. (Computers too old to even have caches were just plain slow, and I'm not interested in arguing semantics of whether slowness even when there is temporal and/or spatial locality is a von Neumann bottleneck for them or not.)
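To make that concrete, here's a minimal C sketch of the two cases (the function names and sizes are illustrative, not from any particular benchmark). In both loops the handful of instructions stays hot in L1i, so essentially all of the memory traffic is data:

```c
#include <stddef.h>

/* Bandwidth-bound: streaming a big array. Hardware prefetchers can
 * keep data flowing, but the loop still can't go faster than memory
 * can deliver it. */
long sum_array(const long *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Latency-bound: pointer chasing. Each load's address depends on the
 * previous load, so a cache miss stalls the chain for the full memory
 * latency (~100 cycles or more to DRAM). */
struct node { struct node *next; long payload; };

long sum_list(const struct node *p) {
    long total = 0;
    for (; p != NULL; p = p->next)
        total += p->payload;
    return total;
}
```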

https://whatis.techtarget.com/definition/von-Neumann-bottleneck points out that caching and prefetching are part of how we work around the von Neumann bottleneck, and that faster / wider buses make the bottleneck wider. But only stuff like Processor-in-Memory / https://en.wikipedia.org/wiki/Computational_RAM truly solves it: an ALU is attached to memory cells directly, so there is no central bottleneck between computation and storage, and computational capacity scales with storage size. But von Neumann with a CPU and separate RAM works well enough for most things that it's not going away any time soon (given large caches and smart hardware prefetching, and out-of-order execution and/or SMT to hide memory latency).
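As a small illustration of the prefetching workaround, GCC and Clang expose software prefetch via `__builtin_prefetch`. The prefetch distance of 16 below is a made-up tuning parameter, and for a simple sequential stream like this the hardware prefetchers would normally do the job on their own; software prefetch mostly pays off on irregular access patterns:

```c
#include <stddef.h>

long sum_with_prefetch(const long *a, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        /* Ask for data ~16 iterations ahead so it (hopefully)
         * arrives in cache before we need it, hiding some of the
         * memory latency behind useful work. */
        if (i + 16 < n)
            __builtin_prefetch(&a[i + 16], /* rw = read */ 0, /* high temporal locality */ 3);
        total += a[i];
    }
    return total;
}
```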


John von Neumann was a pioneer in early computing, and it's not surprising his name is attached to two different concepts.

Harvard vs. von Neumann is about whether program memory is in a separate address space (and a separate bus); that's an implementation detail for stored-program computers.


Spectre: yes, Spectre is just about data access and branch prediction, not accessing code as data or vice versa. If you can get a Spectre attack into program memory in a Harvard architecture in the first place (e.g. by running a normal program that makes system calls), then it can run the same as on a von Neumann.
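For reference, this is the well-known Spectre v1 (bounds-check bypass) gadget from Kocher et al.'s paper, lightly adapted. Note that everything it touches is data: a mistrained branch predictor runs the body speculatively with an out-of-range `x`, and the dependent load leaves a data-cache footprint an attacker can time. Splitting program memory from data memory changes none of that:

```c
#include <stddef.h>
#include <stdint.h>

uint8_t array1[16];
uint8_t array2[256 * 512];
size_t  array1_size = 16;
uint8_t temp;   /* global so the load isn't optimized away */

void victim_function(size_t x) {
    if (x < array1_size)                   /* predictor mistrained to say "taken" */
        temp &= array2[array1[x] * 512];   /* speculative load encodes array1[x]
                                              in which cache line gets touched */
}
```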

> I understand speculative execution is more beneficial for von Neumann architecture, but by how much?

What? No. There's no connection here at all. Of course, all high-performance modern CPUs are von Neumann. (With split L1i / L1d caches, but program and data memory are not separate; they share the same address space and physical storage. A split L1 is often called "modified Harvard", which makes some sense on ISAs other than x86, where L1i isn't coherent with data caches and you need special flushing instructions before you can execute newly-stored bytes as code. x86 has coherent instruction caches, so there it's very much an implementation detail.)
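Those flushing instructions are what, for example, a JIT has to issue on a non-coherent ISA. A hedged sketch using GCC/Clang's `__builtin___clear_cache`; the helper name and its assumption that `buf` already points at writable and executable memory (e.g. from `mmap`) are illustrative:

```c
#include <stddef.h>
#include <string.h>

typedef int (*jit_fn)(void);

/* Store machine code as data, then make it safe to fetch as code.
 * On x86 the builtin compiles to (almost) nothing because L1i is kept
 * coherent with stores; on e.g. many ARM cores it emits the required
 * cache-maintenance and barrier instructions. */
jit_fn install_code(char *buf, const char *code, size_t len) {
    memcpy(buf, code, len);
    __builtin___clear_cache(buf, buf + len);
    return (jit_fn)(void *)buf;   /* object-to-function cast: common JIT practice */
}
```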

Some embedded CPUs are true Harvard, with program memory connected to Flash and data address space mapped to RAM. But often those CPUs are pretty low performance. Pipelined but in-order, and only using branch prediction for instruction prefetch.

But if you did build a very high-performance CPU with fully separate program and data memories (so copying from one to the other would have to go through the CPU), it would be basically no different from modern high-performance CPUs. L1i cache misses are rare, and whether they compete with data access is not very significant.

I guess you'd have split caches all the way down, though; normally modern CPUs have unified L2 and L3 caches, so depending on the workload (big code size or not) more or less of L2 and L3 can end up holding code. Maybe you'd still use unified caches with one extra bit in the tag to distinguish code addresses from data addresses, allowing your big outer caches to be competitively shared between the two address-spaces.
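A sketch of what such a tag might look like, purely illustrative (the field widths are invented):

```c
#include <stdint.h>

/* One extra tag bit lets a unified outer cache hold lines from both
 * address spaces of a hypothetical high-performance Harvard machine,
 * competitively shared instead of statically partitioned. */
struct cache_tag {
    uint64_t addr_tag : 40;   /* high bits of the address within its space */
    uint64_t is_code  :  1;   /* 0 = data space, 1 = program space */
    uint64_t valid    :  1;
    uint64_t dirty    :  1;   /* meaningful only for data-space lines */
};
```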

Peter Cordes
  • Great answer. Distinguishing stored-program computer from von Neumann architecture clears most confusions. But I blame others:) Articles like [Wikipedia](https://en.wikipedia.org/wiki/Von_Neumann_architecture#Von_Neumann_bottleneck) (_The shared bus between the program memory and data memory leads to the von Neumann bottleneck_) and the one you [cited](https://whatis.techtarget.com/definition/von-Neumann-bottleneck) (_Von Neumann came up with the idea behind the stored program computer, our standard model, which is also known as the von Neumann architecture_) misled me. Thank you very much. – tcya Feb 28 '19 at 02:40
  • *Technically*, PiM does not **solve** the bandwidth bottleneck. On-chip bandwidth is much less expensive, but it is not free/unlimited. PiM designs typically have lower-performance processors sufficient for the kinds of bandwidth-intensive workloads attractive for PiM; increasing finer-grained on-chip bandwidth also reduces capacity. The capacity limit of a single DRAM chip is also a significant constraint. PnearM increases capacity at moderate bandwidth cost. "computational capacity scales with storage size" for NUMA also, the bandwidth is just a wee bit lower☺. Just nitpicking. – Paul A. Clayton Feb 28 '19 at 03:16
  • @tcya: yeah, the wikipedia phrasing makes it sound like code-fetch is a significant burden, and/or that simply sharing the bus is the only cause of the problem. That might have been true for ancient CPUs like 8086 (no cache, instruction fetch was one of the main bottlenecks), but maybe not even then: an instruction that accessed memory took so many extra cycles that there would be plenty of time for instruction prefetch. http://www.oocities.org/mc_introtocomputers/Instruction_Timing.PDF. vs. now https://agner.org/optimize/ where 3 to 4 instructions/clock is normal. – Peter Cordes Feb 28 '19 at 03:18
  • @PaulA.Clayton: I hadn't looked into what PIM actually did; I had assumed there was an ALU with each DRAM cell or group of cells, but that doesn't actually make sense because many operations have 2 memory inputs. Good nitpick since I was just making stuff up rather than trying to simplify actual details :P – Peter Cordes Feb 28 '19 at 03:21

The Harvard architecture, with separate instruction and data memories, is a mitigation of the von Neumann bottleneck. Backus' original definition of the bottleneck addresses a slightly more general problem than just instruction or data fetch: it talks about the CPU/memory interface. In the paragraph before the money quote, Backus talks about looking at the actual traffic on this bus:

> Ironically, a large part of the traffic is not useful data but merely names of data; that is, most of it consists of names, as well as operations and data used only to compute such names.

In a Harvard architecture with separate I/D buses, that will not change: the traffic will still largely consist of names.

So the answer is a hard no. The Harvard architecture mitigates the von Neumann bottleneck but it doesn't solve it. Bluntly, it's a faster von Neumann bottleneck.
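One way to see the "names" point in code (an illustrative reading, not Backus's own example): in the gather below, every `idx[i]` pulled across the bus is pure name traffic, fetched only to compute the address of the data the program actually wants, and all of it stays on the data side of a Harvard split:

```c
#include <stddef.h>

double gather_sum(const double *a, const int *idx, size_t n) {
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += a[idx[i]];   /* idx[i] is a name; a[idx[i]] is the data */
    return total;
}
```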

Olsonist