It doesn't. The Von Neumann bottleneck refers to the fact that the processor and memory sit on opposite sides of a slow bus. If you want to compute something, you have to move the inputs across the bus to the processor, and then store the outputs back to memory when the computation completes. Your throughput is limited by the speed of the memory bus.
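To make that concrete, here's a minimal sketch in C of the kind of loop that runs into the bottleneck. The bandwidth figure in the comment is an assumed ballpark, not a measurement of any particular machine.

    #include <stddef.h>

    /* One multiply per element, but every element crosses the bus twice:
     * an 8-byte load of x[i] and an 8-byte store of y[i]. On a machine
     * with, say, ~50 GB/s of memory bandwidth (an assumed ballpark),
     * this tops out near 50e9 / 16, about 3 billion elements per second,
     * no matter how fast the core's arithmetic units are. */
    void scale(double *y, const double *x, double a, size_t n) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i];
    }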
Caches
Caches help to mitigate this problem for many workloads by keeping a small amount of frequently used data close to the processor. If your workload reuses a lot of data, as many do, then you'll benefit from caching. However, if you are processing a data set that's too big to fit in cache, or if your algorithm doesn't have good data reuse, it may not benefit much from caching. Think of processing a very large data set: you need to load all the data in and store it back out at least once. If you're lucky, your algorithm will only need to see each chunk of data once, and any reused values will stay in cache. If not, you may end up going over the memory bus much more than once per data element.
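Here's a rough sketch of the two extremes, again in C; the sizes mentioned in the comments are made up for illustration.

    #include <stddef.h>

    /* Streaming: each element is read exactly once, so the cache can't
     * help much; performance tracks memory bandwidth. */
    double stream_sum(const double *data, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += data[i];
        return s;
    }

    /* Reuse: a small working set (say, a few hundred KB that fits in L2)
     * is swept repeatedly; after the first pass, almost every access is
     * a cache hit and the memory bus is mostly idle. */
    double repeated_sum(const double *small_set, size_t m, int passes) {
        double s = 0.0;
        for (int p = 0; p < passes; p++)
            for (size_t i = 0; i < m; i++)
                s += small_set[i];
        return s;
    }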
Parallel processing
Parallel processing is a pretty broad term. Depending on how you do it, you may or may not get more bandwidth.
Shared Memory
The way shared memory processors are implemented today doesn't do much at all to solve the Von Neumann bottleneck. If anything, having more cores puts more strain on the bus, because now more processors need to fetch data from memory, and you need more bandwidth to feed all of them. Case in point: many parallel algorithms are memory-bound, and they can't make use of all the cores on modern multi-core chips, specifically because they can't fetch data fast enough. Core counts keep increasing, and bandwidth per core will likely keep decreasing, even if total bandwidth grows from one processor generation to the next.
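A STREAM-triad-style loop like the sketch below is the classic example (this assumes OpenMP; the kernel and names are just illustrative): it scales with thread count only until the shared bus saturates, after which extra cores mostly wait.

    #include <stddef.h>
    #include <omp.h>

    /* Memory-bound kernel: 2 loads + 1 store per iteration, 2 flops.
     * Adding threads helps only until the shared memory bus saturates;
     * past that point, extra cores mostly sit waiting on memory. */
    void triad(double *a, const double *b, const double *c,
               double scalar, long n) {
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];
    }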
NUMA
Modern memory buses are getting more and more complex, and you can do things to use them more effectively. For example, on NUMA machines, some memory banks are closer to some processors than others, and if you lay out your data accordingly, you can get more bandwidth than if you just blindly fetched from anywhere in RAM. But scaling shared memory is difficult -- see Distributed Shared Memory Machines for why it's going to be hard to scale shared memory machines to more than a few thousand cores.
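A common way to do that layout is first-touch placement, sketched below. This assumes Linux's default first-touch policy and threads pinned to cores (e.g. OMP_PROC_BIND=true); the kernel itself is just an example.

    #include <stdlib.h>
    #include <omp.h>

    /* Pages land on the NUMA node of the thread that first writes them,
     * so if the same static schedule is used later, most accesses hit
     * the local memory controller instead of the inter-socket link. */
    double *alloc_and_touch(long n) {
        double *x = malloc(n * sizeof *x);
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            x[i] = 0.0;                /* first touch places the page */
        return x;
    }

    void process(double *x, long n) {
        #pragma omp parallel for schedule(static)  /* same split as above */
        for (long i = 0; i < n; i++)
            x[i] = 2.0 * x[i] + 1.0;   /* mostly local-memory traffic */
    }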
Distributed Memory
Distributed memory machines are another type of parallel machine. These are often called clusters -- they're basically just a bunch of nodes on the same network working on a common task. You can get linear bandwidth scaling across a cluster if each processor fetches only from its local memory. But this requires you to lay out your data so that each processor has its own chunk. People call this data-parallel computing. If your problem is mostly data-parallel, you can probably make use of lots of processors, and you can use all your memory bandwidth in parallel. If you can't parallelize your workload, or if you can't break the data up into chunks so that each is processed mostly by one or a few nodes, then you're back to a sequential workload and you're still bound by the bandwidth of a single node's memory bus.
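Here's what that looks like as a minimal MPI sketch (the chunk size and the final reduction are placeholders): each rank touches only its local slice, so every node's memory bus works in parallel.

    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long local_n = 1 << 24;            /* elements per rank */
        double *chunk = malloc(local_n * sizeof *chunk);
        for (long i = 0; i < local_n; i++)       /* local init, local bus */
            chunk[i] = (double)(rank + i);

        double local_sum = 0.0, global_sum = 0.0;
        for (long i = 0; i < local_n; i++)       /* local compute */
            local_sum += chunk[i];

        /* One small message combines the per-rank results. */
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);

        free(chunk);
        MPI_Finalize();
        return 0;
    }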
Processor in Memory (PIM)
People have looked at alternative node architectures to address the Von Neumann bottleneck. The one most commonly cited is probably processor-in-memory, or PIM. In these architectures, to get around memory bus issues, you embed small processors in the memory itself, kind of like a cluster but at a much smaller scale. Each tiny core can usually perform a few arithmetic operations on its local data, so those operations run very fast. As with clusters, though, it can be hard to lay your data out in a way that actually makes this kind of processing useful, but some algorithms can exploit it.
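Purely as a conceptual sketch of the programming model (this is ordinary host C, not code for any real PIM device, and the bank layout is invented for illustration):

    #include <stddef.h>

    #define NUM_BANKS 16      /* invented for illustration */

    /* Data is partitioned across memory banks; each bank's embedded core
     * applies a simple operation to its own slice, so elements never
     * cross the main memory bus. */
    struct bank {
        double *local_data;   /* data physically resident in this bank */
        size_t  n;
    };

    /* On real PIM hardware the inner loop would run on all the per-bank
     * cores at once; here the outer loop just stands in for them. */
    void pim_scale_all(struct bank banks[NUM_BANKS], double a) {
        for (int b = 0; b < NUM_BANKS; b++)
            for (size_t i = 0; i < banks[b].n; i++)
                banks[b].local_data[i] *= a;
    }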
Summary
In summary, the Von Neumann bottleneck in a general-purpose computer, where the processor can perform any operation on data from any address in memory, comes from the fact that you have to move the data to the processor to compute anything.
Simply building a parallel machine doesn't fix the problem, especially if all your cores are on the same side of the memory bus. If you're willing to have many processors and spread them out so that they are closer to some data than other data, then you can exploit data parallelism to get more bandwidth. Clusters and PIM systems are harder to program than single-core CPUs, though, and not every problem is fundamentally data-parallel. So the Von Neumann bottleneck is likely to be with us for some time.