It doesn't. The Von Neumann bottleneck refers to the fact that the processor and memory sit on opposite sides of a slow bus. If you want to compute something, you have to move the inputs across the bus to the processor, and then store the outputs back to memory when the computation completes. Your throughput is limited by the speed of the memory bus.
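To make that concrete, here's a minimal sketch in C of the kind of loop that runs into the bottleneck. The bandwidth figure in the comment is an assumed ballpark, not a measurement of any particular machine.

    #include <stddef.h>

    /* One multiply per element, but every element crosses the bus twice:
     * an 8-byte load of x[i] and an 8-byte store of y[i]. On a machine
     * with, say, ~50 GB/s of memory bandwidth (an assumed ballpark),
     * this tops out near 50e9 / 16, about 3 billion elements per second,
     * no matter how fast the core's arithmetic units are. */
    void scale(double *y, const double *x, double a, size_t n) {
        for (size_t i = 0; i < n; i++)
            y[i] = a * x[i];
    }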
Caches
Caches help to mitigate this problem for many workloads by keeping a small amount of frequently used data close to the processor. If your workload reuses a lot of data, as many do, then you'll benefit from caching. However, if you are processing a data set that's too big to fit in cache, or if your algorithm doesn't have good data reuse, it may not benefit much from caching. Think of processing a very large data set: you need to load all the data in and store it back out at least once. If you're lucky, your algorithm will only need to see each chunk of data once, and any reused values will stay in cache. If not, you may end up going over the memory bus much more than once per data element.
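Here's a rough sketch of the two extremes, again in C; the sizes mentioned in the comments are made up for illustration.

    #include <stddef.h>

    /* Streaming: each element is read exactly once, so the cache can't
     * help much; performance tracks memory bandwidth. */
    double stream_sum(const double *data, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += data[i];
        return s;
    }

    /* Reuse: a small working set (say, a few hundred KB that fits in L2)
     * is swept repeatedly; after the first pass, almost every access is
     * a cache hit and the memory bus is mostly idle. */
    double repeated_sum(const double *small_set, size_t m, int passes) {
        double s = 0.0;
        for (int p = 0; p < passes; p++)
            for (size_t i = 0; i < m; i++)
                s += small_set[i];
        return s;
    }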
Parallel processing
Parallel processing is a pretty broad term. Depending on how you do it, you may or may not get more bandwidth.
Shared Memory
The way shared memory processors are implemented today doesn't do much at all to solve the Von Neumann bottleneck. If anything, having more cores puts more strain on the bus, because now more processors need to fetch data from memory, and you need more bandwidth to feed all of them. Case in point: many parallel algorithms are memory-bound, and they can't make use of all the cores on modern multi-core chips, specifically because they can't fetch data fast enough. Core counts keep increasing, and bandwidth per core will likely keep decreasing, even if total bandwidth grows from one processor generation to the next.
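A STREAM-triad-style loop like the sketch below is the classic example (this assumes OpenMP; the kernel and names are just illustrative): it scales with thread count only until the shared bus saturates, after which extra cores mostly wait.

    #include <stddef.h>
    #include <omp.h>

    /* Memory-bound kernel: 2 loads + 1 store per iteration, 2 flops.
     * Adding threads helps only until the shared memory bus saturates;
     * past that point, extra cores mostly sit waiting on memory. */
    void triad(double *a, const double *b, const double *c,
               double scalar, long n) {
        #pragma omp parallel for
        for (long i = 0; i < n; i++)
            a[i] = b[i] + scalar * c[i];
    }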
NUMA
Modern memory buses are getting more and more complex, and you can do things to use them more effectively. For example, on NUMA machines, some memory banks are closer to some processors than others, and if you lay out your data accordingly, you can get more bandwidth than if you just blindly fetched from anywhere in RAM. But scaling shared memory is difficult -- see Distributed Shared Memory Machines for why it's going to be hard to scale shared memory machines to more than a few thousand cores.
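A common way to do that layout is first-touch placement, sketched below. This assumes Linux's default first-touch policy and threads pinned to cores (e.g. OMP_PROC_BIND=true); the kernel itself is just an example.

    #include <stdlib.h>
    #include <omp.h>

    /* Pages land on the NUMA node of the thread that first writes them,
     * so if the same static schedule is used later, most accesses hit
     * the local memory controller instead of the inter-socket link. */
    double *alloc_and_touch(long n) {
        double *x = malloc(n * sizeof *x);
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            x[i] = 0.0;                /* first touch places the page */
        return x;
    }

    void process(double *x, long n) {
        #pragma omp parallel for schedule(static)  /* same split as above */
        for (long i = 0; i < n; i++)
            x[i] = 2.0 * x[i] + 1.0;   /* mostly local-memory traffic */
    }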
Distributed Memory
Distributed memory machines are another type of parallel machine. These are often called clusters -- they're basically just a bunch of nodes on the same network working on a common task. You can get linear bandwidth scaling across a cluster if each processor fetches only from its local memory. But this requires you to lay out your data so that each processor has its own chunk. People call this data-parallel computing. If your problem is mostly data-parallel, you can probably make use of lots of processors, and you can use all your memory bandwidth in parallel. If you can't parallelize your workload, or if you can't break the data up into chunks so that each is processed mostly by one or a few nodes, then you're back to a sequential workload and you're still bound by the bandwidth of a single node's memory bus.
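Here's what that looks like as a minimal MPI sketch (the chunk size and the final reduction are placeholders): each rank touches only its local slice, so every node's memory bus works in parallel.

    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long local_n = 1 << 24;            /* elements per rank */
        double *chunk = malloc(local_n * sizeof *chunk);
        for (long i = 0; i < local_n; i++)       /* local init, local bus */
            chunk[i] = (double)(rank + i);

        double local_sum = 0.0, global_sum = 0.0;
        for (long i = 0; i < local_n; i++)       /* local compute */
            local_sum += chunk[i];

        /* One small message combines the per-rank results. */
        MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE,
                      MPI_SUM, MPI_COMM_WORLD);

        free(chunk);
        MPI_Finalize();
        return 0;
    }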
Processor in Memory (PIM)
People have looked at alternative node architectures to address the Von Neumann bottleneck. The one most commonly cited is probably processor-in-memory, or PIM. In these architectures, to get around memory bus issues, you embed small processors in the memory itself, kind of like a cluster but at a much smaller scale. Each tiny core can usually perform a few arithmetic operations on its local data, so those operations run very fast. As with clusters, though, it can be hard to lay your data out in a way that actually makes this kind of processing useful, but some algorithms can exploit it.
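Purely as a conceptual sketch of the programming model (this is ordinary host C, not code for any real PIM device, and the bank layout is invented for illustration):

    #include <stddef.h>

    #define NUM_BANKS 16      /* invented for illustration */

    /* Data is partitioned across memory banks; each bank's embedded core
     * applies a simple operation to its own slice, so elements never
     * cross the main memory bus. */
    struct bank {
        double *local_data;   /* data physically resident in this bank */
        size_t  n;
    };

    /* On real PIM hardware the inner loop would run on all the per-bank
     * cores at once; here the outer loop just stands in for them. */
    void pim_scale_all(struct bank banks[NUM_BANKS], double a) {
        for (int b = 0; b < NUM_BANKS; b++)
            for (size_t i = 0; i < banks[b].n; i++)
                banks[b].local_data[i] *= a;
    }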
Summary
In summary, the Von Neumann bottleneck in a general-purpose computer, where the processor can perform any operation on data from any address in memory, comes from the fact that you have to move the data to the processor to compute anything.
Simply building a parallel machine doesn't fix the problem, especially if all your cores are on the same side of the memory bus. If you're willing to have many processors and spread them out so that they are closer to some data than other data, then you can exploit data parallelism to get more bandwidth. Clusters and PIM systems are harder to program than single-core CPUs, though, and not every problem is fundamentally data-parallel. So the Von Neumann bottleneck is likely to be with us for some time.