I have read a lot about cache-oblivious algorithms, streaming trees, etc. I understand the basics; what I still can't see is why they are good for parallel programming. I think I saw Jon Harrop state that they are revolutionary for that.
3 Answers
The article http://www.1024cores.net/home/parallel-computing/cache-oblivious-algorithms points out that:
The idea behind cache-oblivious algorithms is efficient usage of processor caches and reduction of memory bandwidth requirements. Both things are equally important for single-threaded algorithms, but especially crucial for parallel algorithms, because available memory bandwidth is usually shared between hardware threads and frequently becomes a bottleneck for scalability.
Access to memory can be a bottleneck in parallel algorithms, so algorithms that use the processor caches efficiently can scale better.
In the same article, they go on to describe how cache-oblivious algorithms take advantage of whatever cache is available:
Cache-oblivious algorithms work by recursively dividing a problem's dataset into smaller parts and then doing as much computations of each part as possible. Eventually subproblem dataset fits into cache, and we can do significant amount of computations on it without accessing memory
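To make the recursive-division idea concrete, here is a minimal sketch (my own illustration, not code from the article) of a cache-oblivious matrix transpose. The recursion keeps splitting along the larger dimension, so at some depth every subproblem fits in each level of the cache hierarchy, even though the code never mentions a cache size:

```python
def transpose(src, dst, r0, r1, c0, c1):
    """Write the transpose of src[r0:r1][c0:c1] into dst (lists of lists)."""
    rows, cols = r1 - r0, c1 - c0
    if rows <= 2 and cols <= 2:
        # Base case: the block is tiny, transpose it directly.
        for i in range(r0, r1):
            for j in range(c0, c1):
                dst[j][i] = src[i][j]
    elif rows >= cols:
        # Split along the larger dimension so blocks stay roughly square.
        mid = (r0 + r1) // 2
        transpose(src, dst, r0, mid, c0, c1)
        transpose(src, dst, mid, r1, c0, c1)
    else:
        mid = (c0 + c1) // 2
        transpose(src, dst, r0, r1, c0, mid)
        transpose(src, dst, r0, r1, mid, c1)

a = [[1, 2, 3],
     [4, 5, 6]]
t = [[0] * 2 for _ in range(3)]
transpose(a, t, 0, 2, 0, 3)
# t is now [[1, 4], [2, 5], [3, 6]]
```

In a real implementation the base case would be much larger (e.g. a 16x16 block) to amortize the recursion overhead, but the access pattern is the same at every scale.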

I just want to point out that your question is particularly important on multicore architectures, where each core has its own private cache and several cores share one or more cache levels. To achieve high efficiency and scalability, a parallel algorithm has to exhibit good spatial and temporal locality in the data caches.
Before Harald Prokop's master's thesis, algorithms and data structures were designed in a cache-aware (cache-conscious) way to reduce the ratio of cache misses. The B-tree is a well-known example of a cache-aware data structure, in which the parameter B is tuned to the CPU cache size. In the cache-oblivious model, due to the recursive nature of the algorithms, subproblems eventually fit in cache, and manipulating such subproblems incurs only a small number of cache misses.
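The contrast between the two design styles can be sketched on a trivial summation kernel (my own illustration; the block size `BLOCK` is an assumed, machine-specific tuning parameter). The cache-aware version bakes in an explicit block size; the cache-oblivious version just halves the range recursively, so some level of the recursion matches every cache size automatically:

```python
BLOCK = 4096  # cache-aware tuning knob: assumed to match one cache level

def sum_cache_aware(xs):
    """Cache-aware style: iterate in explicitly sized blocks."""
    total = 0
    for start in range(0, len(xs), BLOCK):
        for x in xs[start:start + BLOCK]:
            total += x
    return total

def sum_cache_oblivious(xs, lo=0, hi=None):
    """Cache-oblivious style: recursive halving, no cache parameter.

    Subranges shrink until they fit in L1, L2, ... without the code
    knowing any of those sizes.
    """
    if hi is None:
        hi = len(xs)
    if hi - lo <= 64:  # small fixed base case, not a cache size
        return sum(xs[lo:hi])
    mid = (lo + hi) // 2
    return sum_cache_oblivious(xs, lo, mid) + sum_cache_oblivious(xs, mid, hi)
```

Summation is a streaming computation, so it gains little here; the point is purely the structural difference: retuning the cache-aware version for a new machine means changing `BLOCK`, while the cache-oblivious version needs no change at all.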
Cache-oblivious algorithms have some nice properties: they are independent of CPU cache sizes, work well on any memory hierarchy, and are proven to be optimal in cache complexity. With the rise of multicore parallelism, cache-oblivious algorithms may play an important role in deriving performant parallel programs. There is also some interesting discussion of recursive data structures and cache-oblivious algorithms in the following article: http://blogs.msdn.com/b/devdev/archive/2007/06/12/cache-oblivious-data-structures.aspx.

Multicore processors have less cache per core. The cache in a dedicated single-core processor takes up a huge amount of space on the silicon. You can see for yourself with a quick image search for processor die shots; the cache areas are huge, e.g. http://www.bit-tech.net/hardware/memory/2007/11/15/the_secrets_of_pc_memory_part_1/2
Thus if you have 20 cores and each has 1/20 of the cache of a single-core processor, you'd better hope your algorithms perform well without a giant cache. =)
