I have a piece of code that iterates over an array of structs, modifying each element:
#include <cstddef>

const size_t nThreads = 4;

struct foo {
    int a = 0;
    float b = 10;
};

void process(foo* array, size_t i, size_t end) {
    for (; i < end; i++) array[i].a += array[i].b;
}

int main() {
    foo fooArray[512];
    process(fooArray, 0, 512);
    return 0;
}
Now, I plan to take advantage of the fact that no other code accesses fooArray and that each element is independent of the others, and split the work of process() over different parts of fooArray, each part in its own thread.
I know that each core usually has its own private cache (L1 at least), and that when two cores' caches hold copies of the same cache line, a write by one core forces that line to be synchronized with the other caches holding it, causing stalls. I also know that access to RAM over the FSB is usually serialized, which can also cause stalls when different cores miss their caches frequently.
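To make the numbers concrete, here is a small sketch, assuming a 64-byte cache line (typical on x86, but not guaranteed; C++17 exposes an estimate as std::hardware_destructive_interference_size in <new>, where supported). It just prints how many foo elements share one line:

#include <cstdio>
#include <cstddef>

struct foo {
    int a = 0;
    float b = 10;
};

int main() {
    constexpr size_t lineSize = 64; // assumed cache-line size
    std::printf("sizeof(foo)       = %zu\n", sizeof(foo));            // typically 8
    std::printf("elements per line = %zu\n", lineSize / sizeof(foo)); // typically 8
    return 0;
}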
Then the question is: with all this in mind, what is the best pattern for accessing fooArray if I want to minimize cache misses and data stalls: by stripes or by slices?
By slices:
void process(foo* array, size_t i, size_t end) {
    for (; i < end; i++) array[i].a += array[i].b;
}

// inside main(), with <thread> and <vector> included
std::vector<std::thread> threads;
for (size_t i = 0; i < nThreads; i++) {
    threads.emplace_back([&fooArray, i]() {
        // end is exclusive, so no -1 here
        process(fooArray, i * 512 / nThreads, (i + 1) * 512 / nThreads);
    });
}
for (auto& t : threads) t.join(); // join only after all threads are launched
In this case, each thread iterates over one slice of the array (thread t1 goes through 0 to 127, t2 from 128 to 255, t3 from 256 to 383 and t4 from 384 to 511). I believe this would minimize data stalls from cache synchronization between cores, since each slice can live in a different core's cache, but it may increase stalls due to serialized FSB access if all cores try to fetch a cache line at the same time.
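As an aside, the bounds above only divide evenly because 512 % nThreads == 0. A general slicing helper for counts that don't divide evenly could look like this sketch (sliceBounds is a name I made up; it spreads the remainder over the first slices):

#include <cstddef>
#include <utility>

// Hypothetical helper: [begin, end) of slice t out of n over count elements,
// giving the first count % n slices one extra element each.
std::pair<size_t, size_t> sliceBounds(size_t t, size_t n, size_t count) {
    const size_t base = count / n, rem = count % n;
    const size_t begin = t * base + (t < rem ? t : rem);
    return {begin, begin + base + (t < rem ? 1 : 0)};
}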
By stripes (note the i += nThreads):
void process(foo* array, size_t i, size_t end) {
    for (; i < end; i += nThreads) array[i].a += array[i].b;
}

// inside main(), as before
std::vector<std::thread> threads;
for (size_t i = 0; i < nThreads; i++) {
    threads.emplace_back([&fooArray, i]() {
        // thread i starts at element i and strides by nThreads
        process(fooArray, i, 512);
    });
}
for (auto& t : threads) t.join();
In this case, t1 would iterate over elements 0, 4, 8, 12, ..., 508, t2 over 1, 5, 9, 13, ..., 509, t3 over 2, 6, 10, 14, ..., 510 and t4 over 3, 7, 11, 15, ..., 511. I believe this would minimize FSB accesses because each cache miss would trigger a fetch from RAM only once, but data stalls due to cache synchronization would increase.
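A back-of-the-envelope check of that increase, assuming sizeof(foo) == 8 and a 64-byte cache line (typical values, not guaranteed): on every iteration, all four threads write to elements sitting in the same cache line, so the line would ping-pong between cores (false sharing):

#include <cstdio>
#include <cstddef>

int main() {
    constexpr size_t lineSize = 64;                 // assumed cache-line size
    constexpr size_t fooSize  = 8;                  // assumed sizeof(foo)
    constexpr size_t perLine  = lineSize / fooSize; // 8 elements per line
    const size_t nThreads = 4;
    // First two iterations of each stripe: every write lands in cache line 0.
    for (size_t iter = 0; iter < 2; iter++)
        for (size_t t = 0; t < nThreads; t++) {
            size_t elem = t + iter * nThreads;
            std::printf("iter %zu, thread %zu -> element %zu, line %zu\n",
                        iter, t, elem, elem / perLine);
        }
    return 0;
}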