I have a piece of code that iterates over an array of structs, modifying each element:
#include <cstddef>

const size_t nThreads = 4;

struct foo {
    int a = 0;
    float b = 10;
};

void process(foo* array, size_t i, size_t end) {
    for (; i < end; i++) array[i].a += array[i].b;
}

int main() {
    foo fooArray[512];
    process(fooArray, 0, 512);
    return 0;
}
Now, I plan to take advantage of the fact that no other code accesses fooArray and that each element is independent of the others, and split the work of process() over different parts of fooArray, each part in its own thread.
I know that each core usually has its own private cache (L1 at least), and that when two cores' caches hold copies of the same cache line, a write by one core forces that line to be synchronized with the other caches holding it, causing stalls. I also know that access to RAM over the FSB is usually serialized, which can also cause stalls when different cores miss their caches frequently.
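To make the numbers concrete, here is a small sketch, assuming a 64-byte cache line (typical on x86, but not guaranteed; C++17 exposes an estimate as std::hardware_destructive_interference_size in <new>, where supported). It just prints how many foo elements share one line:

#include <cstdio>
#include <cstddef>

struct foo {
    int a = 0;
    float b = 10;
};

int main() {
    constexpr size_t lineSize = 64; // assumed cache-line size
    std::printf("sizeof(foo)       = %zu\n", sizeof(foo));            // typically 8
    std::printf("elements per line = %zu\n", lineSize / sizeof(foo)); // typically 8
    return 0;
}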
Then the question is: with all this in mind, what is the best pattern for accessing fooArray if I want to minimize cache misses and data stalls: by stripes or by slices?
By slices:
void process(foo* array, size_t i, size_t end) {
    for (; i < end; i++) array[i].a += array[i].b;
}

// inside main(), with <thread> and <vector> included
std::vector<std::thread> threads;
for (size_t i = 0; i < nThreads; i++) {
    threads.emplace_back([&fooArray, i]() {
        // end is exclusive, so no -1 here
        process(fooArray, i * 512 / nThreads, (i + 1) * 512 / nThreads);
    });
}
for (auto& t : threads) t.join(); // join only after all threads are launched
In this case, each thread iterates over one slice of the array (thread t1 goes through 0 to 127, t2 from 128 to 255, t3 from 256 to 383 and t4 from 384 to 511). I believe this would minimize data stalls from cache synchronization between cores, since each slice can live in a different core's cache, but it may increase stalls due to serialized FSB access if all cores try to fetch a cache line at the same time.
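As an aside, the bounds above only divide evenly because 512 % nThreads == 0. A general slicing helper for counts that don't divide evenly could look like this sketch (sliceBounds is a name I made up; it spreads the remainder over the first slices):

#include <cstddef>
#include <utility>

// Hypothetical helper: [begin, end) of slice t out of n over count elements,
// giving the first count % n slices one extra element each.
std::pair<size_t, size_t> sliceBounds(size_t t, size_t n, size_t count) {
    const size_t base = count / n, rem = count % n;
    const size_t begin = t * base + (t < rem ? t : rem);
    return {begin, begin + base + (t < rem ? 1 : 0)};
}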
By stripes (note the i += nThreads):
void process(foo* array, size_t i, size_t end) {
    for (; i < end; i += nThreads) array[i].a += array[i].b;
}

// inside main(), as before
std::vector<std::thread> threads;
for (size_t i = 0; i < nThreads; i++) {
    threads.emplace_back([&fooArray, i]() {
        // thread i starts at element i and strides by nThreads
        process(fooArray, i, 512);
    });
}
for (auto& t : threads) t.join();
In this case, t1 would iterate over elements 0, 4, 8, 12, ..., 508, t2 over 1, 5, 9, 13, ..., 509, t3 over 2, 6, 10, 14, ..., 510 and t4 over 3, 7, 11, 15, ..., 511. I believe this would minimize FSB accesses because each cache miss would trigger a fetch from RAM only once, but data stalls due to cache synchronization would increase.
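A back-of-the-envelope check of that increase, assuming sizeof(foo) == 8 and a 64-byte cache line (typical values, not guaranteed): on every iteration, all four threads write to elements sitting in the same cache line, so the line would ping-pong between cores (false sharing):

#include <cstdio>
#include <cstddef>

int main() {
    constexpr size_t lineSize = 64;                 // assumed cache-line size
    constexpr size_t fooSize  = 8;                  // assumed sizeof(foo)
    constexpr size_t perLine  = lineSize / fooSize; // 8 elements per line
    const size_t nThreads = 4;
    // First two iterations of each stripe: every write lands in cache line 0.
    for (size_t iter = 0; iter < 2; iter++)
        for (size_t t = 0; t < nThreads; t++) {
            size_t elem = t + iter * nThreads;
            std::printf("iter %zu, thread %zu -> element %zu, line %zu\n",
                        iter, t, elem, elem / perLine);
        }
    return 0;
}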