Context: Multichannel real time digital audio processing.
Access pattern: "Column-major", like so:
for (int sample = 0; sample < size; ++sample)
{
for (int channel = 0; channel < size; ++channel)
{
auto data = arr[channel][sample];
// do some computations
}
}
I'm seeking advice on how to make the life easier for the CPU and memory, in general. I realize interleaving the data would be better, but it's not possible.
My theory is, that as long as you sequentially access memory for a while, the CPU will prefetch it - will this hold for N (channel) buffers? What about size of the buffers, any "breaking points"?
Will it be very beneficial to have the channels in contiguous memory (increasing locality), or does that only hold for very small buffers (like, size of cache lines)? We could be talking buffersizes > 100 kb apart.
I guess there would also be a point where the time of the computational part makes memory optimizations negligible - ?
Is this a case, where manual prefetching makes sense?
I could test/profile my own system, but I only have that - 1 system. So any design choices I make may only positively affect that particular system. Any knowledge on these matters are appreciated, links, literature etc., platform specific knowledge.
Let me know if the question is too vague, I primarily thought it would be nice to have some wiki-ish experience / info on this area.
edit:
I created a program, that tests the three cases I mentioned (distant, adjecant and contiguous mentioned in supposedly increasing performance order), which tests these patterns on small and big data sets. Maybe people will run it and report anomalies.
#include <iostream>
#include <chrono>
#include <algorithm>
const int b = 196000;
const int s = 64 / sizeof(float);
const int extra_it = 16;
float sbuf1[s];
float bbuf1[b];
int main()
{
float sbuf2[s];
float bbuf2[b];
float * sbuf3 = new float[s];
float * bbuf3 = new float[b];
float * sbuf4 = new float[s * 3];
float * bbuf4 = new float[b * 3];
float use = 0;
while (1)
{
using namespace std;
int c;
bool sorb;
cout << "small or big test (0/1)? ";
if (!(cin >> sorb))
return -1;
cout << endl << "test distant buffers (0), contiguous access (1) or adjecant access (2)? ";
if (!(cin >> c))
return -1;
auto t = std::chrono::high_resolution_clock::now();
if (c == 0)
{
// "worst case scenario", 3 distant buffers constantly touched
if (sorb)
{
for (int k = 0; k < b * extra_it; ++k)
for (int i = 0; i < s; ++i)
{
sbuf1[i] = k; // static memory
sbuf2[i] = k; // stack memory
sbuf3[i] = k; // heap memory
}
}
else
{
for (int k = 0; k < s * extra_it; ++k)
for (int i = 0; i < b; ++i)
{
bbuf1[i] = k; // static memory
bbuf2[i] = k; // stack memory
bbuf3[i] = k; // heap memory
}
}
}
else if (c == 1)
{
// "best case scenario", only contiguous memory touched, interleaved
if (sorb)
{
for (int k = 0; k < b * extra_it; ++k)
for (int i = 0; i < s * 3; i += 3)
{
sbuf4[i] = k;
sbuf4[i + 1] = k;
sbuf4[i + 2] = k;
}
}
else
{
for (int k = 0; k < s * extra_it; ++k)
for (int i = 0; i < b * 3; i += 3)
{
bbuf4[i] = k;
bbuf4[i + 1] = k;
bbuf4[i + 2] = k;
}
}
}
else if (c == 2)
{
// "compromise", adjecant memory buffers touched
if (sorb)
{
auto b1 = sbuf4;
auto b2 = sbuf4 + s;
auto b3 = sbuf4 + s * 2;
for (int k = 0; k < b * extra_it; ++k)
for (int i = 0; i < s; ++i)
{
b1[i] = k;
b2[i] = k;
b3[i] = k;
}
}
else
{
auto b1 = bbuf4;
auto b2 = bbuf4 + b;
auto b3 = bbuf4 + b * 2;
for (int k = 0; k < s * extra_it; ++k)
for (int i = 0; i < b; ++i)
{
b1[i] = k;
b2[i] = k;
b3[i] = k;
}
}
}
else
break;
cout << chrono::duration_cast<chrono::milliseconds>(chrono::high_resolution_clock::now() - t).count() << " ms" << endl;
// basically just touching the buffers, avoiding clever optimizations
use += std::accumulate(sbuf1, sbuf1 + s, 0);
use += std::accumulate(sbuf2, sbuf2 + s, 0);
use += std::accumulate(sbuf3, sbuf3 + s, 0);
use += std::accumulate(sbuf4, sbuf4 + s * 3, 0);
use -= std::accumulate(bbuf1, bbuf1 + b, 0);
use -= std::accumulate(bbuf2, bbuf2 + b, 0);
use -= std::accumulate(bbuf3, bbuf3 + b, 0);
use -= std::accumulate(bbuf4, bbuf4 + b * 3, 0);
}
std::cout << use;
std::cin.get();
}
On my Intel i7-3740qm surprisingly, distant buffers consistently outperforms the more locality-friendly tests. It is close, however.