Seeking knowledge on array of arrays memory performance

Question

Context: Multichannel real time digital audio processing.

Access pattern: "Column-major", like so:

for (int sample = 0; sample < size; ++sample)
{
    for (int channel = 0; channel < size; ++channel)
    {
        auto data = arr[channel][sample];
        // do some computations
    }
}

I'm seeking advice on how to make the life easier for the CPU and memory, in general. I realize interleaving the data would be better, but it's not possible.

My theory is, that as long as you sequentially access memory for a while, the CPU will prefetch it - will this hold for N (channel) buffers? What about size of the buffers, any "breaking points"?

Will it be very beneficial to have the channels in contiguous memory (increasing locality), or does that only hold for very small buffers (like, size of cache lines)? We could be talking buffersizes > 100 kb apart.

I guess there would also be a point where the time of the computational part makes memory optimizations negligible - ?

Is this a case, where manual prefetching makes sense?

I could test/profile my own system, but I only have that - 1 system. So any design choices I make may only positively affect that particular system. Any knowledge on these matters are appreciated, links, literature etc., platform specific knowledge.

Let me know if the question is too vague, I primarily thought it would be nice to have some wiki-ish experience / info on this area.

edit:

I created a program, that tests the three cases I mentioned (distant, adjecant and contiguous mentioned in supposedly increasing performance order), which tests these patterns on small and big data sets. Maybe people will run it and report anomalies.

#include <iostream>
#include <chrono>
#include <algorithm>

const int b = 196000;
const int s = 64 / sizeof(float);
const int extra_it = 16;
float sbuf1[s];
float bbuf1[b];

int main()
{

    float sbuf2[s];
    float bbuf2[b];
    float * sbuf3 = new float[s];
    float * bbuf3 = new float[b];
    float * sbuf4 = new float[s * 3];
    float * bbuf4 = new float[b * 3];
    float use = 0;

    while (1)
    {
        using namespace std;

        int c;
        bool sorb;

        cout << "small or big test (0/1)? ";
        if (!(cin >> sorb))
            return -1;

        cout << endl << "test distant buffers (0), contiguous access (1) or adjecant access (2)? ";

        if (!(cin >> c))
            return -1;


        auto t = std::chrono::high_resolution_clock::now();

        if (c == 0)
        {
            // "worst case scenario", 3 distant buffers constantly touched
            if (sorb)
            {
                for (int k = 0; k < b * extra_it; ++k)
                    for (int i = 0; i < s; ++i)
                    {
                        sbuf1[i] = k; // static memory
                        sbuf2[i] = k; // stack memory
                        sbuf3[i] = k; // heap memory
                    }
            }
            else
            {
                for (int k = 0; k < s * extra_it; ++k)
                    for (int i = 0; i < b; ++i)
                    {
                        bbuf1[i] = k; // static memory
                        bbuf2[i] = k; // stack memory
                        bbuf3[i] = k; // heap memory
                    }
            }

        }
        else if (c == 1)
        {
            // "best case scenario", only contiguous memory touched, interleaved
            if (sorb)
            {
                for (int k = 0; k < b * extra_it; ++k)
                    for (int i = 0; i < s * 3; i += 3)
                    {
                        sbuf4[i] = k;
                        sbuf4[i + 1] = k;
                        sbuf4[i + 2] = k;
                    }
            }
            else
            {
                for (int k = 0; k < s * extra_it; ++k)
                    for (int i = 0; i < b * 3; i += 3)
                    {
                        bbuf4[i] = k;
                        bbuf4[i + 1] = k;
                        bbuf4[i + 2] = k;
                    }
            }

        }
        else if (c == 2)
        {
            // "compromise", adjecant memory buffers touched
            if (sorb)
            {
                auto b1 = sbuf4;
                auto b2 = sbuf4 + s;
                auto b3 = sbuf4 + s * 2;

                for (int k = 0; k < b * extra_it; ++k)
                    for (int i = 0; i < s; ++i)
                    {
                        b1[i] = k;
                        b2[i] = k;
                        b3[i] = k;
                    }
            }
            else
            {
                auto b1 = bbuf4;
                auto b2 = bbuf4 + b;
                auto b3 = bbuf4 + b * 2;

                for (int k = 0; k < s * extra_it; ++k)
                    for (int i = 0; i < b; ++i)
                    {
                        b1[i] = k;
                        b2[i] = k;
                        b3[i] = k;
                    }
            }

        }
        else
            break;


        cout << chrono::duration_cast<chrono::milliseconds>(chrono::high_resolution_clock::now() - t).count() << " ms" << endl;

        // basically just touching the buffers, avoiding clever optimizations
        use += std::accumulate(sbuf1, sbuf1 + s, 0);
        use += std::accumulate(sbuf2, sbuf2 + s, 0);
        use += std::accumulate(sbuf3, sbuf3 + s, 0);
        use += std::accumulate(sbuf4, sbuf4 + s * 3, 0);

        use -= std::accumulate(bbuf1, bbuf1 + b, 0);
        use -= std::accumulate(bbuf2, bbuf2 + b, 0);
        use -= std::accumulate(bbuf3, bbuf3 + b, 0);
        use -= std::accumulate(bbuf4, bbuf4 + b * 3, 0);


    }

    std::cout << use;

    std::cin.get();
}

On my Intel i7-3740qm surprisingly, distant buffers consistently outperforms the more locality-friendly tests. It is close, however.

I suppose you couldn't write the code in a different language? [Fortran is fast with matrix ops](http://stackoverflow.com/questions/6030201/performance-of-fortran-matrix-operations) — erip, Dec 06 '15 at 21:52
Technically, but the problem is kinda language-agnostic as it revolves around the hardware. It is usually only in C/C++ you run into these problems (HPC) where it is relevant to consider. C++ happens to be the choice language for most audio applications. — Shaggi, Dec 06 '15 at 22:03
Is it possible to transpose the representation of the data to make the access type "row-major"? At least you would be sure of exploiting locality if it ever counts. — A.S.H, Dec 06 '15 at 22:25
That would be equivalent to interleaving the data. Yes, that would be better as said but it requires transforming all the memory anyway, making the algorithm a double-pass. It also makes other stuff more complex (ie. operations needing only one channel has to traverse N more memory in those cases). The pattern in the OP is the exceptional, uncommon case, I would rather not gimp the common cases for this. — Shaggi, Dec 06 '15 at 22:42
Real-time auio apps work with very *small* buffers. Access pattern won't likely matter. Your layout looks all right. — Karoly Horvath, Dec 07 '15 at 00:40
Yes, mostly. However you often work with audio history for the last couple of seconds, for instance, which is a pretty large amount of data. Hence, I've added a test that tests access patterns on small and large data sets. — Shaggi, Dec 07 '15 at 01:56

Seeking knowledge on array of arrays memory performance

0 Answers0