Size of numbers in float array have significant impact on performance - Why?

Question

UPDATE 2:

I've switched to initializing all arrays with fixed numbers. Why is the loop about 10x faster when it multiplies damping[i] by modeDampingTermsExp[i] than when it multiplies by modeDampingTermsExp2[i]?

Full code

#include <iostream>
#include <chrono>

using namespace std;

int bufferWriteIndex = 0;
float curSample = 0;
float tIncr = 0.1f;

float modeGainsTimesModeShapes[25] = { -0.144338, -1.49012e-08, -4.3016e-09, 7.45058e-09, -0, -0.25, 
-1.49012e-08, 4.77374e-16, -7.45058e-09, 0, -0.288675, 0, 4.3016e-09, 3.55271e-15, -0, -0.25, 
1.49012e-08, -1.4512e-15, 7.45058e-09, 0, -0.144338, 1.49012e-08, -4.30159e-09, -7.45058e-09, -0 };

float modeDampingTermsString[5] = { -8.03847, -30, -60, -90, -111.962 };
float damping[5] = { 1, 1, 1, 1, 1 };
float modeFrequenciesArr[5] = { 71419.1, 266564, 533137, 799710, 994855 };

float modeDampingTermsExp[5] = { 0.447604, 0.0497871, 0.00247875, 0.00012341, 1.37263e-05 };
float modeDampingTermsExp2[5] = { -0.803847, -3, -6, -9, -11.1962 };


int main(int argc, char** argv) {

    float subt = 0;
    int subWriteIndex = 0;
    auto now = std::chrono::high_resolution_clock::now();


    while (true) {

        curSample = 0;

        for (int i = 0; i < 5; i++) {

            //Slow version
            //damping[i] = damping[i] * modeDampingTermsExp2[i];

            //Fast version
            damping[i] = damping[i] * modeDampingTermsExp[i];
            float cosT = 2 * damping[i];

            for (int m = 0; m < 5; m++) {
                curSample += modeGainsTimesModeShapes[i * 5 + m] * cosT;

            }
        }

        //t += tIncr;
        bufferWriteIndex++;


        //measure calculations per second
        auto elapsed = std::chrono::high_resolution_clock::now() - now;
        if ((elapsed / std::chrono::milliseconds(1)) > 1000) {
            now = std::chrono::high_resolution_clock::now();
            int idx = bufferWriteIndex;
            cout << idx - subWriteIndex << endl;
            subWriteIndex = idx;
        }

    }
}

UPDATE 1:

I changed tIncr to 0.1f to avoid a possible subnormal number, as mentioned by @JaMiT. I've also removed t += tIncr; from the calculation.

Full code:

#include <iostream>
#include <chrono>
#include <cmath>

using namespace std;

int bufferWriteIndex = 0;
float curSample = 0;
float t = 0;
float tIncr = 0.1f;

float modeGainsTimesModeShapes[25] = { -0.144338, -1.49012e-08, -4.3016e-09, 7.45058e-09, -0, -0.25, 
-1.49012e-08, 4.77374e-16, -7.45058e-09, 0, -0.288675, 0, 4.3016e-09, 3.55271e-15, -0, -0.25, 
1.49012e-08, -1.4512e-15, 7.45058e-09, 0, -0.144338, 1.49012e-08, -4.30159e-09, -7.45058e-09, -0 };

float modeDampingTermsString[5] = { -8.03847, -30, -60, -90, -111.962 };
float damping[5] = { 1, 1, 1, 1, 1 };
float modeFrequenciesArr[5] = { 71419.1, 266564, 533137, 799710, 994855 };
float modeDampingTermsExp[5];


int main(int argc, char** argv) {


    /*
    for (int m = 0; m < 5; m++) {
        modeDampingTermsExp[m] = exp(modeDampingTermsString[m] * tIncr);
    }*/


    for (int m = 0; m < 5; m++) {
        modeDampingTermsExp[m] = modeDampingTermsString[m] * tIncr;
    }


    //std::thread t1(audioStringSimCos);
    //t1.detach();
    float subt = 0;
    int subWriteIndex = 0;
    auto now = std::chrono::high_resolution_clock::now();


    while (true) {

        curSample = 0;

        for (int i = 0; i < 5; i++) {

            damping[i] = damping[i] * modeDampingTermsExp[i];
            float cosT = 2 * damping[i] * cos(t * modeFrequenciesArr[i]);

            for (int m = 0; m < 5; m++) {
                curSample += modeGainsTimesModeShapes[i * 5 + m] * cosT;

            }
        }

        //t += tIncr;
        bufferWriteIndex++;


        //measure calculations per second
        auto elapsed = std::chrono::high_resolution_clock::now() - now;
        if ((elapsed / std::chrono::milliseconds(1)) > 1000) {
            now = std::chrono::high_resolution_clock::now();
            int idx = bufferWriteIndex;
            cout << idx - subWriteIndex << endl;
            subWriteIndex = idx;
        }

    }
}

Now it runs faster WITH the exp in the initialization?

With exp, the output is around 7 million/s; without exp, it is around 1.5 million/s. This is confusing to me.

ORIGINAL POST:

It seems that the way I initialize modeDampingTermsExp in my small example has a huge impact on the performance of the calculations that later access it. Here is my minimal, reproducible example:

#include <iostream>
#include <chrono>
#include <cmath>

using namespace std;

int bufferWriteIndex = 0;
float curSample = 0;
float t = 0;
float tIncr = 1.0f / 48000;

float modeGainsTimesModeShapes[25] = { -0.144338, -1.49012e-08, -4.3016e-09, 7.45058e-09, -0, -0.25, 
-1.49012e-08, 4.77374e-16, -7.45058e-09, 0, -0.288675, 0, 4.3016e-09, 3.55271e-15, -0, -0.25, 
1.49012e-08, -1.4512e-15, 7.45058e-09, 0, -0.144338, 1.49012e-08, -4.30159e-09, -7.45058e-09, -0 };

float modeDampingTermsString[5] = { -8.03847, -30, -60, -90, -111.962 };
float damping[5] = { 1, 1, 1, 1, 1 };
float modeFrequenciesArr[5] = { 71419.1, 266564, 533137, 799710, 994855 };
float modeDampingTermsExp[5];


int main(int argc, char** argv) {


    /*
    for (int m = 0; m < 5; m++) {
        modeDampingTermsExp[m] = exp(modeDampingTermsString[m] * tIncr);
    }*/


    for (int m = 0; m < 5; m++) {
        modeDampingTermsExp[m] = modeDampingTermsString[m] * tIncr;
    }


    //std::thread t1(audioStringSimCos);
    //t1.detach();
    float subt = 0;
    int subWriteIndex = 0;
    auto now = std::chrono::high_resolution_clock::now();


    while (true) {

        curSample = 0;

        for (int i = 0; i < 5; i++) {

            damping[i] = damping[i] * modeDampingTermsExp[i];
            float cosT = 2 * damping[i] * cos(t * modeFrequenciesArr[i]);

            for (int m = 0; m < 5; m++) {
                curSample += modeGainsTimesModeShapes[i * 5 + m] * cosT;

            }
        }

        t += tIncr;
        bufferWriteIndex++;


        //measure calculations per second
        auto elapsed = std::chrono::high_resolution_clock::now() - now;
        if ((elapsed / std::chrono::milliseconds(1)) > 1000) {
            now = std::chrono::high_resolution_clock::now();
            int idx = bufferWriteIndex;
            cout << idx - subWriteIndex << endl;
            subWriteIndex = idx;
        }
    }
}

When I initialize it like this

for (int m = 0; m < 5; m++) {
    modeDampingTermsExp[m] = exp(modeDampingTermsString[m] * tIncr);
}

using the exp function, performance is about 10 times slower than like this:

for (int m = 0; m < 5; m++) {
    modeDampingTermsExp[m] = modeDampingTermsString[m] * tIncr;
}

I measure the calculations per second using chrono, right below the two nested for loops in the endless while(true) loop (snippet of the full example above):

//measure calculations per second
auto elapsed = std::chrono::high_resolution_clock::now() - now;
if ((elapsed / std::chrono::milliseconds(1)) > 1000) {
    now = std::chrono::high_resolution_clock::now();
    int idx = bufferWriteIndex;
    cout << idx - subWriteIndex << endl;
    subWriteIndex = idx;
}

Using the exp function, the output stays at around 390k "samples" per second.

Using the other initialization, without exp, I get around 3 - 3.5 million "samples" per second.

Why does the way I initialize the modeDampingTermsExp array impact performance later in the code where I access it? What am I missing here?

I am using Visual Studio 2019 with the following flags: /O2 /Oi /Ot /fp:fast

Thank you very much!

Global `now` does not seem to be used at all. Why `#include `? The code is not _minimal_. Can you get rid of all unnecessary stuff? — Daniel Langr, May 16 '20 at 13:23
I don't know why in detail, but `damping[i]` becomes 0 without `exp` and takes values around `1e-042` with `exp`. Multiplying by zero may be optimized. — MikeCAT, May 16 '20 at 13:24
You are taking the 48,000th root of a number. What is the result? Do you end up in the realm of [subnormals](https://stackoverflow.com/questions/8341395/what-is-a-subnormal-floating-point-number)? If so, [Why does changing 0.1f to 0 slow down performance by 10x?](https://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x/9314926#9314926) seems relevant. — JaMiT, May 16 '20 at 13:25
@JaMiT Thanks! I've updated my original question. Now performance is faster using `exp` at initialization? Why? — atie, May 16 '20 at 13:59
@MikeCAT I'm not sure what exactly you mean. Maybe you can look at the update I've posted, but `damping[i]` shouldn't become very close to zero now. — atie, May 16 '20 at 14:01
I inserted `for (int i = 0; i < 5; i++) cout << damping[i] << " "; cout << endl;` after `cout << idx - subWriteIndex << endl;`. I got `2.8026e-045 1.#INF 1.#INF 1.#INF 1.#INF` with `modeDampingTermsExp2` and got `0 0 0 0 0` with `modeDampingTermsExp` from UPDATE 2 code. I'm using g++ (GCC) 4.8.1 on Windows. — MikeCAT, May 16 '20 at 18:12
