I wrote some code to calculate a moving L2 norm of two arrays.

void func_lstl2(const int &nx, const float x[], const int &ny, const float y[], int &shift, double &lstl2)
{    

    int maxshift = 200;
    int len_z = maxshift * 2;
    int len_work = len_z + ny;
    //initialize array work and array z
    double *z = new double[len_z]; float *work = new float[len_work];
    for (int i = 0; i < len_z; i++)
        z[i] = 0;
    for (int i = 0; i < len_work; i++)
        work[i] = 0;
    for (int i = 0; i < ny; i++)
        work[i + maxshift] = y[i];
    // do moving least square residue calculation
    float temp;
    for (int i = 0; i < len_z; i++)
    {
        for (int j = 0; j < nx; j++)
        {
            temp = x[j] - work[i + j];
            z[i] += temp * temp;
        }
    }
    // find the best fit value
    lstl2 = 1E30;
    shift = 0;
    for (int i = 0; i < len_z; i++)
    {
        if (z[i] < lstl2)
        {
            lstl2 = z[i];
            shift = i - maxshift;
        }
    }

    //end of program
    delete[] z;
    delete[] work;
}

I tested it on two pairs of arrays with exactly the same length and the same scale.

int shift; double lstl2;
func_lstl2(2000,z1,2000,z2,shift,lstl2) ;
func_lstl2(2000,x1,2000,x2,shift,lstl2) ;

For the z arrays it took 0.0032346 seconds; for the x arrays it took 0.0140903 seconds. I cannot figure out why the x arrays take roughly 4.4 times as long. Could you help me figure it out? Thank you very much! Here is the link to the z and x arrays: https://drive.google.com/file/d/1aONKTjE_7NI1bp8YkDL2CMfg9C5h67Fe/view?usp=sharing

YS_Jin
  • On a non-real-time OS you cannot be sure of the time taken by a function: there are many other processes (system and otherwise) that can interrupt your execution. Did you try gathering statistics by running the functions tens of times and taking an average? – Marco Beninca Jul 18 '22 at 06:30
  • Yes, I actually have 30 copies of the x and z arrays. I ran all of them and they all gave similar timings. – YS_Jin Jul 18 '22 at 06:33
  • You tested this with release-optimized toolchain configurations, *right* ?? – WhozCraig Jul 18 '22 at 06:34
  • Those two measurements, are they repeatable? How many samples did you take? What is the data in those arrays? BTW: You almost never need `new X[]`, use `vector` instead. – Ulrich Eckhardt Jul 18 '22 at 06:38
  • I tested in debug mode, not release mode. The reason I use arrays is that this code is part of a big project that involves both C and Fortran code. Fortran cannot deal with vector. – YS_Jin Jul 18 '22 at 06:43
  • The measurements are repeatable. I repeated the test with 30 different data sets (30 x1, 30 x2, 30 z1, 30 z2). They are copies from a big data set. I selected only one sample to demonstrate the time difference here. – YS_Jin Jul 18 '22 at 06:44
  • What confuses me is that the x and z arrays have exactly the same length and the same scale. They should be interchangeable. – YS_Jin Jul 18 '22 at 06:45

1 Answer

I strongly suspect you're dealing with denormalized (subnormal) floating-point calculation effects. Using your existing function, loading the values into vectors as appropriate, and turning it loose seven times on each provided input pair (compiled with -O3 optimization):

for (int i = 0; i < 7; ++i)
{
    int shift = 0;
    double lstl2 = 0;
    auto tp0 = steady_clock::now();
    func_lstl2(2000, v1.data(), 2000, v2.data(), shift, lstl2);
    auto tp1 = steady_clock::now();
    std::cout << pr[0] << ',' << pr[1] << ':';
    std::cout << duration_cast<milliseconds>(tp1 - tp0).count() << "ms\n";
}

I receive the following output, confirming your conundrum:

x1.txt,x2.txt:23ms
x1.txt,x2.txt:19ms
x1.txt,x2.txt:21ms
x1.txt,x2.txt:21ms
x1.txt,x2.txt:19ms
x1.txt,x2.txt:22ms
x1.txt,x2.txt:21ms
z1.txt,z2.txt:8ms
z1.txt,z2.txt:9ms
z1.txt,z2.txt:5ms
z1.txt,z2.txt:5ms
z1.txt,z2.txt:6ms
z1.txt,z2.txt:5ms
z1.txt,z2.txt:5ms

However, enabling denormals-are-zero (DAZ) and flush-to-zero (FTZ) for floating-point calculations (the mechanism for doing so is toolchain-dependent; below is clang 13.01 on macOS):

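#include <xmmintrin.h> // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h> // _MM_SET_DENORMALS_ZERO_MODE
_MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);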

delivers the following:

x1.txt,x2.txt:4ms
x1.txt,x2.txt:4ms
x1.txt,x2.txt:3ms
x1.txt,x2.txt:5ms
x1.txt,x2.txt:3ms
x1.txt,x2.txt:3ms
x1.txt,x2.txt:5ms
z1.txt,z2.txt:7ms
z1.txt,z2.txt:6ms
z1.txt,z2.txt:4ms
z1.txt,z2.txt:3ms
z1.txt,z2.txt:3ms
z1.txt,z2.txt:4ms
z1.txt,z2.txt:3ms
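
Since the DAZ and FTZ bits live in the MXCSR register and stay set for the rest of the calling thread, it may be worth limiting them to the hot loop. Below is a minimal sketch of one way to do that on x86 with SSE; the class name `ScopedFlushDenormals` is my own invention for illustration and is not part of any library.

```cpp
#include <xmmintrin.h> // _mm_getcsr, _mm_setcsr, _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h> // _MM_SET_DENORMALS_ZERO_MODE

// Hypothetical helper (the name is an assumption, x86/SSE only):
// enables FTZ and DAZ for the lifetime of the object, then restores
// whatever MXCSR value the thread had before.
class ScopedFlushDenormals {
public:
    ScopedFlushDenormals() : saved_(_mm_getcsr()) {
        _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
        _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
    }
    ~ScopedFlushDenormals() { _mm_setcsr(saved_); }

    // Copying would restore the saved state twice, so forbid it.
    ScopedFlushDenormals(const ScopedFlushDenormals&) = delete;
    ScopedFlushDenormals& operator=(const ScopedFlushDenormals&) = delete;

private:
    unsigned int saved_; // MXCSR value captured at construction
};
```

Declaring `ScopedFlushDenormals guard;` at the top of the block that calls func_lstl2 would then confine the changed rounding behaviour to that block. Keep in mind that results involving subnormal values will no longer match strict IEEE 754 arithmetic while the guard is alive.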

Your x-data set is sensitive to this; z does not appear to be. See this question for a better explanation.
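
If you want to check this hypothesis against your own data, a rough sketch like the following could count subnormal values directly. Both function names are made up for this example. The second one mirrors the `temp * temp` product from func_lstl2 for a single alignment only, on the assumption (which I have not verified against your files) that the squared float differences are where the subnormals appear.

```cpp
#include <cmath>   // std::fpclassify, FP_SUBNORMAL
#include <cstddef> // std::size_t

// Hypothetical helper: counts how many elements of data are subnormal floats.
std::size_t count_subnormal(const float* data, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (std::fpclassify(data[i]) == FP_SUBNORMAL) {
            ++count;
        }
    }
    return count;
}

// Hypothetical helper: counts how many squared differences (x[i] - y[i])^2
// are subnormal when evaluated in float, as temp * temp is in func_lstl2.
std::size_t count_subnormal_squares(const float* x, const float* y, std::size_t n) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const float temp = x[i] - y[i];
        const float sq = temp * temp; // float product, may underflow to a subnormal
        if (std::fpclassify(sq) == FP_SUBNORMAL) {
            ++count;
        }
    }
    return count;
}
```

Running both on the x and z inputs, without DAZ/FTZ enabled and without -ffast-math, should show whether the x set produces noticeably more subnormals than the z set. Note that func_lstl2 also subtracts from the zero-padded parts of `work`, so very small values in x alone can yield subnormal squares.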

WhozCraig
  • Thank you WhozCraig!! I was confused by this problem for 3 days! Your answer really taught me a lesson on this! – YS_Jin Jul 18 '22 at 13:52