
I made a program that allocates two arrays of C++ doubles. The first contains the values of sin(x) for x from 0 to pi. Into the second one I write the second derivatives of sin(x), calculated using the three-point formula:

(vals[i-1]-2*vals[i]+vals[i+1])/(dx*dx)

I perform these calculations for arrays of different sizes, repeat each one a number of times, and output the average calculation time. That way I get a nice graph showing the time needed to calculate the derivatives for an array of a given size.
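
(For reference, the formula above is the standard central-difference approximation of the second derivative. Adding the Taylor expansions of f(x-dx) and f(x+dx) around x cancels the odd-order terms:

f(x-dx) + f(x+dx) = 2*f(x) + f''(x)*dx*dx + O(dx^4)

so

(f(x-dx) - 2*f(x) + f(x+dx))/(dx*dx) = f''(x) + O(dx^2),

i.e. the truncation error shrinks quadratically with the step.)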

It all sounds fairly easy, but I've encountered a weird problem. Take a look at the graph: [Graph] When the arrays are small, nothing weird happens. However, when they are larger than 10000 elements (with two arrays that means about 160 kB of memory), I get periodic peaks, each about 512 elements after the previous one. This happens on several CPUs (AMD Sempron 3850, Intel Q9400 and Intel Xeon 5310), and doesn't happen on others (Intel i5 5200U, Intel Nehalem-C).

As if that wasn't enough, when the arrays reach about 65,000 elements, the AMD 3850 on Windows 7 shows a sudden increase in time. This doesn't happen on Debian.

I think it might be somehow related to CPU caching, given the 'magical' element counts, and to the OS's scheduling, but I can't think of a specific explanation or a way to confirm it (other than profiling the CPU cache hit/miss ratio).

Code:

#include <cmath>
#include <cstdio>

// msclock(): clock()/CLOCKS_PER_SEC on Linux, QueryPerformanceCounter on Windows (see below)
double msclock();

int main(int, char**)
    {   
    double maxerr=0.0;
    double avgerr=0.0;

    for(unsigned SIZE=5000u; SIZE<100000u; SIZE+=1u)
        {
        const unsigned REPEATS=10000000/SIZE+1;
        const double STEP=1.0/(SIZE-1.0)*atan(1.0)*4.0;

        double *vals;
        double *derivs;

        // Allocation
        vals=  new double[SIZE];
        derivs=new double[SIZE];

        // Initialization
        for(unsigned i=0u; i<SIZE; ++i)
            {
            vals[i]=sin(i*STEP);
            derivs[i]=0.0;
            }

        // Normal (FPU) calculations
        const double TIME_FPU_START=msclock();

        for(unsigned r=0u; r<REPEATS; ++r)
            {
            const double STEP2RCP=1.0/(STEP*STEP);

            for(unsigned i=1u; i<SIZE-1u; ++i)
                {
                derivs[i]=(vals[i-1]-2.0*vals[i]+vals[i+1u])*STEP2RCP;
                }
            }

        const double TIME_FPU_END=msclock();
        const double TIME_FPU=(TIME_FPU_END-TIME_FPU_START)/REPEATS;

        maxerr=0.0;
        avgerr=0.0;

        for(unsigned i=1u; i<SIZE-1; ++i)
            {
            double err=fabs((derivs[i]+sin(i*STEP))/(-sin(i*STEP)));
            avgerr+=err/(SIZE-2);

            if(err>maxerr)
                {
                //printf("%f > %f\n", derivs[i], -sin(i*STEP));
                maxerr=err;
                }
            }

        const double MAX_REL_ERR_FPU=maxerr;
        const double AVG_REL_ERR_FPU=avgerr;

        // Summary
        fprintf(stdout, "%u  %.8f  %.8f\n", SIZE, TIME_FPU, AVG_REL_ERR_FPU);
        fprintf(stderr, "%u  %.8f  %.8f\n", SIZE, TIME_FPU, AVG_REL_ERR_FPU);

        // Deallocation
        delete [] vals;
        delete [] derivs;
        }

    return 0;
    }

The "msclock" function is plain old clock()/CLOCKS_PER_SEC on linux, and QueryPerformanceTimer on Windows (clock() and all its variants on Windows seem to have resolution of about 16ms.

Data file: https://pastebin.com/4iDn5Y9k (the first column is the array size, the second the average time, and the last the average error of the derivative).

Additional info: I've checked the addresses of the arrays I'm getting. The first one (vals) seems to stay at the same address all the time, while the second one (derivs) keeps increasing. This probably means that the beginning of the second array sits in the same block as the end of the first one. With a page size of 4 KiB, an increase of 512 elements (512 * 8 B = 4096 B) might require using another page.
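
(The extra address columns below come from printing the two pointers right after allocation. A stand-alone sketch of the same check, which also shows each array's offset within a 4 KiB page:)

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main()
    {
    for(unsigned SIZE=50670u; SIZE<=50720u; SIZE+=10u)
        {
        double *vals  =new double[SIZE];
        double *derivs=new double[SIZE];

        // Address of each array and its offset within a 4 KiB page
        fprintf(stdout, "%u  %p %p  %4zu %4zu\n", SIZE,
                (void*)vals, (void*)derivs,
                (size_t)((uintptr_t)vals%4096u),
                (size_t)((uintptr_t)derivs%4096u));

        delete [] vals;
        delete [] derivs;
        }
    return 0;
    }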

Additional info: On Debian, the peaks are closely related to the addresses of the arrays (columns 4 and 5):

50670  0.00021090  0.00000016  0x55efa5e04c30 0x55efa5e67bb0  
50680  0.00021244  0.00000017  0x55efa5e04c30 0x55efa5e67c00  
50690  0.00170968  0.00000023  0x7f1617d0b010 0x7f1617ca7010 @
50700  0.00021141  0.00000024  0x55efa5e04c30 0x55efa5e67ca0 @
50710  0.00021873  0.00000027  0x55efa5e04c30 0x55efa5e67cf0  
50720  0.00021126  0.00000002  0x55efa5e04c30 0x55efa5e67d40  

Additional info: For arrays smaller than 10,000 elements, both arrays always end up at the same addresses. For larger sizes, the second array keeps moving around. That might be the reason I'm not getting peaks for small sizes.
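
One way to take the allocator's placement decisions out of the picture would be to page-align both arrays explicitly. A Linux-only sketch using posix_memalign (not what the benchmark above does; the rest of the loop would stay the same):

#include <cstdlib>
#include <cstdio>

int main()
    {
    const unsigned SIZE=50000u;

    // Page-align both arrays so their placement no longer depends on
    // whatever new[] happens to return for a given SIZE.
    double *vals=nullptr;
    double *derivs=nullptr;
    if(posix_memalign((void**)&vals,   4096, SIZE*sizeof(double))!=0) return 1;
    if(posix_memalign((void**)&derivs, 4096, SIZE*sizeof(double))!=0) return 1;

    fprintf(stdout, "%p %p\n", (void*)vals, (void*)derivs);

    // Memory from posix_memalign is released with free(), not delete[]
    free(vals);
    free(derivs);

    return 0;
    }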

Also, I've played around with "perf stat", as @geza suggested. First, the output for sizes from 5000 to 12930:

 Performance counter stats for './a.out':

      70733.635707      task-clock (msec)         #    0.999 CPUs utilized          
             6,008      context-switches          #    0.085 K/sec                  
                 3      cpu-migrations            #    0.000 K/sec                  
             3,866      page-faults               #    0.055 K/sec                  
    90,800,323,159      cycles                    #    1.284 GHz                      (50.00%)
                 0      stalled-cycles-frontend                                       (50.01%)
                 0      stalled-cycles-backend    #    0.00% backend cycles idle      (50.01%)
   160,806,960,842      instructions              #    1.77  insn per cycle                                              (50.00%)
    16,118,385,897      branches                  #  227.874 M/sec                    (50.00%)
         5,818,878      branch-misses             #    0.04% of all branches          (49.99%)

      70.839631335 seconds time elapsed

...And for sizes from 12940 to 30000:

 Performance counter stats for './a.out':

     157468.056544      task-clock (msec)         #    0.999 CPUs utilized          
            13,420      context-switches          #    0.085 K/sec                  
                10      cpu-migrations            #    0.000 K/sec                  
            32,271      page-faults               #    0.205 K/sec                  
   201,597,569,555      cycles                    #    1.280 GHz                      (50.00%)
                 0      stalled-cycles-frontend                                       (50.00%)
                 0      stalled-cycles-backend    #    0.00% backend cycles idle      (50.00%)
   351,405,781,351      instructions              #    1.74  insn per cycle                                              (50.00%)
    35,289,858,166      branches                  #  224.108 M/sec                    (50.00%)
        10,494,566      branch-misses             #    0.03% of all branches          (50.00%)

     157.692170504 seconds time elapsed

The program ran for about twice as long, so roughly twice as many context switches aren't surprising. In this range, however, I'm getting disproportionately more page faults (32,271 vs 3,866).

Another interesting thing: when I restarted the program for a second run, the second array didn't move around in memory nearly as much. It seems to start doing that only after the program has been running for a while.

The longer I'm testing things, the less I know ;_;

  • Considering that the usual size for `double` is 8 bytes, then 10000 `double` elements would be 80000 bytes, not 16KiB. You're off by a factor of five. Not that it really matters though, not on anything remotely modern and non-embedded, as it's still a drop in the ocean. – Some programmer dude Dec 13 '17 at 10:31
  • You may also find this interesting: https://en.wikipedia.org/wiki/FLOPS – dsp_user Dec 13 '17 at 11:04
  • I added the code. A double is 8 bytes, right, but I'm using two arrays at a time, so together I need about 16kiB. Also, knowing the exact derivative allows me to check the errors of my calculations, and that's why I'm using sin(x). As for benchmarking, I get the time, repeat the calculations multiple times, get the current time and calculate the average (total time / number of repeats) - the standard way of doing things. – crueltear Dec 13 '17 at 11:29
  • How many times do you repeat the loop? It could be just a context switch if the timing is too short. It could also be the cost of the hardware auto-prefetching data that is unused. – Liran Funaro Dec 13 '17 at 12:12
  • FWIW, 10752 elements == 86016 B == 21 pages exactly. Not sure of the relevance of that specific number, but I imagine it's a clue. – Oliver Charlesworth Dec 13 '17 at 12:15
  • @PaulR - "False sharing" is generally associated with readers/writers on multiple cores interacting badly with cache coherence (which isn't the case here). I think what you're describing is some form of aliasing/resonance. – Oliver Charlesworth Dec 13 '17 at 12:39
  • @OliverCharlesworth: you are right, of course, I was thinking of a problem like this: https://stackoverflow.com/a/8547993 – PaulR Dec 13 '17 at 12:44
  • Given the peaks are all at exact multiples of page size, I suspect that pathological aliasing of the two arrays is the cause. The only real mystery is why this behaviour is seemingly masked below 10752. – Oliver Charlesworth Dec 13 '17 at 12:56
  • It has something to do with address alignment. If I align the addresses to the page size, then I can produce some weird behavior. Not exactly yours, but there is some irregularity: the time grows linearly until 8692 elements (4.4 microseconds), then it jumps to 6 microseconds until 8830, then falls back to 4.4 microseconds. And there are irregularities at other sizes as well. (Maybe the two arrays don't need to be page aligned, but just have the same alignment within a page?) – geza Dec 13 '17 at 13:33
  • @OliverCharlesworth For smaller arrays their addresses are always the same, regardless of the size. That might be the reason why there are no peaks. I don't know, however, what's so special about 10000 elements from the computer's perspective. – crueltear Dec 13 '17 at 13:34
  • And it seems that it has something to do with SIMD too. If I disable it, I don't get irregularities. Call Mulder & Scully :) – geza Dec 13 '17 at 13:47
  • @crueltear - Ok, well that explains it then. Overall, the behaviour is being governed by the whims of the memory allocation algorithm (nothing to do with HW, presumably). Beyond 10k, you're seeing pathological aliasing every time your arrays exactly align with pages. – Oliver Charlesworth Dec 13 '17 at 13:48
  • What happens if you perform a warm-up loop? (That is, move the r=0 cycle of the innermost loop before clock() is taken.) Or, alternatively, if you increase REPEATS? – Massimiliano Janes Dec 13 '17 at 13:56
  • @MassimilianoJanes - Bear in mind that there's an initialisation loop that runs beforehand, which ensures the entire working set is in L3. – Oliver Charlesworth Dec 13 '17 at 14:13
  • @geza: Now that's weird, as I **don't** get peaks when using SIMD. I suggest calling an exorcist. – crueltear Dec 13 '17 at 17:58
  • @OliverCharlesworth Sounds reasonable, but it seems that those peaks are somehow related to the hardware. I'm getting them on Win7/Debian on the Sempron 3850, and everything works nicely on (the same?) Win7 on the i5 5200U. I guess the OS is mainly responsible for memory allocation? – crueltear Dec 13 '17 at 18:07
  • Yeah indeed. My belief is the spikes are down to hardware behaviour, but the behaviour below 10k is simply down to the allocator. – Oliver Charlesworth Dec 14 '17 at 00:52
  • (Moved from the answer) "Looks to me like it may be related to the heap allocation in the loop: vals = new double[SIZE]; derivs = new double[SIZE]; Why not allocate all the required memory (100000u doubles) once, before all the calculations, and reuse the same array?" "Allocating a single array of 50000 doubles seems to remove all peaks, and times stay at the lower values. Seems that there is a problem when allocating arrays of specific sizes." – crueltear – Spock77 Dec 16 '17 at 12:17
