I made a program that allocates two arrays of C++ doubles. The first contains the values of sin(x) for x from 0 to pi. Into the second one I write the second derivative of sin(x), calculated using the three-point formula:
(vals[i-1]-2*vals[i]+vals[i+1])/(dx*dx)
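(For completeness: that's the standard second-order central difference. A Taylor expansion, included here just for reference, gives the leading truncation error:

$$ f''(x) = \frac{f(x-h) - 2f(x) + f(x+h)}{h^2} - \frac{h^2}{12}\,f^{(4)}(x) + O(h^4) $$

so the truncation error shrinks like h^2 while rounding error from the subtraction grows like 1/h^2, which is the trade-off behind the error column in the output further below.)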
I perform those calculations for arrays of different sizes, repeat each one a few times, and output the average time per calculation. That way I get a nice graph showing the time needed to calculate the derivatives for an array of a given size.
It all sounds fairly easy, but I've encountered a weird problem. Take a look at the graph:
When the arrays are small, nothing weird happens. However, once they are larger than 10000 elements (with two arrays, that means about 160 kB of memory), I get periodic peaks, each about 512 elements after the previous one. This happens on several CPUs (AMD Sempron 3850, Intel Q9400 and Intel Xeon 5310) and doesn't happen on others (Intel i5 5200U, Intel Nehalem-C).
If that wasn't enough, when the arrays reach about 65'000 elements, the AMD 3850 on Windows 7 shows a sudden increase in time. This doesn't happen on Debian.
I think it might be related to CPU caching, given the 'magical' element counts, and perhaps to the OS's scheduling, but I can't come up with a specific explanation or a way to confirm it (other than profiling the CPU cache hit/miss ratio).
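For what it's worth, that ratio can be sampled without a full profiler; something like this perf invocation should do on Linux (these are the generic event names, which may map differently on different CPUs, so treat it as a sketch):

perf stat -e cache-references,cache-misses,L1-dcache-loads,L1-dcache-load-misses ./a.out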
Code:
#include <cmath>
#include <cstdio>

// msclock() is defined per platform (see below) and returns seconds.
double msclock();

int main(int, char**)
{
    double maxerr=0.0;
    double avgerr=0.0;
    for(unsigned SIZE=5000u; SIZE<100000u; SIZE+=1u)
    {
        // Repeat small arrays more often, so every SIZE does roughly
        // the same total amount of work (~10^7 stencil updates).
        const unsigned REPEATS=10000000/SIZE+1;
        // atan(1)*4 == pi, so x covers [0, pi] in SIZE samples.
        const double STEP=1.0/(SIZE-1.0)*atan(1.0)*4.0;
        double *vals;
        double *derivs;
        // Allocation
        vals=new double[SIZE];
        derivs=new double[SIZE];
        // Initialization
        for(unsigned i=0u; i<SIZE; ++i)
        {
            vals[i]=sin(i*STEP);
            derivs[i]=0.0;
        }
        // Timed computation
        const double TIME_FPU_START=msclock();
        for(unsigned r=0u; r<REPEATS; ++r)
        {
            const double STEP2RCP=1.0/(STEP*STEP);
            for(unsigned i=1u; i<SIZE-1u; ++i)
            {
                derivs[i]=(vals[i-1u]-2.0*vals[i]+vals[i+1u])*STEP2RCP;
            }
        }
        const double TIME_FPU_END=msclock();
        const double TIME_FPU=(TIME_FPU_END-TIME_FPU_START)/REPEATS;
        // Accuracy check: the exact second derivative of sin(x) is -sin(x).
        maxerr=0.0;
        avgerr=0.0;
        for(unsigned i=1u; i<SIZE-1u; ++i)
        {
            double err=fabs((derivs[i]+sin(i*STEP))/(-sin(i*STEP)));
            avgerr+=err/(SIZE-2);
            if(err>maxerr)
            {
                //printf("%f > %f\n", derivs[i], -sin(i*STEP));
                maxerr=err;
            }
        }
        const double MAX_REL_ERR_FPU=maxerr;
        const double AVG_REL_ERR_FPU=avgerr;
        // Summary
        fprintf(stdout, "%u %.8f %.8f\n", SIZE, TIME_FPU, AVG_REL_ERR_FPU);
        fprintf(stderr, "%u %.8f %.8f\n", SIZE, TIME_FPU, AVG_REL_ERR_FPU);
        // Cleanup
        delete [] vals;
        delete [] derivs;
    }
    return 0;
}
The "msclock" function is plain old clock()/CLOCKS_PER_SEC on linux, and QueryPerformanceTimer on Windows (clock() and all its variants on Windows seem to have resolution of about 16ms.
Data file: https://pastebin.com/4iDn5Y9k The first column is the array size, the second the average time, and the last the average relative error of the derivative.
Additional info: I've checked the addresses of the arrays I'm getting. The first one (vals) always stays at the same address, while the second one (derivs) keeps moving upwards. This probably means that the beginning of the second array lands in the same block as the end of the first. Since 512 doubles are exactly 512 × 8 = 4096 bytes, i.e. one 4 kiB page, growing by 512 elements might push the arrays onto another page.
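One way to test this would be to log the page offsets directly. A quick sketch of the kind of check I have in mind (log_page_offsets is just an illustrative helper; 4096 is hard-coded as the x86 page size):

#include <cstdint>
#include <cstdio>

// Print where each array starts inside its 4 kiB page, and the
// relative offset between the two arrays modulo one page.
void log_page_offsets(const double *vals, const double *derivs)
{
    const uintptr_t a=(uintptr_t)vals;
    const uintptr_t b=(uintptr_t)derivs;
    printf("vals%%4k=%llu derivs%%4k=%llu rel%%4k=%llu\n",
           (unsigned long long)(a%4096),
           (unsigned long long)(b%4096),
           (unsigned long long)((b-a)%4096));
}

If the peaks line up with particular relative offsets, that would point at page or cache-set alignment rather than at total size.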
Additional info: On Debian, the peaks are closely related to the addresses of the arrays (columns 4 and 5):
50670 0.00021090 0.00000016 0x55efa5e04c30 0x55efa5e67bb0
50680 0.00021244 0.00000017 0x55efa5e04c30 0x55efa5e67c00
50690 0.00170968 0.00000023 0x7f1617d0b010 0x7f1617ca7010 @
50700 0.00021141 0.00000024 0x55efa5e04c30 0x55efa5e67ca0 @
50710 0.00021873 0.00000027 0x55efa5e04c30 0x55efa5e67cf0
50720 0.00021126 0.00000002 0x55efa5e04c30 0x55efa5e67d40
Additional info: For arrays smaller than 10'000 elements, both arrays always end up at the same addresses. For larger ones, the second array keeps moving around. That might be why I'm not getting peaks for small sizes.
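Side note: the slow row above is also the one where both addresses start with 0x7f, which on Linux usually means the memory came from a fresh mmap rather than the normal heap, and glibc is known to adjust its mmap threshold dynamically while the program runs. If that's the trigger, pinning the threshold should make the peaks disappear; a glibc-specific sketch:

#include <malloc.h>

int main(int, char**)
{
    // Keep even large allocations on the heap and disable glibc's
    // dynamic adjustment of the mmap threshold.
    mallopt(M_MMAP_THRESHOLD, 1<<30);
    // ... rest of the benchmark unchanged ...
    return 0;
}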
I've also played around with "perf stat", as @geza suggested. First, the output for sizes from 5000 to 12930:
Performance counter stats for './a.out':
70733.635707 task-clock (msec) # 0.999 CPUs utilized
6,008 context-switches # 0.085 K/sec
3 cpu-migrations # 0.000 K/sec
3,866 page-faults # 0.055 K/sec
90,800,323,159 cycles # 1.284 GHz (50.00%)
0 stalled-cycles-frontend (50.01%)
0 stalled-cycles-backend # 0.00% backend cycles idle (50.01%)
160,806,960,842 instructions # 1.77 insn per cycle (50.00%)
16,118,385,897 branches # 227.874 M/sec (50.00%)
5,818,878 branch-misses # 0.04% of all branches (49.99%)
70.839631335 seconds time elapsed
...And for sizes from 12940 to 30000:
Performance counter stats for './a.out':
157468.056544 task-clock (msec) # 0.999 CPUs utilized
13,420 context-switches # 0.085 K/sec
10 cpu-migrations # 0.000 K/sec
32,271 page-faults # 0.205 K/sec
201,597,569,555 cycles # 1.280 GHz (50.00%)
0 stalled-cycles-frontend (50.00%)
0 stalled-cycles-backend # 0.00% backend cycles idle (50.00%)
351,405,781,351 instructions # 1.74 insn per cycle (50.00%)
35,289,858,166 branches # 224.108 M/sec (50.00%)
10,494,566 branch-misses # 0.03% of all branches (50.00%)
157.692170504 seconds time elapsed
The program ran for twice as long, so twice as many context switches aren't surprising. On this range, however, I'm getting roughly eight times as many page faults.
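It might be worth isolating those with explicit fault and TLB counters next time; again only a sketch, since the dTLB event names vary between CPUs:

perf stat -e page-faults,minor-faults,major-faults,dTLB-load-misses ./a.out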
Another interesting thing: when I restarted the program for a second run, the second array didn't move around the memory as much. It seems it only starts doing that after the program has been running for a while.
The longer I'm testing things, the less I know ;_;