I am looking at the following code and I noticed something strange when timing its performance.
For the record, I am doing this in Visual Studio 2010, Windows 7 x64, -O2 optimization on, and in Release mode. My processor is an Intel i5.
There is a section in the code where memory gets written to. I used to do it this way:
d_res_matrix[x][y] = a;
In this case, executing the entire program takes about 2.3s. I was mucking about with the code trying to make it faster, and I did this:
d_res_matrix[x][y] = a + 0.000000001;
which executes in 0.4s! That is a huge difference, but I am not sure why that would happen.
To me it would make sense if it was slower since the extra addition operation takes time. I guess my alternative hypothesis would be that doing the addition somehow forces the compiler to SIMD this operation (fetch, add and write?). Maybe the write otherwise stalls the pipeline but this manages to prevent that? Any ideas?
Edit (Apr 6, 6:19): The issue is the same on my home computer (Visual Studio 2012).
Edit (Apr 6, 6:38): The issue also exists in Visual Studio 2008 (-O2, Release). In Debug, both versions are equally slow.
Edit (Apr 8, 1:28): I installed Intel Parallel Studio XE (I'm a student), and it showed me lots of good stuff. For one, I never actually deleted the arrays that I declared (I'm not fixing it now, but be warned). However, freeing the memory didn't solve anything. As Richard outlines in the answer, the entire issue was caused by denormal floating-point values (see more information here). The FP units cannot handle denormal values directly, so very slow microcode sequences are launched instead.
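To make the denormal explanation concrete, here is a minimal sketch of how one might check whether a value is denormal and why adding a small constant avoids it. The helper names `is_denormal`, `stencil`, and `demo` are hypothetical (they are not part of the program in the question), and the sketch assumes the compiler has not enabled flush-to-zero behaviour (for example through fast-math options).

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// True if v is nonzero but smaller in magnitude than the smallest
// normal double, i.e. a denormal (subnormal) value.
bool is_denormal(double v) {
    return std::fpclassify(v) == FP_SUBNORMAL;
}

// Hypothetical helper: the same weighted sum the question applies to a
// cell (f0) and its four neighbours (f1..f4).
double stencil(double f0, double f1, double f2, double f3, double f4) {
    return f0 * 0.6 + f1 * 0.1 + f2 * 0.1 + f3 * 0.1 + f4 * 0.1;
}

// Illustrates the effect: denormal inputs give a denormal result,
// while adding the small constant pushes the result into the normal range.
void demo() {
    const double tiny = std::numeric_limits<double>::denorm_min() * 1000.0;
    assert(is_denormal(tiny));

    const double a = stencil(tiny, tiny, tiny, tiny, tiny);
    assert(is_denormal(a));                 // the "slow" variant stores this
    assert(!is_denormal(a + 0.000000001));  // the "fast" variant stores this
}
```

Calling `demo()` from a test `main` should pass silently under these assumptions. It only shows the classification of the values, not the timing difference itself.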
#include <time.h>
#include <stdio.h>
#include <cstdlib>
#include <stdlib.h>

#define DIM 1000
#define ITERATIONS 100
#define CPU_START clock_t t1; t1=clock();
#define CPU_END {long int final=clock()-t1; printf("CPU took %li ticks (%f seconds) \n", final, ((float)final)/CLOCKS_PER_SEC);}

int main(void)
{
    double ** d_matrix, ** d_res_matrix;
    d_res_matrix = new double * [DIM];
    d_matrix = new double * [DIM];
    for (int i = 0; i < DIM; i++)
    {
        d_matrix[i] = new double [DIM];
        d_res_matrix[i] = new double[DIM];
    }

    d_matrix[20][45] = 1; // start somewhere
    double f0, f1, f2, f3, f4;

    CPU_START;
    for (int iter = 0; iter < ITERATIONS; iter++)
    {
        for (int x = 1; x < DIM-1; x++) // avoid boundary cases for this example
        {
            for (int y = 1; y < DIM-1; y++)
            {
                f0 = d_matrix[x][y];
                f1 = d_matrix[x-1][y];
                f2 = d_matrix[x+1][y];
                f3 = d_matrix[x][y-1];
                f4 = d_matrix[x][y+1];
                double a = f0*0.6 + f1*0.1 + f2*0.1 + f3*0.1 + f4*0.1;
                // THIS PART IS INTERESTING:
                //d_res_matrix[x][y] = a;
                d_res_matrix[x][y] = a + 0.000000001;
            }
        }
        for (int x = 1; x < DIM-1; x++)
        {
            for (int y = 1; y < DIM-1; y++)
            {
                d_matrix[x][y] = d_res_matrix[x][y];
            }
        }
    }
    CPU_END;
    return 0;
}
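One way to check whether the matrices in the program above actually contain denormal values is to scan them. The following is only a sketch: `count_denormals` is a hypothetical helper that is not part of the original program, and it assumes the matrix is stored as an array of row pointers, as `d_matrix` and `d_res_matrix` are.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: counts the denormal (subnormal) entries in a
// rows x cols matrix stored as an array of row pointers.
std::size_t count_denormals(const double* const* m,
                            std::size_t rows, std::size_t cols) {
    assert(m != nullptr);
    std::size_t count = 0;
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            if (std::fpclassify(m[i][j]) == FP_SUBNORMAL) {
                ++count;
            }
        }
    }
    return count;
}
```

For example, calling `count_denormals(d_matrix, DIM, DIM)` after the timed loop in each variant would show whether the version without the addition leaves denormal values in the matrix.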
Here is the output from several runs, to show that this isn't a one-time occurrence:
no addition:
CPU took 3585 ticks <3.585000 seconds>
CPU took 3592 ticks <3.592000 seconds>
CPU took 3430 ticks <3.430000 seconds>
CPU took 2032 ticks <2.032000 seconds>
CPU took 3117 ticks <3.117000 seconds>
CPU took 2050 ticks <2.050000 seconds>
CPU took 3266 ticks <3.266000 seconds>
CPU took 3394 ticks <3.394000 seconds>
CPU took 3446 ticks <3.446000 seconds>
CPU took 3131 ticks <3.131000 seconds>
with addition:
CPU took 430 ticks <0.430000 seconds>
CPU took 428 ticks <0.428000 seconds>
CPU took 470 ticks <0.470000 seconds>
CPU took 470 ticks <0.470000 seconds>
CPU took 470 ticks <0.470000 seconds>
CPU took 470 ticks <0.470000 seconds>
CPU took 460 ticks <0.460000 seconds>
CPU took 471 ticks <0.471000 seconds>
CPU took 471 ticks <0.471000 seconds>
CPU took 460 ticks <0.460000 seconds>