How is an addition operation speeding up this memory access?

Question

I am looking at the following code and I noticed something strange when timing its performance.

For the record, I am doing this in Visual Studio 2010, Windows 7 x64, -O2 optimization on, and in Release mode. My processor is an Intel i5.

There is a section in code where memory gets written to. I used to do it this way:

d_res_matrix[x][y] = a;

In this case, executing the entire program takes about 2.3s. I was mucking about with the code trying to make it faster, and I did this:

d_res_matrix[x][y] = a + 0.00000001;

which executes in 0.4s! That is a huge difference, but I am not sure why that would happen.

To me it would make sense if it was slower since the extra addition operation takes time. I guess my alternative hypothesis would be that doing the addition somehow forces the compiler to SIMD this operation (fetch, add and write?). Maybe the write otherwise stalls the pipeline but this manages to prevent that? Any ideas?

Edit (Apr 6, 6:19): The issue is the same on my home computer (Visual Studio 2012).

Edit (Apr 6, 6:38): The issue also exists in Visual Studio 2008 (-O2, Release). In Debug they are both slow, but same slowness.

Edit (Apr 8, 1:28): I installed the Intel Parallel Studio XE (I'm a student), and it showed me lots of good stuff - for one, I never actually deleted the arrays that I declared (I'm not fixing it now, but be warned). However, freeing the memory didn't actually solve anything. As Richard outlines in the answer, the entire issue was caused by denormal floating point values (see more information here). The FP units cannot handle the denormal values properly and microcode sequences are launched instead, which are very slow.

#include <time.h>
#include <stdio.h>
#include <cstdlib>
#include <stdlib.h>

#define DIM 1000
#define ITERATIONS 100

#define CPU_START clock_t t1; t1=clock();
#define CPU_END {long int final=clock()-t1; printf("CPU took %li ticks (%f seconds) \n", final, ((float)final)/CLOCKS_PER_SEC);}

int main(void)
{
    double ** d_matrix, ** d_res_matrix;

    d_res_matrix = new double * [DIM];
    d_matrix = new double * [DIM];

    for (int i = 0; i < DIM; i++)
    {
        d_matrix[i] = new double [DIM];
        d_res_matrix[i] = new double[DIM];
    }

    d_matrix[20][45] = 1; // start somewhere

    double f0, f1, f2, f3, f4;

    CPU_START;

    for (int iter = 0; iter < ITERATIONS; iter++)
    {
        for (int x = 1; x < DIM-1; x++) // avoid boundary cases for this example
        {
            for (int y = 1; y < DIM-1; y++)
            {
                f0 = d_matrix[x][y];
                f1 = d_matrix[x-1][y];
                f2 = d_matrix[x+1][y];
                f3 = d_matrix[x][y-1];
                f4 = d_matrix[x][y+1];

                double a = f0*0.6 + f1*0.1 + f2*0.1 + f3*0.1 + f4*0.1;

                // THIS PART IS INTERESTING:
                //d_res_matrix[x][y] = a;
                d_res_matrix[x][y] = a + 0.000000001;

            }
        }
        for (int x = 1; x < DIM-1; x++)
        {
            for (int y = 1; y < DIM-1; y++)
            {
                d_matrix[x][y] = d_res_matrix[x][y];
            }
        }
    }
    CPU_END; 

    return 0;
}

Here are some screenshots of the output to show that this isn't a one time occurrence: NO MORE SCREENSHOTS :D :D :D :D :D Here is some text instead!

no addition:

CPU took 3585 ticks <3.585000 seconds>
CPU took 3592 ticks <3.592000 seconds>
CPU took 3430 ticks <3.430000 seconds>
CPU took 2032 ticks <2.032000 seconds>
CPU took 3117 ticks <3.117000 seconds>
CPU took 2050 ticks <2.050000 seconds>
CPU took 3266 ticks <3.266000 seconds>
CPU took 3394 ticks <3.394000 seconds>
CPU took 3446 ticks <3.446000 seconds>
CPU took 3131 ticks <3.131000 seconds>

with addition:

CPU took 430 ticks <0.430000 seconds>
CPU took 428 ticks <0.428000 seconds>
CPU took 470 ticks <0.470000 seconds>
CPU took 470 ticks <0.470000 seconds>
CPU took 470 ticks <0.470000 seconds>
CPU took 470 ticks <0.470000 seconds>
CPU took 460 ticks <0.460000 seconds>
CPU took 471 ticks <0.471000 seconds>
CPU took 471 ticks <0.471000 seconds>
CPU took 460 ticks <0.460000 seconds>

Pretty much the same [Addition](http://coliru.stacked-crooked.com/a/146a28de519a4d9c), and [No Addition](http://coliru.stacked-crooked.com/a/95544be5f26a1177). I suspect the first one you ran was slower, and the second was faster, regardless of the code. — Mooing Duck, Apr 06 '15 at 22:53
@MooingDuck, your links don't seem to work for me (they are "loading" for what seems like ever). Not sure about the rest of your response. Do you mean that I ran it twice and just by luck the second time was faster? If so, I've run it many times by now and the case with addition is consistently faster. I can upload a collage of screenshots if you'd like. — Mewa, Apr 06 '15 at 22:57
"Does this happen on your computer too?" is an opinion poll and not a valid question. Fundamentally, assuming that you didn't cock anything up and this is a real thing, it's clearly an edge case/bug in the MSVC10 optimizer and nothing more. — Puppy, Apr 06 '15 at 23:13
@Puppy, that is not the only thing I asked. There are other sentences in my post. — Mewa, Apr 06 '15 at 23:15
@Mewa: if people are misunderstanding what the question is, maybe you should edit the post to make it clear _exactly_ what the question is. Also, remove the opinion poll. (The Coliru website is slow atm, sorry about that. but it shows the same times for both in G++ plus or minus 10%) — Mooing Duck, Apr 06 '15 at 23:18
@MooingDuck, well you got the question correctly. SO is so trigger-happy with the close votes though, sheesh. I provided code, screenshots of the issue, tested it on two different computers (different MSVC versions) and it's still not enough to make people happy. It may be somehow obvious to some people that this is an edge case, but it is not to me. — Mewa, Apr 07 '15 at 00:18
@Mewa: Keep in mind: SO is not a help desk. We're not here to help you. We're here to build a FAQ. If a question cannot be reasonably answered as defined in the [help](http://stackoverflow.com/help/asking), then it doesn't belong on the site. However, you appear to now have enough information that we can make a proper examination and come up with reasonable answers. — Mooing Duck, Apr 07 '15 at 00:21
Posting an image consisting of 20 impossible to read images is simply wrong. Have you not learned how to mark and copy text from the command window? Images should only be used when absolutely necessary to illustrate a point that can not be explained in text. They're difficult to read (especially from a mobile device), often can't be accessed from behind corporate proxy servers, and have a financial impact on users who pay for data use. — Ken White, Apr 07 '15 at 00:53
@KenWhite, if you right click and open it in a new tab, it is perfectly easy to read. I **have** learned how to copy paste actually, how considerate of you to question my education! The image was proof that this was not a one time occurrence. I converted it to text just for you :) — Mewa, Apr 07 '15 at 01:34
I did. And your comment isn't very constructive in the least (you're very condescending). I already read what you said, so why repeat it again? I already changed the post to suit your needs better. PS: To right-click/view in new tab on mobile, you can just press the image and hold your finger until a menu pops up, then select "Open in new tab". — Mewa, Apr 07 '15 at 01:44
People don't need images to believe you. A clear textual output is enough. And you may need to check the assembly output — phuclv, Apr 07 '15 at 03:47
@LưuVĩnhPhúc, assembly output was the same in both cases. It was **not** an "edge case/bug" as Puppy suggested. Also I'm surprised noone noticed (and got upset over) me not delete[]ing the arrays. — Mewa, Apr 08 '15 at 19:25

score 5 · Accepted Answer · answered Apr 08 '15 at 18:52

5

You're probably generating denormals on the first run through, which the addition will avoid. Subsequent operations on these denormals could be extremely expensize. In debug mode, your data will be 0 initialized, but this won't happen in release, so the values you're operating on could be anything. I'd be surprised if you still see this behavior if explicitly memset d_matrix to 0 after it's allocated.

answered Apr 08 '15 at 18:52

Richard Byron

321
2
4

Thanks Richard! That was it :D I was in the process of profiling my code using VTune, and noticed an awful lot of FP_ASSIST events on inputs were happening (about 200 million). I traced this over to an explanation on the Intel website ([here](https://software.intel.com/sites/products/documentation/doclib/iss/2013/amplifier/lin/ug_docs/GUID-3163CF43-59E1-465A-9507-9D3D957D9652.htm)), which mentions denormal floating point values as the cause for significant performance loss. I hadn't linked the two together yet, but now it makes perfect sense. – Mewa Apr 08 '15 at 19:20
so it's basically the same like this http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x I didn't noticed that the OP hasn't initialized the values – phuclv Apr 09 '15 at 03:33

How is an addition operation speeding up this memory access?

1 Answers1