
I am performing a stencil computation on a matrix I previously read from a file. I use two different kinds of matrices (NonZero type and Zero type). Both types share the same boundary values (usually 1000), whilst the rest of the elements are 0 for the Zero type and 1 for the NonZero type.

The code stores the matrix from the file in two allocated matrices of the same size. Then it performs an operation on every element of one matrix using its own value and the values of its neighbours (4 additions and 1 multiplication), and stores the result in the second matrix. Once the computation is finished, the matrix pointers are swapped and the same operation is performed a fixed number of times. Here is the core code:

/* rMat is the matrix read from, wMat the matrix written to; both are
   float buffers of rows*cols elements allocated beforehand. */
#define GET(I,J) rMat[(I)*cols + (J)]
#define PUT(I,J) wMat[(I)*cols + (J)]

for (cur_time=0; cur_time<timeSteps; cur_time++) {
    for (i=1; i<rows-1; i++) {
        for (j=1; j<cols-1; j++) {
            PUT(i,j) = 0.2f*(GET(i-1,j) + GET(i,j-1) + GET(i,j) + GET(i,j+1) + GET(i+1,j));
        }
    }
    // Change pointers for next iteration
    auxP = wMat;
    wMat = rMat;
    rMat = auxP;
}

The case I am presenting uses a fixed amount of 500 timeSteps (outer iterations) and a matrix of 8192 rows by 8192 columns, but the problem persists when changing the number of timeSteps or the matrix size. Note that I only measure the time of this specific part of the algorithm, so neither reading the matrix from the file nor anything else affects the time measurement.

What happens is that I get different times depending on which type of matrix I use, with much worse performance when using the Zero type (every other matrix performs the same as the NonZero type; I have already tried generating a matrix full of random values).

I am certain it is the multiplication operation: if I remove it and leave only the additions, both types perform the same. Note that with the Zero matrix type, most of the time the result of the sum will be 0, so the operation will be "0.2*0".

This behaviour is certainly weird to me, as I thought that the cost of floating point operations was independent of the values of the operands, which does not look like the case here. I have also tried to catch and show SIGFPE exceptions in case that was the problem, but I obtained no results.

In case it helps, I am using an Intel Nehalem processor and gcc 4.4.3.

Nitros
  • Are you testing this with hard-coded data such that the compiler can _at compile time_ see the values in the matrix and make an inlined version of your code optimised for that fixed data? If it knows that an element is zero it can likely optimise out the multiply completely. – jcoder Mar 03 '11 at 11:40
  • Does it improve if you use `double` instead of `float` (for constants -- the `0.2f` -- and matrix values)? – pmg Mar 03 '11 at 11:50
  • How is the zero matrix initialized? In particular, are the zeros true zeros or just very, very small values that appear to be zero? Computations with subnormals (non-zero values whose absolute value is smaller than FLT_MIN) are well known to often be slower than with normalized values. – AProgrammer Mar 03 '11 at 12:00
  • I think your main problem is cache pressure. Each of these matrices is 8192^2 * sizeof(float) large. That's well beyond L2, not to speak of L1 cache size. You should change your algorithm so that it operates on chunks of about 8k of data. I'd also try to process those values using SIMD instructions. This looks like a prime example for using a MAD (Multiply-Add) instruction. – datenwolf Mar 03 '11 at 12:30
  • @JohnB Data is not hard-coded, I read it from files. I have also checked the assembly produced with objdump -S, and the code of the inner loop looks pretty clear: 4 addss and 1 mulss, plus the pointer updates for the next iteration; I could not have done it better in assembly. @pmg It surprisingly improves the result for the Zero type matrix and makes the NonZero type perform worse than with floats, but the NonZero type still performs better than the Zero type. – Nitros Mar 03 '11 at 12:43
  • @AProgrammer What you say makes a lot of sense. Due to the nature of the algorithm, the values on the boundaries spread towards the centre. Initially they are large enough, but after a few iterations the newly spread values get closer to 0. This doesn't happen when I fill the matrix with NonZero values like 1, as the result of the operation is then at least 1. @datenwolf I know the problem with the algorithm is that it is memory bound; in fact I apply several optimisations to the code to improve performance. But this problem with floating point operations is independent of memory. – Nitros Mar 03 '11 at 12:52
  • @AProgrammer It worked: I have just changed 0.2f to 1.2f (so the result of the operation never gives a value below FLT_MIN), and then both matrices gave exactly the same time. Now I will have to find a source to explain that behaviour, but if I cannot find one, I can always use the experimental result obtained by changing the constant as evidence. Thank you all very much. – Nitros Mar 03 '11 at 13:15
  • @AProgrammer, you should add your comment as an answer so it can be accepted. – AShelly Mar 03 '11 at 18:41

2 Answers


The problem has already mostly been diagnosed, but I will write up exactly what happens here.

Essentially, the questioner is modeling diffusion; an initial quantity on the boundary diffuses into the entirety of a large grid. At each time step t, the value at the leading edge of the diffusion will be 0.2^t (ignoring effects at the corners).

The smallest normalized single-precision value is 2^-126; when cur_time = 55, the value at the frontier of the diffusion is 0.2^55, which is a bit smaller than 2^-127. From this time step forward, some of the cells in the grid will contain denormal values. On the questioner's Nehalem, operations on denormal data are about 100 times slower than the same operation on normalized floating point data, explaining the slowdown.
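
To see the effect in isolation, here is a small self-contained timing sketch (not from the original answer; the exact slowdown factor varies by microarchitecture). It performs the same 0.2f multiply once with a normal operand and once with a subnormal one. Compile without -ffast-math, which typically enables flush-to-zero at program start and hides the effect.

#include <stdio.h>
#include <time.h>

/* Sketch: time the same multiply on a normal operand and on a subnormal one.
   1e-40f is below FLT_MIN (about 1.18e-38), so both the input and the
   0.2f*input result are subnormal in that case. */
static double time_multiplies(float value)
{
    volatile float in = value, out;   /* volatile keeps the loop from being optimised away */
    long i;
    clock_t start = clock();
    for (i = 0; i < 100000000L; i++)
        out = 0.2f * in;              /* same operation as in the stencil */
    (void)out;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}

int main(void)
{
    printf("normal operand:    %.2f s\n", time_multiplies(1.0f));
    printf("subnormal operand: %.2f s\n", time_multiplies(1e-40f));
    return 0;
}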

When the grid is initially filled with constant data of 1.0, the data never gets too small, and so the denormal stall is avoided.

Note that changing the data type to double would delay, but not alleviate the issue. If double precision is used for the computation, denormal values (now smaller than 2^-1022) will first arise in the 441st iteration.

At the cost of precision at the leading edge of the diffusion, you could fix the slowdown by enabling "Flush to Zero", which causes the processor to produce zero instead of denormal results in arithmetic operations. This is done by toggling a bit in the FPSCR or MXCSR, preferably via the functions defined in the <fenv.h> header in the C library.
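
For illustration, here is a minimal way to set those bits using the SSE control-register intrinsics (a sketch, not part of the original answer; it assumes x86 with SSE code generation, and with older gcc the pmmintrin.h header may require compiling with -msse3):

#include <xmmintrin.h>   /* _MM_SET_FLUSH_ZERO_MODE */
#include <pmmintrin.h>   /* _MM_SET_DENORMALS_ZERO_MODE */

/* Call once per thread, before the stencil loop. FTZ flushes subnormal
   results to zero; DAZ treats subnormal inputs as zero. Both trade
   accuracy near zero for speed. */
void enable_flush_to_zero(void)
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}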

Another (hackier, less good) "fix" would be to fill the matrix initially with very small non-zero values (0x1.0p-126f, the smallest normal number). This would also prevent denormals from arising in the computation.
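	
For instance, reusing the rMat/rows/cols names from the question (a sketch of this workaround, not code from the original answer), the interior could be initialised like this:

/* Fill interior cells with the smallest normal float (0x1.0p-126f, i.e.
   FLT_MIN) instead of 0.0f; boundary cells keep their 1000.0f value. */
for (i = 1; i < rows-1; i++)
    for (j = 1; j < cols-1; j++)
        rMat[i*cols + j] = 0x1.0p-126f;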

Stephen Canon
  • Should work too: `double FlushToZero(double x){return (x < 2e-126)?0:x;}` and call it: `y = FlushToZero(0.2 * ...);`. Should contain no overhead with proper inlining of the compiler. – Xeo Mar 03 '11 at 21:37
  • @Xeo: I suspect most compilers will (rightly or wrongly) compile that into a compare and branch, which will introduce overhead, but it's still a reasonable suggestion. Actually enabling the FTZ bit is the only way to do it with zero overhead. Also, the constant should be `0x1.0p-126f`, not `2e-126` (a rather different value). – Stephen Canon Mar 03 '11 at 21:41
  • Woops, you're right with the constant, but my 5min for edits have passed. :| Fun fact: `0x1.0p` is `3,1415926535897932384626433832795`. Found that while pasting `0x1.0p-126` into Windows Calc. – Xeo Mar 03 '11 at 21:47
  • But what is the scope of this setting (in <fenv.h>)? Can it be turned off and on around specific calculations whose near-zero accuracy isn't important? And what is the processor behaviour in other threads of my process when they perform such calculations? – Motti Shneor Dec 28 '15 at 22:46
  • @MottiShneor: The setting is per-thread. If the near-zero accuracy is unimportant, there's no reason to bother turning it off, because that's all it affects. – Stephen Canon Dec 28 '15 at 22:49

Maybe your ZeroMatrix uses the typical storage scheme for sparse matrices: store every non-zero value in a linked list. If that is the case, it is quite understandable why it performs worse than a typical array-based storage scheme: it needs to run through the linked list once for every operation you perform. In that case you can maybe speed the process up by using a matrix-multiply algorithm that accounts for the matrix being sparse. If this is not the case, please post minimal but complete code so we can play with it.
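
For reference, a hypothetical sketch (not from the question) of the kind of linked-list storage this answer is describing; note that a single element access has to walk the row's list instead of indexing an array directly:

/* Hypothetical linked-list row storage: each row keeps only its non-zero
   entries, so reading one element is O(non-zeros in the row). */
typedef struct entry {
    int col;
    float value;
    struct entry *next;
} entry;

float sparse_get(const entry *row, int col)
{
    const entry *e;
    for (e = row; e != NULL; e = e->next)
        if (e->col == col)
            return e->value;
    return 0.0f;   /* entries not stored are implicit zeros */
}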

Here is one possible approach for multiplying sparse matrices efficiently:

http://www.cs.cmu.edu/~scandal/cacm/node9.html

Bernd Elkemann