I have two functions that multiply 2D arrays (matrices). One of them uses SSE; the other has no optimization. Both functions work, but their results differ slightly, for example 20.333334 vs. 20.333332.

Can you explain why the results differ, and what I can do so that both functions produce the same result?

Function with SSE:

float** sse_multiplication(float** array1, float** array2, float** arraycheck)
{
    int i, j, k;
    float *ms1, *ms2, result;
    float *end_loop;

    for( i = 0; i < rows1; i++)
    {
        for( j = 0; j < columns2; j++)
        {
            result = 0;
            ms1 = array1[i];
            ms2 = array2[j];
            end_loop = &array1[i][columns1];

            __asm{
                     mov rax, ms1
                     mov rbx, ms2
                     mov rdx, end_loop
                     xorps xmm2, xmm2

                loop:
                     movups xmm0, [rax]
                     movups xmm1, [rbx]
                     movups xmm3, [rax+16]
                     movups xmm4, [rbx+16]

                     mulps xmm0, xmm1
                     mulps xmm3, xmm4

                     addps xmm2, xmm0

                     add rax, 32
                     add rbx, 32

                     cmp rdx, rax
                     jne loop

                     haddps xmm2, xmm2
                     haddps xmm2, xmm2

                     movups result, xmm2
               }

             arraycheck[i][j] = result;
        }
    }
    return arraycheck;
}

Function without any optimization:

float** multiplication(float** array1, float** array2, float** arraycheck)
{
    for (int i = 0; i < rows1; i++)
        for (int j = 0; j < columns2; j++)
            for (int k = 0; k < rows1; k++)
                arraycheck[i][j] += array1[i][k] * array2[k][j];

    return arraycheck;
}
sav_a
    Please show us how exactly you compile. What compiler do you use? What version? What flags? Also, please choose one of C and C++ for this question. If you are programming in C, choose C. If you are programming in C++, choose C++. Do not tag both. – fuz Apr 09 '16 at 12:42
    If I'm not mistaken, the order of operations is different. In the non-SSE version, you add the products to the result one-by-one, but in the SSE version, you accumulate 4 different sums of products, and then sum those 4 to obtain the result. This might be the reason for the slight difference. – Rostislav Apr 09 '16 at 12:51
  • @FUZxxl, I use intel compiler 15 and compile without flags – sav_a Apr 09 '16 at 12:54
  • @AntonSavich Not even with any optimizer flags? Not even with `-c` to make an object file? I just try to make this question reproducible which it isn't. – fuz Apr 09 '16 at 12:56
  • You should use compiler intrinsics instead of inline assembly. It looks like you compile with clang/gcc, which require you to declare the registers your inline assembly uses, and you don't do that in your assembly block. – Daniel Apr 09 '16 at 13:14
  • You'll get better results from using intrinsics, so the compiler knows what's going on and can optimize everything. It doesn't "understand" the asm block, so it has to save/restore registers around it, etc. – Peter Cordes Apr 10 '16 at 02:58

2 Answers


FP addition is not perfectly associative, so a different order of operations produces slightly different rounding errors.

Your C sums elements in order. (Unless you use -ffast-math to allow the compiler to make the same assumption you did, that FP operations are close enough to associative).

Your asm sums up every 4th element at 4 different offsets, then horizontally sums those. The sum in each vector element is rounded differently at each point.
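
Here's a standalone demo of the effect (illustration only, not the OP's code): regrouping the same float additions changes which intermediate values get rounded.

#include <stdio.h>

int main(void)
{
    float big = 16777216.0f;               /* 2^24: the float ulp here is 2.0f */

    /* One at a time, like the scalar C loop: each +1.0f rounds away */
    float serial  = (big + 1.0f) + 1.0f;   /* 16777216.0f */

    /* Regrouped, like partial sums in separate vector lanes */
    float grouped = big + (1.0f + 1.0f);   /* 16777218.0f */

    printf("%.1f vs %.1f\n", serial, grouped);
    return 0;
}

The OP's two results (20.333334 vs. 20.333332) are about one float ulp apart: exactly this kind of last-bit rounding difference.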


Your vectorized version doesn't seem to match the C version. The indexing looks different. AFAICT, the only sane way to vectorize arraycheck[i][j] += array1[i][k] * array2[k][j]; is over j. Looping over k would require strided loads from array2, and looping over i would require strided loads from array1.

Am I missing something about your asm? It's loading contiguous values from both arrays. It's also throwing away the mulps result in xmm3 on every iteration of the loop (xmm3 is multiplied but never added into xmm2), so I think it's just buggy.

Since looping over j in the inner vector loop doesn't change array1[i][k], just broadcast-load it once outside the loop (_mm256_set1_ps, or _mm_set1_ps for 128-bit SSE).

However, that means doing a read-modify-write of arraycheck[i][j] for every different j value. i.e. ac[i][j + 0..3] = fma(a1[i][k], a2[k][j + 0..3], ac[i][j + 0..3]). To avoid this, you'd have to transpose one of the arrays first. (But that's O(N^2) for an NxN matrix, which is still cheaper than multiply).

This way doesn't use any horizontal sums at all.

It does operations in the same order as the scalar C, so results should match exactly.
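
A minimal sketch of that strategy with SSE intrinsics (hypothetical function and parameter names; assumes columns2 is a multiple of 4 and arraycheck starts out zeroed):

#include <xmmintrin.h>   /* SSE */

void sse_mult(float **array1, float **array2, float **arraycheck,
              int rows1, int columns1, int columns2)
{
    for (int i = 0; i < rows1; i++) {
        for (int k = 0; k < columns1; k++) {
            __m128 a = _mm_set1_ps(array1[i][k]);          /* broadcast one scalar */
            for (int j = 0; j < columns2; j += 4) {
                __m128 b   = _mm_loadu_ps(&array2[k][j]);  /* array2[k][j..j+3] */
                __m128 acc = _mm_loadu_ps(&arraycheck[i][j]);
                acc = _mm_add_ps(acc, _mm_mul_ps(a, b));   /* read-modify-write */
                _mm_storeu_ps(&arraycheck[i][j], acc);
            }
        }
    }
}

Each arraycheck[i][j] accumulates its products in the same k order as the scalar loop, so the results should be bit-identical to the plain C version (as long as the compiler doesn't contract the scalar version's mul+add into an FMA).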


Also note that you need to use more than one accumulator to saturate the execution units of a CPU. I'd suggest 8, to saturate Skylake's addps: 4c latency, one per 0.5c throughput. Haswell has 3c latency, one per 1c addps, but Skylake dropped the separate FP add unit and runs FP adds in the FMA units. (See the x86 tag wiki, esp. Agner Fog's guides.)

Actually, since my suggested change doesn't use a single accumulator at all, every loop iteration accesses independent memory. You'll need a bit of loop unrolling to saturate the FP execution units with two loads and a store in the loop (even though you only need two pointers, since the store goes back to the same location as one of the loads). But anyway, if your data fits in L1 cache, out-of-order execution should keep the execution units pretty well supplied with work from separate iterations.
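
For example, here's the same sketch with the inner j loop unrolled 2x (hypothetical; assumes columns2 is a multiple of 8):

#include <xmmintrin.h>

void sse_mult_unrolled(float **array1, float **array2, float **arraycheck,
                       int rows1, int columns1, int columns2)
{
    for (int i = 0; i < rows1; i++) {
        for (int k = 0; k < columns1; k++) {
            __m128 a = _mm_set1_ps(array1[i][k]);
            for (int j = 0; j < columns2; j += 8) {
                /* two independent load/mul/add/store chains per iteration
                   cut loop overhead and give out-of-order exec more to work on */
                __m128 b0 = _mm_loadu_ps(&array2[k][j]);
                __m128 b1 = _mm_loadu_ps(&array2[k][j + 4]);
                __m128 c0 = _mm_loadu_ps(&arraycheck[i][j]);
                __m128 c1 = _mm_loadu_ps(&arraycheck[i][j + 4]);
                c0 = _mm_add_ps(c0, _mm_mul_ps(a, b0));
                c1 = _mm_add_ps(c1, _mm_mul_ps(a, b1));
                _mm_storeu_ps(&arraycheck[i][j],     c0);
                _mm_storeu_ps(&arraycheck[i][j + 4], c1);
            }
        }
    }
}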

If you really care about performance, you'll make an FMA version, and maybe an AVX-without-FMA version for Sandybridge. You could be doing two 256b FMAs per clock instead of one 128b add and mul per clock. (And of course you're not even getting that, because you bottleneck on latency unless the loop is short enough for the out-of-order window to see independent instructions from the next iteration).

You're going to need "loop tiling", aka "cache blocking" to make this not suck for big problem sizes. This is a matrix multiply, right? There are very good libraries for this, which are tuned for cache sizes, and will beat the pants off a simple attempt like this. e.g. ATLAS was good last time I checked, but that was several years ago.
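
For illustration, the scalar shape of cache blocking looks roughly like this (hypothetical TILE size; assumes square n x n matrices with n a multiple of TILE, and arraycheck zeroed):

#define TILE 64   /* block edge; real libraries tune this to the cache hierarchy */

void tiled_multiplication(float **array1, float **array2, float **arraycheck, int n)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int kk = 0; kk < n; kk += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                /* multiply one TILE x TILE block while it's hot in cache */
                for (int i = ii; i < ii + TILE; i++)
                    for (int k = kk; k < kk + TILE; k++)
                        for (int j = jj; j < jj + TILE; j++)
                            arraycheck[i][j] += array1[i][k] * array2[k][j];
}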


Use intrinsics, unless you write the whole function in asm. Compilers "understand" what intrinsics do, so they can make nice optimizations, like unrolling loops when appropriate.

Peter Cordes
    @PascalCuoq: Unless I'm mistaken, the OP's asm didn't match his C at all, unless he was passing a transposed copy of one of the arrays to the asm version. Had to make a big update to my answer, so pinging you in case you're curious. – Peter Cordes Apr 10 '16 at 06:04
  • @PeterCordes Yes, you are right. I am passing a transposed array to a function with SSE. – sav_a Apr 11 '16 at 04:59
  • @AntonSavich: transposing the whole array at once is limited by memory bandwidth. Much better to only transpose a part that fits in L1 or L2 cache and operate on that while it's still hot. I think this is the standard cache-blocking technique for matmuls. Again, highly recommend using an optimized matmul library (you're looking for the BLAS `DGEMM` routine). – Peter Cordes Apr 11 '16 at 05:02

According to the IEEE 754 standard formats, a 32-bit float can only guarantee 6-7 significant decimal digits of accuracy. Your error is so marginal that no plausible claim can be made about the compiler's mechanism. If you want better precision, it would be wise to either choose a 64-bit double (which guarantees 15 digits of accuracy) or implement your own BigDecimal class as Java does.
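
A quick standalone demo of those limits (illustration only):

#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("float  guarantees %d decimal digits\n", FLT_DIG);   /* 6 */
    printf("double guarantees %d decimal digits\n", DBL_DIG);   /* 15 */

    /* The OP's two results agree to 7 significant digits; their
       difference is about one float ulp at this magnitude. */
    printf("%.7f\n", 20.333334f - 20.333332f);   /* ~0.0000019 */
    return 0;
}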

Lotayou
    I disagree with the down-voter, but I suspect it's because this doesn't *really* answer the question. Floating point arithmetic *is* deterministic - it's just a lot more complicated than integer. I suspect the OP wants to write some "simple" C++ code to check their optimized SSE code. (This is actually very hard, and there will be so many casts, it may not be all that simple). – Martin Bonner supports Monica Apr 09 '16 at 13:12