What is causing different results when summing two arrays of float numbers using SSE SIMD in C++ versus Go assembly?

Question

As an example, I wrote a small function that sums two arrays of floats using SSE SIMD vector operations in C++. I expected to get the same result from a Go assembly version, but I don't. What is wrong?

The equivalent scalar pseudocode looks like:

a := [N]float{}
b := [N]float{}
sum := 0
for i := 0; i < N; i++ do
    sum += a[i] + b[i]
endfor

Vectorized with SSE SIMD in C++:

#include <cstdlib>     // malloc
#include <immintrin.h> // SSE/SSE3 intrinsics (<intrin.h> is MSVC-only)
#include <iostream>

using namespace std;

float sum(const float* a, const float* b, unsigned int len) {
    __m128 sum = _mm_set1_ps(0);
    for (int i = 0; i < len; i += 4) {
         __m128 sseA = _mm_loadu_ps(&a[i]);
         __m128 sseB = _mm_loadu_ps(&b[i]);

         sum = _mm_add_ps(sseA, sum);
         sum = _mm_add_ps(sseB, sum);
    }
    sum = _mm_hadd_ps(sum, sum);
    sum = _mm_hadd_ps(sum, sum);
    return _mm_cvtss_f32(sum);
}

int main() {
   size_t N = 7;
   float *a = (float *)malloc(sizeof(float)*N);
   a[0] = 1;
   a[1] = 2;
   a[2] = 3.21;
   a[3] = 3.5;
   a[4] = 3.1;
   a[5] = 3.2;
   a[6] = 0.2;
   float *b = (float *)malloc(sizeof(float)*N);
   b[0] = 1;
   b[1] = 2;
   b[2] = 3.2;
   b[3] = 5.5;
   b[4] = 3.1;
   b[5] = 3.1;
   b[6] = 0.21;
   float ret = sum(a, b, N);
   cout<<ret;
   return 0;
}

This code outputs: 33.32

In Go I wrote the same function, but I don't get the same result:

Go assembly file:

#include "textflag.h"

// func _sum(a []float32, b []float32) float32
TEXT _sum(SB), NOSPLIT, $0-52
   MOVQ a_base+0(FP), AX
   MOVQ b_base+24(FP), CX
   MOVQ a_len+8(FP), DX

   VXORPS X0, X0, X0 

loop:
   CMPQ DX, $0
   JE   back
   VMOVUPS (AX), X1

   VADDPS (CX), X1, X1
   VHADDPS X1, X1, X1
   VHADDPS X1, X1, X1
   
   VADDPS X1, X0, X0
   
   ADDQ $10, AX
   ADDQ $10, CX
   SUBQ $4, DX
   JMP loop

back:
   MOVSS X0, ret+48(FP)
   RET

Go main file:

package main

import "fmt"

//go:noescape
func _sum(a []float32, b []float32) float32

func main() {
   a := []float32{1, 2, 3.21, 3.5, 3.1, 3.2, 0.2}
   b := []float32{1, 1, 3.2, 5.5, 3.1, 3.1, 0.21}
   ret := _sum(a, b)
   fmt.Println(ret)
}

This code outputs: 1.1076422e+09

32-bit floats versus 128-bit floats... that must make a difference in precision, right? So yes, I would expect rounding differences — Pepijn Kramer, Jun 01 '23 at 11:30
The 128 bits can be viewed as 4 * 32 bits: the additions are done four at a time instead of one by one. And this is not just a rounding difference. The result in Go is totally different from the one in C++ — ANTON SAVELIEV, Jun 01 '23 at 11:49
Please show a [mre] including the inputs, outputs and expected outputs — Alan Birtles, Jun 01 '23 at 11:52
Printing outputs tells us whether either you are calculating different things, or whether they are slightly different rounding errors. — gnasher729, Jun 01 '23 at 12:00
Sorry, I misread your AVX assembly (it does indeed use 32-bit floats too) — Pepijn Kramer, Jun 01 '23 at 12:08
Your C++ code is buggy. First of all, you don't want to do a horizontal sum inside the inner loop. Just use `vaddps` inside the loop, hsum once at the end (see [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764) for how to do it efficiently.) Also, your C++ code is doing a 16-byte store to a 4-byte scalar local variable, overwriting 12 bytes of stack space past the end of `sum`. — Peter Cordes, Jun 01 '23 at 14:35
Are you sure the Go asm is storing the return value to the right place? Is the value in XMM0 at the end of that inefficient loop correct? If so, then it's a calling convention problem. Keep single-stepping and look at what the caller loads from where. (Perhaps the caller passed a pointer?) Does the Go assembler magically invent instructions to set up FP as a frame pointer on function entry and tear it down before `ret`? If not, you'd still be using the caller's FP. (But that would probably have faulted, so probably not that.) — Peter Cordes, Jun 01 '23 at 14:43
In practice I absolutely agree that a horizontal sum should not be done inside a loop. This is synthetic code. Let me check the sizes in C++... — ANTON SAVELIEV, Jun 01 '23 at 14:48
I made the C++ code clearer. And yes, I think the main problem in the Go code is arithmetic on the pointers, not on the values — ANTON SAVELIEV, Jun 01 '23 at 15:27
Thanks, all. The problem is solved. Pay attention to details :) — ANTON SAVELIEV, Jun 01 '23 at 16:11
Ok, now your C++ is correct for lengths that are a multiple of 4, when `len <= INT_MAX`. (It's usually not recommended to use a signed loop counter with an unsigned upper bound. If you're going to use any unsigned type, use `size_t` for both `len` and `i`.) The horizontal sum outside the loop could be more efficient, see [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764) . Your version is optimized for machine-code size in bytes, not for micro-ops. — Peter Cordes, Jun 01 '23 at 20:00
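The two points above (reduce horizontally only once, and handle lengths that are not a multiple of 4) can be sketched in plain Go. This is only a scalar model, not the asker's code: `sumLanes4` is a hypothetical name, the four-element array stands in for one XMM register, and the inputs are the ones from the Go main file.

```go
package main

import "fmt"

// sumLanes4 is a hypothetical scalar stand-in for the SIMD kernel.
// It keeps four independent lane accumulators (like one XMM register),
// reduces them once after the loop, and sums leftover elements one by one.
func sumLanes4(a, b []float32) float32 {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	var lanes [4]float32
	i := 0
	// Vertical adds only: the model equivalent of two ADDPS per iteration.
	for ; i+4 <= n; i += 4 {
		for l := 0; l < 4; l++ {
			lanes[l] += a[i+l]
			lanes[l] += b[i+l]
		}
	}
	// Single horizontal sum, done once outside the loop.
	total := (lanes[0] + lanes[1]) + (lanes[2] + lanes[3])
	// Scalar tail for the n%4 elements a full 4-wide load must not touch.
	for ; i < n; i++ {
		total += a[i] + b[i]
	}
	return total
}

func main() {
	a := []float32{1, 2, 3.21, 3.5, 3.1, 3.2, 0.2}
	b := []float32{1, 1, 3.2, 5.5, 3.1, 3.1, 0.21}
	fmt.Printf("%.2f\n", sumLanes4(a, b))
}
```

With 7 elements, the first four go through the lane accumulators and the last three through the scalar tail, so nothing is read past the end of either slice.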
Oh right, `ADDQ $10, AX` is adding 10, not 0x10. I thought that looked weird, but I don't usually use Go assembly, I forgot it was like AT&T syntax in using decorators on immediates, not using `$` for hex like some assembly languages (e.g. many 8-bit micros). — Peter Cordes, Jun 01 '23 at 20:01
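To see why a 10-byte step breaks the sum, here is a rough pure-Go model of the loop that reads four `float32` lanes from a byte offset and advances by a configurable stride. The helper names (`toBytes`, `load4`, `sumWithStride`) are hypothetical, and the inputs are padded with a zero to 8 elements so that the model's loop terminates and stays in bounds; that padding is an assumption, not part of the original program.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// toBytes lays the floats out as they sit in memory on x86 (little-endian).
func toBytes(xs []float32) []byte {
	buf := make([]byte, 4*len(xs))
	for i, x := range xs {
		binary.LittleEndian.PutUint32(buf[4*i:], math.Float32bits(x))
	}
	return buf
}

// load4 mimics an unaligned 16-byte load: four float32 lanes at byte offset off.
func load4(mem []byte, off int) [4]float32 {
	var v [4]float32
	for l := range v {
		v[l] = math.Float32frombits(binary.LittleEndian.Uint32(mem[off+4*l:]))
	}
	return v
}

// sumWithStride models the assembly loop: load, add, reduce, then advance
// both pointers by stride bytes. len(a) is assumed to be a multiple of 4.
func sumWithStride(a, b []float32, stride int) float32 {
	ma, mb := toBytes(a), toBytes(b)
	var total float32
	off := 0
	for n := len(a); n > 0; n -= 4 {
		va, vb := load4(ma, off), load4(mb, off)
		s := ((va[0] + vb[0]) + (va[1] + vb[1])) + ((va[2] + vb[2]) + (va[3] + vb[3]))
		total += s
		off += stride // 16 bytes = four float32; 10 lands mid-element
	}
	return total
}

func main() {
	a := []float32{1, 2, 3.21, 3.5, 3.1, 3.2, 0.2, 0}
	b := []float32{1, 1, 3.2, 5.5, 3.1, 3.1, 0.21, 0}
	fmt.Println("stride 16:", sumWithStride(a, b, 16))
	fmt.Println("stride 10:", sumWithStride(a, b, 10))
}
```

With a stride of 16 (`$16` or `$0x10` in Go assembly) every load starts on an element boundary. With a stride of 10 the second load starts two bytes into the third element, so each lane is assembled from halves of two neighbouring floats and the bit patterns decode to unrelated values.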
