
I am trying to write simple code using SSE and SSE3 to sum all the elements of an array. In one version I do the sum "vertically" using PADDD, and in the other I do the sum horizontally using HADDPS. Since the only value I am interested in is the total sum, the way I do the sum should not matter. However, the horizontal version outputs wrong results. Any idea why?

This is the code for the regular add:

int sumelems_sse(int *a, int size)
{
  int tmp[4];
  tmp[0] = 0;
  tmp[1] = 0;
  tmp[2] = 0;
  tmp[3] = 0;
  int total;

  __asm__ volatile (
                   "\n\t movdqa %0,%%xmm0 \t#"       // moves tmp[0] to xmm0
                   : /* no output */
                   : "m" (tmp[0])   //%0
                   );

  for (int i=0;i<size;i+=4) {
    __asm__ volatile
        ( // instruction         comment          
        "\n\t movdqa     %0,%%xmm1     \t#"           // moves a[i] to xmm1
        "\n\t paddd    %%xmm1,%%xmm0  \t#"            // xmm0 = xmm0+xmm1 in 4 blocks of 32 bits
        : /* no output */
        : "m"  (a[i])       // %0 
        );
  }

   __asm__ volatile(
                   "\n\t movdqa %%xmm0,%0 \t#"         // moves xmm0 to tmp[0]
                   : "=m" (tmp[0])
                   );


   total = tmp[0] + tmp[1] + tmp[2] + tmp[3];
   return total;
}

And this is the code for the horizontal add:

int sumelems_sse3(int *a, int size)
{
  int tmp[4];
  tmp[0] = 0;
  tmp[1] = 0;
  tmp[2] = 0;
  tmp[3] = 0;
  int total;

  __asm__ volatile (
                   "\n\t movdqa %0,%%xmm0 \t#"       // moves tmp[0] to xmm0
                   : /* no output */
                   : "m" (tmp[0])   //%0
                   );

  for (int i=0;i<size;i+=4) {
    __asm__ volatile
        ( // instruction         comment          
        "\n\t movdqa     %0,%%xmm1     \t#"             // moves a[i] to xmm1
        "\n\t haddps      %%xmm1,%%xmm0   \t#"           // horizontal add: adjacent pairs of xmm0 and xmm1 summed into xmm0, 4 lanes of 32 bits
        : /* no output */
        : "m"  (a[i])       // %0 
        );
  }

   __asm__ volatile(
                   "\n\t movdqa %%xmm0,%0 \t#"         // moves xmm0 to tmp[0]
                   : "=m" (tmp[0])
                   );


   total = tmp[0] + tmp[1] + tmp[2] + tmp[3];
   return total;

}

I assumed that only the add instruction needs to change. Is that not the case?

zx485
julix
  • `haddps` treats the elements as single-precision floating-point bit patterns. `paddd` treats them as integers! Perhaps you wanted SSSE3 `phaddd`. But there's a big difference in performance (see [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764)), and your inline asm is also unsafe: you aren't telling the compiler about the XMM registers you write, or that the XMM0 output from one statement is an input to the next. It could emit instructions that use XMM0. Inline asm is pointless for this, just use intrinsics. – Peter Cordes Jun 27 '20 at 03:59
  • See [Getting started with Intel x86 SSE SIMD instructions](https://stackoverflow.com/q/1389712) – Peter Cordes Jun 27 '20 at 04:03
  • I think if you did it right, this would actually (slowly) sum the array, though, with new data percolating towards the bottom element of the accumulator vector. e.g. after a few steps, you might have a vector like `m+n, o+p, i+j+k+l, e+f+g+h+a+b+c+d`. (I deleted an earlier comment I made before actually working through an example for a few steps. This is viable in theory, but terrible for performance and this specific implementation is doing almost everything wrong.) – Peter Cordes Jun 28 '20 at 09:14
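
To make the comments above concrete, here is a minimal sketch of both ideas. It is not code from the question or from the commenters. `sum_vertical_sse2` is a hypothetical intrinsics version that adds vertically in the loop and does a single horizontal reduction at the end. `hadd_model` and `sum_hadd_model` are hypothetical plain-C helpers that model the lane behaviour of an integer horizontal add such as `phaddd`, to illustrate why accumulating with it still yields the total. The sketch assumes an x86 target with SSE2 and 32-bit `int`.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_add_epi32, ... */
#include <stddef.h>

/* Hypothetical helper: vertical integer adds in the loop, then one
   horizontal reduction of the four lanes after the loop. */
int sum_vertical_sse2(const int *a, int size)
{
    assert(a != NULL || size == 0);

    __m128i acc = _mm_setzero_si128();
    int i = 0;
    for (; i + 4 <= size; i += 4) {
        /* Unaligned load, so `a` does not have to be 16-byte aligned. */
        __m128i v = _mm_loadu_si128((const __m128i *)(a + i));
        acc = _mm_add_epi32(acc, v);   /* same operation as paddd */
    }

    /* Add the high 64 bits onto the low 64 bits, then lane 1 onto lane 0. */
    __m128i hi = _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2));
    acc = _mm_add_epi32(acc, hi);
    __m128i odd = _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1));
    acc = _mm_add_epi32(acc, odd);
    int total = _mm_cvtsi128_si32(acc);

    /* Scalar tail for sizes that are not a multiple of 4. */
    for (; i < size; i++)
        total += a[i];
    return total;
}

/* Scalar model of an integer horizontal add with AT&T operand order
   `phaddd src, dst`: dst = { d0+d1, d2+d3, s0+s1, s2+s3 }. */
static void hadd_model(int dst[4], const int src[4])
{
    int r0 = dst[0] + dst[1];
    int r1 = dst[2] + dst[3];
    int r2 = src[0] + src[1];
    int r3 = src[2] + src[3];
    dst[0] = r0; dst[1] = r1; dst[2] = r2; dst[3] = r3;
}

/* Accumulate with the modelled horizontal add. Each step preserves
   sum(dst) + sum(src), so the four lanes always add up to the running
   total. Only full groups of 4 elements are consumed. */
int sum_hadd_model(const int *a, int size)
{
    int acc[4] = {0, 0, 0, 0};
    for (int i = 0; i + 4 <= size; i += 4)
        hadd_model(acc, a + i);
    return acc[0] + acc[1] + acc[2] + acc[3];
}
```

For the inputs 1 through 8, `sum_hadd_model` passes through the accumulator states `{0, 0, 3, 7}` and `{0, 10, 11, 15}`, whose lanes sum to 10 and then 36. The same reasoning does not rescue `haddps` on integer data, because that instruction does floating-point addition on the bit patterns. The intrinsics version also avoids the register-clobber problem described in the first comment, since the compiler allocates the XMM registers itself.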
