
I need to build a single-precision floating-point inner-product routine for mixed single/double-precision vectors, exploiting the 256-bit SIMD registers of the AVX instruction set.

Problem: one input vector is float (x), while the other is double (yD).

Hence, before computing the actual inner-product operations, I need to convert my input yD vector data from double to float.
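
To make the intended semantics concrete, a scalar version of what I want to compute is roughly the following (the function name and n are just illustrative; the conversion to float happens before the multiplication, as in the SIMD code below):

  float dot_mixed_scalar(const float* x, const double* yD, size_t n)
  {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i)
      acc += x[i] * (float)yD[i]; // convert each double to float, then multiply in single precision
    return acc;
  }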

Using the SSE2 instruction set, I was able to implement very fast code that does what I need, with performance very close to the case where both vectors x and y are float:

  // x (float*), yD (double*) and res (the __m128 accumulator) are globals
  void vector_operation(const size_t i) 
  {
    __m128 X = _mm_load_ps(x + i);
    // convert 2+2 doubles to floats and pack them into one __m128
    __m128 Y = _mm_movelh_ps(_mm_cvtpd_ps(_mm_load_pd(yD + i + 0)),
                             _mm_cvtpd_ps(_mm_load_pd(yD + i + 2)));
    //inner-products accumulation
    res = _mm_add_ps(res, _mm_mul_ps(X, Y));
  }
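
For context, this kernel is called from an outer loop and the four partial sums in res are reduced once at the end; a rough sketch (names illustrative, assuming n is a multiple of 4 and x, yD are 16-byte aligned):

  float dot_sse2(size_t n)
  {
    res = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)
      vector_operation(i);                               // 4 products per call

    // horizontal sum of the 4 partial sums held in res
    __m128 t = _mm_add_ps(res, _mm_movehl_ps(res, res)); // (r0+r2, r1+r3, ...)
    t = _mm_add_ss(t, _mm_shuffle_ps(t, t, 0x55));       // (r0+r2) + (r1+r3)
    return _mm_cvtss_f32(t);
  }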

Now, hoping for a further speed-up, I implemented a corresponding version with the AVX instruction set:

  inline void vector_operation(const size_t i) 
  {
    __m256 X = _mm256_load_ps(x + i);
    // convert the 8 doubles to floats, 2 at a time
    __m128 yD1 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 0));
    __m128 yD2 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 2));
    __m128 yD3 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 4));
    __m128 yD4 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 6));

    __m128 Ylow = _mm_movelh_ps(yD1, yD2);
    __m128 Yhigh = _mm_movelh_ps(yD3, yD4);

    //Pack __m128 data inside __m256 
    __m256 Y = _mm256_permute2f128_ps(_mm256_castps128_ps256(Ylow), _mm256_castps128_ps256(Yhigh), 0x20);

    //inner-products accumulation 
    res = _mm256_add_ps(res, _mm256_mul_ps(X, Y));
  }
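
(The final reduction of the __m256 accumulator to a scalar happens only once, outside the hot loop; for illustration, something along these lines:)

  float reduce256(__m256 acc)
  {
    __m128 lo = _mm256_castps256_ps128(acc);    // lanes 0..3
    __m128 hi = _mm256_extractf128_ps(acc, 1);  // lanes 4..7
    __m128 s  = _mm_add_ps(lo, hi);             // 4 partial sums
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));
    return _mm_cvtss_f32(s);
  }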

I also tested other AVX implementations using, for example, casting and insertion operations instead of permuting data. Performance was similarly poor compared to the case where both the x and y vectors are float.

The problem with the AVX code is that, no matter how I implement it, its performance is far inferior to that achieved using float-only x and y vectors (i.e. when no double-to-float conversion is needed).

The conversion from double to float for the yD vector seems pretty fast, while a lot of time is lost in the line where the data is inserted into the __m256 Y register.

Do you know if this is a well-known issue with AVX?

Do you have a solution that could preserve good performance?

Thanks in advance!

Liotro78
  • What compiler? And what compiler options? Also, what generation processor? Haswell? Skylake? – Mysticial Mar 21 '18 at 19:52
  • Compiler: Microsoft Visual Studio C++ 2013, CPU: ("Kaby Lake-U") Intel® Core™ i7-7500U [link](https://ark.intel.com/products/95451/Intel-Core-i7-7500U-Processor-4M-Cache-up-to-3_50-GHz-). – Liotro78 Mar 21 '18 at 20:21
  • Did you pass the flag `/arch:avx`? – Mysticial Mar 21 '18 at 20:23
  • I didn't. For SSE2 code I didn't add any flag, just activated the intrinsics in the properties of the project. I will try tomorrow to add the flag /arch:avx. I will let you know. What do you think about the code? – Liotro78 Mar 21 '18 at 20:26
  • Worth a read: https://stackoverflow.com/questions/7839925/using-avx-cpu-instructions-poor-performance-without-archavx – Mysticial Mar 21 '18 at 20:28
  • I found this discussion [link](https://stackoverflow.com/questions/20169064/does-archavx-enable-avx2), maybe this was the problem. Let's see tomorrow! Thanks – Liotro78 Mar 21 '18 at 20:29
  • Wow, thanks a lot. I will read it! – Liotro78 Mar 21 '18 at 20:29
  • Your statement, "The conversion from double to float for the yD vector seems pretty fast, while a lot of time is lost in the line where data is inserted in the _m256 Y register." - made me suspect you were using Visual Studio without `/arch:AVX` as that would be precisely what you would see under a profiler. While I'm not sure if this is the cause (since you have a *-lake processor), it's still suspect. – Mysticial Mar 21 '18 at 20:30
  • Apart from the main topic, what is your opinion of the AVX code? Do you think it is already in good shape to perform decently fast once /arch:AVX is added? – Liotro78 Mar 21 '18 at 20:35
  • I don't know. I haven't tried to parse what it does to see if what you have is efficient or not. – Mysticial Mar 21 '18 at 20:39
  • @Liotro78: you should probably be using 256-bit loads and 256-bit `vcvtpd2ps`, then shuffle those results together, especially if you have AVX2. Also, pass args to your inline function instead of keeping pointers in globals like `yD` or something. That might not inline away. – Peter Cordes Mar 21 '18 at 22:59
  • Thanks to all for your valuable inputs. Adding /arch:AVX flag fixed the problem! – Liotro78 Mar 23 '18 at 14:54

1 Answer


I rewrote your function to take better advantage of what AVX has to offer. I also used fused multiply-add at the end; if you can't use FMA, just replace that line with addition and multiplication (a non-FMA variant is shown after the code). I only now see that I wrote an implementation that uses unaligned loads while yours uses aligned loads, but I'm not gonna lose any sleep over it. :)

__m256 foo(float* x, double* yD, const size_t i, __m256 res_prev)
{
  __m256 X = _mm256_loadu_ps(x + i);

  // convert 4 doubles at a time with the 256-bit form of vcvtpd2ps
  __m128 yD21 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 0));
  __m128 yD43 = _mm256_cvtpd_ps(_mm256_loadu_pd(yD + i + 4));

  // combine the two 128-bit halves into one 256-bit vector
  __m256 Y = _mm256_set_m128(yD43, yD21);

  return _mm256_fmadd_ps(X, Y, res_prev);
}
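
If FMA is not available (AVX-only target), that last line can be replaced with a separate multiply and add:

return _mm256_add_ps(_mm256_mul_ps(X, Y), res_prev);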

I did a quick benchmark comparing the running times of your implementation and mine. I tried two different benchmarking approaches with several repetitions, and every time my code was around 15% faster. I used the MSVC 14.1 compiler and compiled the program with the /O2 and /arch:AVX2 flags.

EDIT: this is the disassembly of the function:

vcvtpd2ps   xmm3,ymmword ptr [rdx+r8*8+20h]  
vcvtpd2ps   xmm2,ymmword ptr [rdx+r8*8]  
vmovups     ymm0,ymmword ptr [rcx+r8*4]  

vinsertf128 ymm3,ymm2,xmm3,1  

vfmadd213ps ymm0,ymm3,ymmword ptr [r9] 

EDIT 2: this is the disassembly of your AVX implementation of the same algorithm:

vcvtpd2ps   xmm0,xmmword ptr [rdx+r8*8+30h]  
vcvtpd2ps   xmm1,xmmword ptr [rdx+r8*8+20h]  

vmovlhps    xmm3,xmm1,xmm0  
vcvtpd2ps   xmm0,xmmword ptr [rdx+r8*8+10h]  
vcvtpd2ps   xmm1,xmmword ptr [rdx+r8*8]  
vmovlhps    xmm2,xmm1,xmm0  

vperm2f128  ymm3,ymm2,ymm3,20h  

vmulps      ymm0,ymm3,ymmword ptr [rcx+r8*4]  
vaddps      ymm0,ymm0,ymmword ptr [r9]
Nejc
  • Yup, much better. All versions of `vcvtpd2ps` cost a shuffle uop on Haswell / Skylake, so using YMM is much better because it also leads to less shuffling *after* to create a 256b vector. Your stand-alone version of the function ends up with an extra memory operand because MSVC's default calling convention passes `__m256` by hidden reference; unfortunately MS didn't use `__vectorcall` for x86-64 by default. Anyway, this should go away when inlining. – Peter Cordes Mar 22 '18 at 02:51
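
For reference, opting into that calling convention explicitly would look like this under MSVC x64 (the function body is unchanged; compiling with /Gv is the project-wide alternative):

__m256 __vectorcall foo(float* x, double* yD, const size_t i, __m256 res_prev);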