I need to build a single-precision floating-point inner product routine for mixed single/double-precision floating-point vectors, exploiting the AVX instruction set for SIMD registers with 256 bits.
Problem: one input vector is float (x), while the other is double (yD).
Hence, before to compute the true inner product operations, I need to convert my input yD vector data from double to float.
Using the SSE2 instruction set, I was able to implement a very fast code doing what I needed, and with speed performances very close to the case when both vectors x and y were float:
void vector_operation(const size_t i)
{
__m128 X = _mm_load_ps(x + i);
__m128 Y = _mm_movelh_ps(_mm_cvtpd_ps(_mm_load_pd(yD + i + 0)), _mm_cvtpd_ps(_mm_load_pd(yD + i + 2)));
//inner-products accumulation
res = _mm_add_ps(res, _mm_mul_ps(X, Y));
}
Now, with the hope to further speed-up, I implemented a correpsonding version with AVX instruction set:
inline void vector_operation(const size_t i)
{
__m256 X = _mm256_load_ps(x + i);
__m128 yD1 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 0));
__m128 yD2 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 2));
__m128 yD3 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 4));
__m128 yD4 = _mm_cvtpd_ps(_mm_load_pd(yD + i + 6));
__m128 Ylow = _mm_movelh_ps(yD1, yD2);
__m128 Yhigh = _mm_movelh_ps(yD3, yD4);
//Pack __m128 data inside __m256
__m256 Y = _mm256_permute2f128_ps(_mm256_castps128_ps256(Ylow), _mm256_castps128_ps256(Yhigh), 0x20);
//inner-products accumulation
res = _mm256_add_ps(res, _mm256_mul_ps(X, Y));
}
I also tested other AVX implementations using, for example, casting and insertion operations instead of perfmuting data. Performances were comparably poor compared to the case where both x and y vectors were float.
The problem with the AVX code is that no matter how I implemented it, its performance is by far inferior to the ones achieved by using only float x and y vectors (i.e. no double-float conversion is needed).
The conversion from double to float for the yD vector seems pretty fast, while a lot of time is lost in the line where data is inserted in the _m256 Y register.
Do you know if this is a well-known issue with AVX?
Do you have a solution that could preserve good performances?
Thanks in advance!