I'm attempting to use vector intrinsics to speed up a trivial piece of code (as a test), and I'm not getting a speed up - in fact, it runs slower by a bit sometimes. I'm wondering two things:
- Do vectorized instructions speed up simple load from one region / store to another type operations in any way?
- Division intrinsics aren't yielding anything faster either, and in fact, I started getting segfaults when I introduced the
_mm256_div_pd
intrinsic. Is my usage correct?
constexpr size_t VECTORSIZE{ (size_t)1024 * 1024 * 64 }; //large array to force main memory accesses
void normal_copy(const fftw_complex* in, fftw_complex* copyto, size_t copynum)
{
for (size_t i = 0; i < copynum; i++)
{
copyto[i][0] = in[i][0] / 128.0;
copyto[i][1] = in[i][1] / 128.0;
}
}
#if defined(_WIN32) || defined(_WIN64)
void avx2_copy(const fftw_complex* __restrict in, fftw_complex* __restrict copyto, size_t copynum)
#else
void avx2_copy(const fftw_complex* __restrict__ in, fftw_complex* __restrict__ copyto, size_t copynum)
#endif
{ //avx2 supports 256 bit vectorized instructions
constexpr double zero = 0.0;
constexpr double dnum = 128.0;
__m256d tmp = _mm256_broadcast_sd(&zero);
__m256d div = _mm256_broadcast_sd(&dnum);
for (size_t i = 0; i < copynum; i += 2)
{
tmp = _mm256_load_pd(&in[i][0]);
tmp = _mm256_div_pd(tmp, div);
_mm256_store_pd(©to[i][0], tmp);
}
}
int main()
{
fftw_complex* invec = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));
fftw_complex* outvec1 = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));
fftw_complex* outvec3 = (fftw_complex*)fftw_malloc(VECTORSIZE * sizeof(fftw_complex));
//some initialization stuff for invec
//some timing stuff (wall clock)
normal_copy(invec, outvec1, VECTORSIZE);
//some timing stuff (wall clock)
avx2_copy(invec, outvec3, VECTORSIZE);
return 0;
}
fftw_complex
is a datatype equivalent to std::complex. I've tested using both g++
(with -O3
and -ftree-vectorize
) on Linux, and Visual Studio on Windows - same results - AVX2 copy and div is slower and segfaults for certain array sizes. Tested array sizes are always powers of 2, so anything related to reading invalid memory (from _mm256_load_pd
) doesn't seem to be the issue. Any thoughts?