In my previous post I explain that I am starting with AVX to speed up my code (please, note that although there are parts in common this post refers to AVX512 and the previous one to AVX2 which as far as I know are slightly different and need different compiling flags). After experimenting with AVX2 I decided to try with AVX512 and changed my AVX2 function:
void getDataAVX2(u_char* data, size_t cols, std::vector<double>& info)
{
__m256d dividend = _mm256_set_pd(1 / 64.0, 1 / 64.0, 1 / 64.0, 1 / 64.0);
info.resize(cols);
__m256d result;
for (size_t i = 0; i < cols / 4; i++)
{
__m256d divisor = _mm256_set_pd((double(data[4 * i + 3 + cols] << 8) + double(data[4 * i + 2 * cols + 3])),
(double(data[4 * i + 2 + cols] << 8) + double(data[4 * i + 2 * cols + 2])),
(double(data[4 * i + 1 + cols] << 8) + double(data[4 * i + 2 * cols + 1])),
(double(data[4 * i + cols] << 8) + double(data[4 * i + 2 * cols])));
result = _mm256_sqrt_pd(_mm256_mul_pd(divisor, dividend));
info[size_t(4 * i)] = result[0];
info[size_t(4 * i + 1)] = result[1];
info[size_t(4 * i + 2)] = result[2];
info[size_t(4 * i + 3)] = result[3];
}
}
for what I think should be its equivalent:
void getDataAVX512(u_char* data, size_t cols, std::vector<double>& info)
{
__m512d dividend = _mm512_set_pd(1 / 64.0, 1 / 64.0, 1 / 64.0, 1 / 64.0, 1 / 64.0, 1 / 64.0, 1 / 64.0, 1 / 64.0);
info.resize(cols);
__m512d result;
for (size_t i = 0; i < cols / 8; i++)
{
__m512d divisor = _mm512_set_pd((double(data[4 * i + 7 + cols] << 8) + double(data[4 * i + 2 * cols + 7])),
(double(data[4 * i + 6 + cols] << 8) + double(data[4 * i + 2 * cols + 6])),
(double(data[4 * i + 5 + cols] << 8) + double(data[4 * i + 2 * cols + 5])),
(double(data[4 * i + 4 + cols] << 8) + double(data[4 * i + 2 * cols + 4])),
(double(data[4 * i + 3 + cols] << 8) + double(data[4 * i + 2 * cols + 3])),
(double(data[4 * i + 2 + cols] << 8) + double(data[4 * i + 2 * cols + 2])),
(double(data[4 * i + 1 + cols] << 8) + double(data[4 * i + 2 * cols + 1])),
(double(data[4 * i + cols] << 8) + double(data[4 * i + 2 * cols])));
result = _mm512_sqrt_pd(_mm512_mul_pd(divisor, dividend));
info[size_t(4 * i)] = result[0];
info[size_t(4 * i + 1)] = result[1];
info[size_t(4 * i + 2)] = result[2];
info[size_t(4 * i + 3)] = result[3];
info[size_t(4 * i + 4)] = result[4];
info[size_t(4 * i + 5)] = result[5];
info[size_t(4 * i + 6)] = result[6];
info[size_t(4 * i + 7)] = result[7];
}
}
which in a non AVX form is:
void getData(u_char* data, size_t cols, std::vector<double>& info)
{
info.resize(cols);
for (size_t i = 0; i < cols; i++)
{
info[i] = sqrt((double(data[cols + i] << 8) + double(data[2 * cols + i])) / 64.0);
;
}
}
After compiling the code I get the following error:
Illegal instruction (core dumped)
To my surprise, this error occurs in the call of sqrt
in the getData
function. If I remove the sqrt
call then the error appears further forward, in the __m512d divisor = _mm512_set_pd((d....
. Any ideas on what is happening?
Here is the full example.
Thank you very much.
I am compiling with c++
(7.3.0) with the following options -std=c++17 -Wall -Wextra -O3 -fno-tree-vectorize -mavx512f
. I have checked as explained here and my CPU (Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz) supports AVX2. Should the list have AVX-512 to indicate support for this?