I have been trying to get started with the AVX2 instructions with not a lot of luck (this list of functions have been helpful). At the end, I got my first program compiling and doing what I wanted. The program that I have to do takes two u_char
and compounds a double out of it. Essentially, I use this to decode data stored in an array of u_char from a camera but I do not think is relevant for this question.
The process of obtaining the double
of of the two u_char
is:
double result = sqrt(double((msb<<8) + lsb)/64);
where msb
and lsb
are the two u_char
variables with the most significant bits (msb
) and the less significant bits (lsb
) of the double
to compute. The data is stored in an array representing a row-major matrix where the msb
and lsb
of the value encoded column i
are in the second and third rows respectively. I have coded this with and without AVX2:
void getData(u_char* data, size_t cols, std::vector<double>& info)
{
info.resize(cols);
for (size_t i = 0; i < cols; i++)
{
info[i] = sqrt(double((data[cols + i] << 8) + data[2 * cols + i]) / 64.0);
;
}
}
void getDataAVX2(u_char* data, size_t cols, std::vector<double>& info)
{
__m256d dividend = _mm256_set_pd(1 / 64.0, 1 / 64.0, 1 / 64.0, 1 / 64.0);
info.resize(cols);
__m256d result;
for (size_t i = 0; i < cols / 4; i++)
{
__m256d divisor = _mm256_set_pd(double((data[4 * i + 3 + cols] << 8) + data[4 * i + 2 * cols + 3]),
double((data[4 * i + 2 + cols] << 8) + data[4 * i + 2 * cols + 2]),
double((data[4 * i + 1 + cols] << 8) + data[4 * i + 2 * cols + 1]),
double((data[4 * i + cols] << 8) + data[4 * i + 2 * cols]));
_mm256_storeu_pd(&info[0] + 4 * i, _mm256_sqrt_pd(_mm256_mul_pd(divisor, dividend)));
}
}
However, to my surprise, this code is slower than the normal one? Any ideas on how to speed it up?
I am compiling with c++
(7.3.0) with the following options -std=c++17 -Wall -Wextra -O3 -fno-tree-vectorize -mavx2
. I have checked as explained here and my CPU (Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz) supports AVX2.
To check which one is faster is using time. The following function gives me timestamp:
inline double timestamp()
{
struct timeval tp;
gettimeofday(&tp, nullptr);
return double(tp.tv_sec) + tp.tv_usec / 1000000.;
}
I get timestamp before and after each function getData
and getDataAVX2
and subtract them to get the elapsed time on each function. The overall main
is the following:
int main(int argc, char** argv)
{
u_char data[] = {
0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x11, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, 0xf, 0xf,
0xf, 0xf, 0xe, 0x10, 0x10, 0xf, 0x10, 0xf, 0xf, 0x10, 0xf, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10, 0x10, 0xf,
0x10, 0xf, 0xe, 0xf, 0xf, 0x10, 0xf, 0xf, 0x10, 0xf, 0xf, 0xf, 0xf, 0x10, 0xf, 0xf, 0xf, 0xf, 0xf,
0xf, 0xf, 0xf, 0x10, 0xf, 0xf, 0xf, 0x10, 0xf, 0xf, 0xf, 0xf, 0xe, 0xf, 0xf, 0xf, 0xf, 0xf, 0x10,
0x10, 0xf, 0xf, 0xf, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2,
0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2,
0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2,
0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2,
0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xf2, 0xd3, 0xd1, 0xca, 0xc6, 0xd2, 0xd2, 0xcc, 0xc8, 0xc2, 0xd0, 0xd0,
0xca, 0xc9, 0xcb, 0xc7, 0xc3, 0xc7, 0xca, 0xce, 0xca, 0xc9, 0xc2, 0xc8, 0xc2, 0xbe, 0xc2, 0xc0, 0xb8, 0xc4, 0xbd,
0xc5, 0xc9, 0xbc, 0xbf, 0xbc, 0xb5, 0xb6, 0xc1, 0xbe, 0xb7, 0xb9, 0xc8, 0xb9, 0xb2, 0xb2, 0xba, 0xb4, 0xb4, 0xb7,
0xad, 0xb2, 0xb6, 0xab, 0xb7, 0xaf, 0xa7, 0xa8, 0xa5, 0xaa, 0xb0, 0xa3, 0xae, 0xa9, 0xa0, 0xa6, 0xa5, 0xa8, 0x9f,
0xa0, 0x9e, 0x94, 0x9f, 0xa3, 0x9d, 0x9f, 0x9c, 0x9e, 0x99, 0x9a, 0x97, 0x4, 0x5, 0x4, 0x5, 0x4, 0x4, 0x5,
0x5, 0x5, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5,
0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5,
0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x5, 0x4, 0x4, 0x4, 0x5, 0x5, 0x5, 0x4, 0x4,
0x5, 0x5, 0x5, 0x5, 0x4, 0x5, 0x5, 0x4, 0x4, 0x6, 0x4, 0x4, 0x6, 0x5, 0x4, 0x5, 0xf0, 0xf0, 0xf0,
0xf0, 0xf0, 0xf0, 0xe0, 0xf0, 0xe0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0,
0xf0, 0xf0, 0xe0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0,
0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0,
0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0, 0xf0,
0xf0
};
size_t cols = 80;
// Normal
std::cout << "Computing with normal way" << std::endl;
std::vector<double> info;
double tstart_normal = timestamp();
getData(data, cols, info);
double time_normal = timestamp() - tstart_normal;
// AVX2
std::cout << "Computing with avx" << std::endl;
std::vector<double> info_avx2;
double tstart_avx2 = timestamp();
getDataAVX2(data, cols, info_avx2);
double time_avx2 = timestamp() - tstart_avx2;
// Display difference
std::cout << "Time normal: " << time_normal << " s" << std::endl;
std::cout << "Time AVX2: " << time_avx2 << " s" << std::endl;
std::cout << "Time improvement AVX2: " << time_normal / time_avx2 << std::endl;
// Write to file
std::ofstream file;
file.open("out.csv");
for (size_t i = 0; i < cols; i++)
{
file << info[size_t(i)] << "," << info_avx2[size_t(i)];
file << std::endl;
}
file.close();
// Exit
return 0;
}
The full example can be found here.