SVML's __m256d _mm256_log2_pd (__m256d a)
is not available on other compilers than Intel, and they say its performance is handicapped on AMD processors. There are some implementations on the internet referred in AVX log intrinsics (_mm256_log_ps) missing in g++-4.8? and SIMD math libraries for SSE and AVX , however they seem to be more SSE than AVX2. There's also Agner Fog's vector library , however it's a large library having much more stuff that just vector log2, so from the implementation in it it's hard to figure out the essential parts for just the vector log2 operation.
So can someone just explain how to implement log2()
operation for a vector of 4 double
numbers efficiently? I.e. like what __m256d _mm256_log2_pd (__m256d a)
does, but available for other compilers and reasonably efficient for both AMD and Intel processors.
EDIT: In my current specific case, the numbers are probabilities between 0 and 1, and logarithm is used for entropy computation: the negation of sum over all i
of P[i]*log(P[i])
. The range of floating-point exponents for P[i]
is large, so the numbers can be close to 0. I'm not sure about accuracy, so would consider any solution starting with 30 bits of mantissa, especially a tuneable solution is preferred.
EDIT2: here is my implementation so far, based on "More efficient series" from https://en.wikipedia.org/wiki/Logarithm#Power_series . How can it be improved? (both performance and accuracy improvements are desired)
namespace {
const __m256i gDoubleExpMask = _mm256_set1_epi64x(0x7ffULL << 52);
const __m256i gDoubleExp0 = _mm256_set1_epi64x(1023ULL << 52);
const __m256i gTo32bitExp = _mm256_set_epi32(0, 0, 0, 0, 6, 4, 2, 0);
const __m128i gExpNormalizer = _mm_set1_epi32(1023);
//TODO: some 128-bit variable or two 64-bit variables here?
const __m256d gCommMul = _mm256_set1_pd(2.0 / 0.693147180559945309417); // 2.0/ln(2)
const __m256d gCoeff1 = _mm256_set1_pd(1.0 / 3);
const __m256d gCoeff2 = _mm256_set1_pd(1.0 / 5);
const __m256d gCoeff3 = _mm256_set1_pd(1.0 / 7);
const __m256d gCoeff4 = _mm256_set1_pd(1.0 / 9);
const __m256d gVect1 = _mm256_set1_pd(1.0);
}
__m256d __vectorcall Log2(__m256d x) {
const __m256i exps64 = _mm256_srli_epi64(_mm256_and_si256(gDoubleExpMask, _mm256_castpd_si256(x)), 52);
const __m256i exps32_avx = _mm256_permutevar8x32_epi32(exps64, gTo32bitExp);
const __m128i exps32_sse = _mm256_castsi256_si128(exps32_avx);
const __m128i normExps = _mm_sub_epi32(exps32_sse, gExpNormalizer);
const __m256d expsPD = _mm256_cvtepi32_pd(normExps);
const __m256d y = _mm256_or_pd(_mm256_castsi256_pd(gDoubleExp0),
_mm256_andnot_pd(_mm256_castsi256_pd(gDoubleExpMask), x));
// Calculate t=(y-1)/(y+1) and t**2
const __m256d tNum = _mm256_sub_pd(y, gVect1);
const __m256d tDen = _mm256_add_pd(y, gVect1);
const __m256d t = _mm256_div_pd(tNum, tDen);
const __m256d t2 = _mm256_mul_pd(t, t); // t**2
const __m256d t3 = _mm256_mul_pd(t, t2); // t**3
const __m256d terms01 = _mm256_fmadd_pd(gCoeff1, t3, t);
const __m256d t5 = _mm256_mul_pd(t3, t2); // t**5
const __m256d terms012 = _mm256_fmadd_pd(gCoeff2, t5, terms01);
const __m256d t7 = _mm256_mul_pd(t5, t2); // t**7
const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3, t7, terms012);
const __m256d t9 = _mm256_mul_pd(t7, t2); // t**9
const __m256d terms01234 = _mm256_fmadd_pd(gCoeff4, t9, terms0123);
const __m256d log2_y = _mm256_mul_pd(terms01234, gCommMul);
const __m256d log2_x = _mm256_add_pd(log2_y, expsPD);
return log2_x;
}
So far my implementation gives 405 268 490 operations per second, and it seems precise till the 8th digit. The performance is measured with the following function:
#include <chrono>
#include <cmath>
#include <cstdio>
#include <immintrin.h>
// ... Log2() implementation here
const int64_t cnLogs = 100 * 1000 * 1000;
void BenchmarkLog2Vect() {
__m256d sums = _mm256_setzero_pd();
auto start = std::chrono::high_resolution_clock::now();
for (int64_t i = 1; i <= cnLogs; i += 4) {
const __m256d x = _mm256_set_pd(double(i+3), double(i+2), double(i+1), double(i));
const __m256d logs = Log2(x);
sums = _mm256_add_pd(sums, logs);
}
auto elapsed = std::chrono::high_resolution_clock::now() - start;
double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
double sum = sums.m256d_f64[0] + sums.m256d_f64[1] + sums.m256d_f64[2] + sums.m256d_f64[3];
printf("Vect Log2: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}
Comparing to the results of Logarithm in C++ and assembly , the current vector implementation is 4 times faster than std::log2()
and 2.5 times faster than std::log()
.