Efficient implementation of log2(__m256d) in AVX2

Question

SVML's __m256d _mm256_log2_pd (__m256d a) is not available on compilers other than Intel's, and its performance is reportedly handicapped on AMD processors. There are some implementations on the internet, referenced in AVX log intrinsics (_mm256_log_ps) missing in g++-4.8? and SIMD math libraries for SSE and AVX, but they seem to target SSE more than AVX2. There's also Agner Fog's vector library, but it's a large library containing much more than just vector log2, so it's hard to pick out from its implementation the parts that are essential for the vector log2 operation alone.

So can someone just explain how to implement log2() operation for a vector of 4 double numbers efficiently? I.e. like what __m256d _mm256_log2_pd (__m256d a) does, but available for other compilers and reasonably efficient for both AMD and Intel processors.

EDIT: In my current specific case, the numbers are probabilities between 0 and 1, and the logarithm is used for entropy computation: the negation of the sum over all i of P[i]*log(P[i]). The range of floating-point exponents for P[i] is large, so the numbers can be close to 0. I'm not sure how much accuracy I need, so I would consider any solution that gives at least 30 bits of mantissa; a tunable solution is preferred.
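For reference, the entropy computation described here can be sketched in scalar form as follows. This is only an illustration of the formula: the function name entropy_bits is hypothetical, std::log2 stands in for whatever vector log2 ends up being used, and zero probabilities are skipped on the usual convention that 0*log2(0) counts as 0.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: Shannon entropy in bits, H = -sum_i P[i] * log2(P[i]).
inline double entropy_bits(const std::vector<double>& p) {
  double h = 0.0;
  for (double pi : p) {
    // Skip zeros: 0 * log2(0) is treated as 0, and log2(0) itself is -Inf.
    if (pi > 0.0) h -= pi * std::log2(pi);
  }
  return h;
}
```

A uniform distribution over two outcomes gives about 1 bit, and over four outcomes about 2 bits.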

EDIT2: here is my implementation so far, based on "More efficient series" from https://en.wikipedia.org/wiki/Logarithm#Power_series . How can it be improved? (both performance and accuracy improvements are desired)

namespace {
  const __m256i gDoubleExpMask = _mm256_set1_epi64x(0x7ffULL << 52);
  const __m256i gDoubleExp0 = _mm256_set1_epi64x(1023ULL << 52);
  const __m256i gTo32bitExp = _mm256_set_epi32(0, 0, 0, 0, 6, 4, 2, 0);
  const __m128i gExpNormalizer = _mm_set1_epi32(1023);
  //TODO: some 128-bit variable or two 64-bit variables here?
  const __m256d gCommMul = _mm256_set1_pd(2.0 / 0.693147180559945309417); // 2.0/ln(2)
  const __m256d gCoeff1 = _mm256_set1_pd(1.0 / 3);
  const __m256d gCoeff2 = _mm256_set1_pd(1.0 / 5);
  const __m256d gCoeff3 = _mm256_set1_pd(1.0 / 7);
  const __m256d gCoeff4 = _mm256_set1_pd(1.0 / 9);
  const __m256d gVect1 = _mm256_set1_pd(1.0);
}

__m256d __vectorcall Log2(__m256d x) {
  const __m256i exps64 = _mm256_srli_epi64(_mm256_and_si256(gDoubleExpMask, _mm256_castpd_si256(x)), 52);
  const __m256i exps32_avx = _mm256_permutevar8x32_epi32(exps64, gTo32bitExp);
  const __m128i exps32_sse = _mm256_castsi256_si128(exps32_avx);
  const __m128i normExps = _mm_sub_epi32(exps32_sse, gExpNormalizer);
  const __m256d expsPD = _mm256_cvtepi32_pd(normExps);
  const __m256d y = _mm256_or_pd(_mm256_castsi256_pd(gDoubleExp0),
    _mm256_andnot_pd(_mm256_castsi256_pd(gDoubleExpMask), x));

  // Calculate t=(y-1)/(y+1) and t**2
  const __m256d tNum = _mm256_sub_pd(y, gVect1);
  const __m256d tDen = _mm256_add_pd(y, gVect1);
  const __m256d t = _mm256_div_pd(tNum, tDen);
  const __m256d t2 = _mm256_mul_pd(t, t); // t**2

  const __m256d t3 = _mm256_mul_pd(t, t2); // t**3
  const __m256d terms01 = _mm256_fmadd_pd(gCoeff1, t3, t);
  const __m256d t5 = _mm256_mul_pd(t3, t2); // t**5
  const __m256d terms012 = _mm256_fmadd_pd(gCoeff2, t5, terms01);
  const __m256d t7 = _mm256_mul_pd(t5, t2); // t**7
  const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3, t7, terms012);
  const __m256d t9 = _mm256_mul_pd(t7, t2); // t**9
  const __m256d terms01234 = _mm256_fmadd_pd(gCoeff4, t9, terms0123);

  const __m256d log2_y = _mm256_mul_pd(terms01234, gCommMul);
  const __m256d log2_x = _mm256_add_pd(log2_y, expsPD);

  return log2_x;
}

So far my implementation performs 405 268 490 operations per second, and it seems precise to the 8th digit. The performance is measured with the following function:

#include <chrono>
#include <cmath>
#include <cstdint> // int64_t
#include <cstdio>
#include <immintrin.h>

// ... Log2() implementation here

const int64_t cnLogs = 100 * 1000 * 1000;

void BenchmarkLog2Vect() {
  __m256d sums = _mm256_setzero_pd();
  auto start = std::chrono::high_resolution_clock::now();
  for (int64_t i = 1; i <= cnLogs; i += 4) {
    const __m256d x = _mm256_set_pd(double(i+3), double(i+2), double(i+1), double(i));
    const __m256d logs = Log2(x);
    sums = _mm256_add_pd(sums, logs);
  }
  auto elapsed = std::chrono::high_resolution_clock::now() - start;
  double nSec = 1e-6 * std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
  double sum = sums.m256d_f64[0] + sums.m256d_f64[1] + sums.m256d_f64[2] + sums.m256d_f64[3];
  printf("Vect Log2: %.3lf Ops/sec calculated %.3lf\n", cnLogs / nSec, sum);
}

Compared to the results of Logarithm in C++ and assembly, the current vector implementation is 4 times faster than std::log2() and 2.5 times faster than std::log().

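Specifically, the following approximation formula is used (the "More efficient series" linked above, truncated to the five terms the code evaluates), where x = 2^e * y with y in [1, 2):

$$\log_2 x \approx e + \frac{2}{\ln 2}\left(t + \frac{t^3}{3} + \frac{t^5}{5} + \frac{t^7}{7} + \frac{t^9}{9}\right), \qquad t = \frac{y-1}{y+1}$$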
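A scalar sketch of the series that the Log2() code above evaluates, with the number of terms as a parameter, makes it easy to see how accuracy changes with the term count. The name series_log2 is hypothetical, and std::frexp is used here instead of bit masking just to keep the sketch short; it assumes x is positive and finite.

```cpp
#include <cmath>

// Hypothetical helper: log2(x) via the series in t = (y-1)/(y+1),
// where x = 2^e * y and y is in [1, 2). Assumes x > 0 and finite.
inline double series_log2(double x, int terms) {
  int e = 0;
  // frexp returns a fraction in [0.5, 1); rescale it to [1, 2).
  const double y = 2.0 * std::frexp(x, &e);
  e -= 1;
  const double t = (y - 1.0) / (y + 1.0);
  const double t2 = t * t;
  double power = t;  // holds t^(2k+1)
  double sum = 0.0;
  for (int k = 0; k < terms; ++k) {
    sum += power / (2 * k + 1);
    power *= t2;
  }
  // ln(y) = 2 * sum, so log2(y) = (2 / ln 2) * sum.
  return e + sum * (2.0 / std::log(2.0));
}
```

With terms = 5 this mirrors the question's code. Since t can be as large as 1/3 when y approaches 2, the first dropped term is around 1e-6 there, so five terms cannot reach full double precision over the whole [1, 2) range.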

Can't you just use the log function in avx_mathfun and multiply the result by the required constant ? — Paul R, Aug 19 '17 at 10:32
@PaulR , it's for `float`, not `double`. At a minimum I don't know how to get the constants like `cephes_log_p0` for double numbers: https://github.com/reyoung/avx_mathfun/blob/master/avx_mathfun.h — Serge Rogatch, Aug 19 '17 at 10:42
Ah - hadn't noticed that - in that case I suggest looking for an SSE solution (e.g. see [this question and its answers](https://stackoverflow.com/q/4431505/253056)) - it should be easy to extend an SSE implementation to AVX. — Paul R, Aug 19 '17 at 10:49
@SergeRogatch You can use your favorite method of polynomial fitting to generate these constants. — , Aug 19 '17 at 12:09
@EOF, see http://www.agner.org/optimize/blog/read.php?i=209&v=t , http://www.agner.org/optimize/blog/read.php?i=115&v=t . Also Intel itself in its description for SVML says it's "optimized for Intel processors", while what really happens is a processor vendor check, and then branching to suboptimal code if it's not Intel. — Serge Rogatch, Aug 19 '17 at 15:52
@SergeRogatch Considering that these blog entries are several years old and show fairly limited effects, I'd say the tested libraries meet the requirement of being "reasonably" efficient on both AMD and Intel processors. — EOF, Aug 19 '17 at 16:20
Intel recently added AVX2 and FMA optimized math functions to glibc https://phoronix.com/scan.php?page=news_item&px=Intel-AVX2-FMA-Math-Glibc-2.27 — Z boson, Aug 23 '17 at 09:21

Peter Cordes · Answer 1 · 2020-03-29T23:31:06.163

The usual strategy is based on the identity log(a*b) = log(a) + log(b), or in this case log2(2^exponent * mantissa) = log2(2^exponent) + log2(mantissa). Or simplifying, exponent + log2(mantissa). The mantissa has a very limited range, 1.0 to 2.0, so a polynomial for log2(mantissa) only has to fit over that very limited range. (Or equivalently, mantissa = 0.5 to 1.0, and change the exponent bias-correction constant by 1).
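As a one-lane illustration of that split, here is a scalar sketch that pulls the exponent and a mantissa in [1.0, 2.0) out of a double with the same mask-and-OR idea the vector code uses. The names ExpMant and split_double are hypothetical, and the sketch assumes x is positive, finite and not subnormal.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical result type: x == mantissa * 2^exponent, mantissa in [1, 2).
struct ExpMant {
  int exponent;
  double mantissa;
};

// Assumes x is positive, finite and normal (no special-case handling).
inline ExpMant split_double(double x) {
  uint64_t bits;
  std::memcpy(&bits, &x, sizeof bits);  // well-defined type pun
  // The 11 exponent bits sit above the 52 mantissa bits; 1023 is the bias.
  const int exponent = static_cast<int>((bits >> 52) & 0x7ff) - 1023;
  // Clear the exponent field, then OR in the exponent of 1.0.
  const uint64_t mant_bits = (bits & ~(0x7ffULL << 52)) | (1023ULL << 52);
  double mantissa;
  std::memcpy(&mantissa, &mant_bits, sizeof mantissa);
  return {exponent, mantissa};
}
```

For example, 10.0 splits into exponent 3 and mantissa 1.25, so log2(10.0) = 3 + log2(1.25).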

A Taylor series expansion is a good starting point for the coefficients, but you usually want to minimize the max-absolute-error (or relative error) over that specific range, and Taylor series coefficients likely have a lower or higher outlier over that range, rather than having the max positive error nearly matching the max negative error. So you can do what's called a minimax fit of the coefficients.

If it's important that your function evaluates log2(1.0) to exactly 0.0, you can arrange for that to happen by using mantissa-1.0 as the input to your polynomial, with no constant coefficient, because 0.0 ^ n = 0.0. This also greatly improves the relative error for inputs near 1.0, even in cases where the absolute error was already small.

How accurate do you need it to be, and over what range of inputs? As usual there's a tradeoff between accuracy and speed, but fortunately it's pretty easy to move along that scale by e.g. adding one more polynomial term (and re-fitting the coefficients), or by dropping some rounding-error avoidance.

Agner Fog's VCL implementation of log_d() aims for very high accuracy, using tricks to avoid rounding error by avoiding things that might result in adding a small and a large number when possible. This obscures the basic design somewhat.

For a faster more approximate float log(), see the polynomial implementation on http://jrfonseca.blogspot.ca/2008/09/fast-sse2-pow-tables-or-polynomials.html. It leaves out a LOT of the extra precision-gaining tricks that VCL uses, so it's easier to understand. It uses a polynomial approximation for the mantissa over the 1.0 to 2.0 range.

(That's the real trick to log() implementations: you only need a polynomial that works over a small range.)

It already just does log2 instead of log, unlike VCL's where the log-base-e is baked in to the constants and how it uses them. Reading it is probably a good starting point for understanding exponent + polynomial(mantissa) implementations of log().

Even the highest-precision version of it is not full float precision, let alone double, but you could fit a polynomial with more terms. Or apparently a ratio of two polynomials works well; that's what VCL uses for double.

I got excellent results from porting JRF's SSE2 function to AVX2 + FMA (and especially AVX512 with _mm512_getexp_ps and _mm512_getmant_ps), once I tuned it carefully. (It was part of a commercial project, so I don't think I can post the code.) A fast approximate implementation for float was exactly what I wanted.

In my use-case, each jrf_fastlog() was independent, so OOO execution nicely hid the FMA latency, and it wasn't even worth using the higher-ILP shorter-latency polynomial evaluation method that VCL's polynomial_5() function uses ("Estrin's scheme", which does some non-FMA multiplies before the FMAs, resulting in more total instructions).
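To make the Horner vs. Estrin trade-off concrete, here is a scalar sketch of both evaluation orders for a 5th-degree polynomial. The function names and the coefficient layout (c[i] multiplies x^i) are illustrative assumptions, not VCL's actual polynomial_5() code.

```cpp
#include <array>
#include <cmath>

// Horner's rule: five FMAs, each depending on the previous one.
inline double horner5(double x, const std::array<double, 6>& c) {
  double r = c[5];
  for (int i = 4; i >= 0; --i) r = std::fma(r, x, c[i]);
  return r;
}

// Estrin's scheme: (c0 + c1*x) + (c2 + c3*x)*x^2 + (c4 + c5*x)*x^4.
// Two plain multiplies plus five FMAs, but the three pair FMAs and the
// powers of x are independent, so the dependency chain is shorter.
inline double estrin5(double x, const std::array<double, 6>& c) {
  const double x2 = x * x;
  const double x4 = x2 * x2;
  const double p01 = std::fma(c[1], x, c[0]);
  const double p23 = std::fma(c[3], x, c[2]);
  const double p45 = std::fma(c[5], x, c[4]);
  return std::fma(p45, x4, std::fma(p23, x2, p01));
}
```

Both functions compute the same polynomial; they can differ in the last bits because the rounding happens in a different order.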

Agner Fog's VCL is now Apache-licensed, so any project can just include it directly. If you want high accuracy, you should just use VCL directly. It's header-only, just inline functions, so it won't bloat your binary.

VCL's log float and double functions are in vectormath_exp.h. There are two main parts to the algorithm:

extract the exponent bits and convert that integer back into a float (after adjusting for the bias that IEEE FP uses).
extract the mantissa and OR in some exponent bits to get a vector of double values in the [0.5, 1.0) range. (Or (0.5, 1.0], I forget).

Further adjust this with if(mantissa <= SQRT2*0.5) { mantissa += mantissa; exponent++;}, and then mantissa -= 1.0.

Use a polynomial approximation to log(x) that is accurate around x=1.0. (For double, VCL's log_d() uses a ratio of two 5th-order polynomials; @harold says this is often good for precision.) One division mixed in with a lot of FMAs doesn't usually hurt throughput, but it does have higher latency than an FMA. Using vrcpps + a Newton-Raphson iteration is typically slower than just using vdivps on modern hardware. Using a ratio also creates more ILP by evaluating two lower-order polynomials in parallel, instead of one high-order polynomial, and may lower overall latency vs. one long dep chain for a high-order polynomial (which would also accumulate significant rounding error along that one long chain).

Then add exponent + polynomial_approx_log(mantissa) to get the final log() result. VCL does this in multiple steps to reduce rounding error. ln2_lo + ln2_hi = ln(2). It's split up into a small and a large constant to reduce rounding error.

// res is the polynomial(adjusted_mantissa) result
// fe is the float exponent
// x is the adjusted_mantissa.  x2 = x*x;
res  = mul_add(fe, ln2_lo, res);             // res += fe * ln2_lo;
res += nmul_add(x2, 0.5, x);                 // res += x  - 0.5 * x2;
res  = mul_add(fe, ln2_hi, res);             // res += fe * ln2_hi;

You can drop the 2-step ln2 stuff and just use VM_LN2 if you aren't aiming for 0.5 or 1 ulp accuracy (or whatever this function actually provides; IDK.)
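The reason the split helps can be shown with a scalar sketch. The two constants below are the Cephes-style split of ln(2); treat the exact values as an assumption rather than a quote from VCL, and the helper name add_exponent_ln2 as hypothetical. The point is that the "hi" part has only a few significant bits, so multiplying it by an integer-valued exponent is exact.

```cpp
#include <cmath>

// Assumed split of ln(2): kLn2Hi + kLn2Lo is ln(2) to beyond double precision.
constexpr double kLn2Hi = 0.693359375;                  // 355/512, few significant bits
constexpr double kLn2Lo = -2.121944400546905827679e-4;  // ln(2) - kLn2Hi

// Hypothetical helper: returns r + fe * ln(2), where fe is the exponent as a double.
inline double add_exponent_ln2(double fe, double r) {
  // Add the small correction first, while r is still small.
  r = std::fma(fe, kLn2Lo, r);
  // fe * kLn2Hi is exactly representable for any integer-valued double exponent,
  // so this step rounds only once, in the final addition.
  return std::fma(fe, kLn2Hi, r);
}
```

With a single constant, fe * ln(2) would already carry a rounding error proportional to fe before the polynomial result is even added.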

The x - 0.5*x2 part is really an extra polynomial term, I guess. This is what I meant by log base e being baked-in: you'd need a coefficient on those terms, or to get rid of that line and re-fit the polynomial coefficients for log2. You can't just multiply all the polynomial coefficients by a constant.

After that, it checks for underflow, overflow or denormal, and branches if any element in the vector needs special processing to produce a proper NaN or -Inf rather than whatever garbage we got from the polynomial + exponent. If your values are known to be finite and positive, you can comment out this part and get a significant speedup (even the checking before the branch takes several instructions).

http://gallium.inria.fr/blog/fast-vectorizable-math-approx/ has some material on how to evaluate relative and absolute error in a polynomial approximation, and on doing a minimax fit of the coefficients instead of just using a Taylor series expansion.
http://www.machinedlearnings.com/2011/06/fast-approximate-logarithm-exponential.html describes an interesting approach: it type-puns a float to uint32_t, and converts that integer to float. Since IEEE binary32 floats store the exponent in higher bits than the mantissa, the resulting float mostly represents the value of the exponent, scaled by 1 << 23, but also containing information from the mantissa.

Then it uses an expression with a couple coefficients to fix things up and get a log() approximation. It includes a division by (constant + mantissa) to correct for the mantissa pollution when converting the float bit-pattern to float. I found that a vectorized version of that was slower and less accurate with AVX2 on HSW and SKL than JRF fastlog with 4th-order polynomials. (Especially when using it as part of a fast arcsinh which also uses the divide unit for vsqrtps.)
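The core of that trick, without the fix-up expression, can be sketched in a few lines. This is only the crude first step (the coefficients from the linked article are not reproduced here), and the name rough_log2 is hypothetical. It assumes x is positive, finite and normal.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper: crude log2(x) from the float's bit pattern alone.
inline float rough_log2(float x) {
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof bits);
  // Read as an integer, the pattern is (biased_exponent << 23) + mantissa_bits.
  // Scaling by 2^-23 gives biased_exponent + fraction, and subtracting the
  // bias of 127 leaves exponent + (mantissa - 1), a piecewise-linear
  // approximation of log2(x) that is exact at powers of two.
  return static_cast<float>(bits) * (1.0f / (1u << 23)) - 127.0f;
}
```

For 8.0f this returns exactly 3, and for 10.0f it returns 3.25 where the true value is about 3.32; that gap between (mantissa - 1) and log2(mantissa) is the "mantissa pollution" the fix-up terms are meant to correct.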

@SergeRogatch: If you don't need close to full 53-bit accuracy, a simple polynomial (or ratio of two polynomials) should work well. You can probably leave out all the order-of-addition tricks that VCL uses. Go for something like the JRF `float` version, but with a ratio of two 4th or 5th order polynomials. (And probably still with that `poly * (mantissa - 1.0)` at the end to make sure it goes to zero when it should). — Peter Cordes, Aug 21 '17 at 16:45

Serge Rogatch · Accepted Answer · 2017-08-27T12:19:03.107

Finally, here is my best result, which on a Ryzen 1800X @3.6GHz gives about 0.8 billion logarithms per second (200 million vectors of 4 logarithms each) in a single thread, and is accurate to within the last few bits of the mantissa. Spoiler: see the end for how to increase performance to 0.87 billion logarithms per second.

Special cases: Negative numbers, negative infinity and NaNs with negative sign bit are treated as if they are very close to 0 (result in some garbage large negative "logarithm" values). Positive infinity and NaNs with positive sign bit result in a logarithm around 1024. If you don't like how special cases are treated, one option is to add code that checks for them and does what suits you better. This will make the computation slower.
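If the special cases do matter, the checks can be sketched in scalar form as below. This is an assumption about what "does what suits you better" might look like, following the usual log2 conventions; the name checked_log2 is hypothetical, and fast_log2 stands for any approximation that is only valid for positive, finite, normal inputs. In vector code the same decisions would typically be made with compares and blends rather than branches.

```cpp
#include <cmath>
#include <limits>

// Hypothetical wrapper around a fast approximation.
template <class FastLog2>
inline double checked_log2(double x, FastLog2 fast_log2) {
  // NaN and negative inputs have no real logarithm.
  if (std::isnan(x) || x < 0.0) return std::numeric_limits<double>::quiet_NaN();
  // log2(0) is -Inf by convention (this also covers -0.0).
  if (x == 0.0) return -std::numeric_limits<double>::infinity();
  // log2(+Inf) is +Inf.
  if (std::isinf(x)) return x;
  // Subnormals have no implicit leading 1 bit, so exponent/mantissa
  // extraction by masking does not work; fall back to the library here.
  if (x < std::numeric_limits<double>::min()) return std::log2(x);
  return fast_log2(x);
}
```

As the answer notes, every such check costs instructions, so it is only worth adding for inputs that can actually occur.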

namespace {
  // The limit is 19 because we process only high 32 bits of doubles, and out of
  //   20 bits of mantissa there, 1 bit is used for rounding.
  constexpr uint8_t cnLog2TblBits = 10; // 1024 numbers times 8 bytes = 8KB.
  constexpr uint16_t cZeroExp = 1023;
  const __m256i gDoubleNotExp = _mm256_set1_epi64x(~(0x7ffULL << 52));
  const __m256d gDoubleExp0 = _mm256_castsi256_pd(_mm256_set1_epi64x(1023ULL << 52));
  const __m256i cAvxExp2YMask = _mm256_set1_epi64x(
    ~((1ULL << (52-cnLog2TblBits)) - 1) );
  const __m256d cPlusBit = _mm256_castsi256_pd(_mm256_set1_epi64x(
    1ULL << (52 - cnLog2TblBits - 1)));
  const __m256d gCommMul1 = _mm256_set1_pd(2.0 / 0.693147180559945309417); // 2.0/ln(2)
  const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
  const __m128i cSseMantTblMask = _mm_set1_epi32((1 << cnLog2TblBits) - 1);
  const __m128i gExpNorm0 = _mm_set1_epi32(1023);
  // plus |cnLog2TblBits|th highest mantissa bit
  double gPlusLog2Table[1 << cnLog2TblBits];
} // anonymous namespace

void InitLog2Table() {
  for(uint32_t i=0; i<(1<<cnLog2TblBits); i++) {
    const uint64_t iZp = (uint64_t(cZeroExp) << 52)
      | (uint64_t(i) << (52 - cnLog2TblBits)) | (1ULL << (52 - cnLog2TblBits - 1));
    double zp;
    std::memcpy(&zp, &iZp, sizeof zp); // needs <cstring>; avoids the aliasing violation of reinterpret_cast
    const double l2zp = std::log2(zp);
    gPlusLog2Table[i] = l2zp;
  }
}

__m256d __vectorcall Log2TblPlus(__m256d x) {
  const __m256d zClearExp = _mm256_and_pd(_mm256_castsi256_pd(gDoubleNotExp), x);
  const __m256d z = _mm256_or_pd(zClearExp, gDoubleExp0);

  const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(
    _mm256_castpd_si256(x), gHigh32Permute));
  // This requires that x is non-negative, because the sign bit is not cleared before
  //   computing the exponent.
  const __m128i exps32 = _mm_srai_epi32(high32, 20);
  const __m128i normExps = _mm_sub_epi32(exps32, gExpNorm0);

  // Compute y as approximately equal to log2(z)
  const __m128i indexes = _mm_and_si128(cSseMantTblMask,
    _mm_srai_epi32(high32, 20 - cnLog2TblBits));
  const __m256d y = _mm256_i32gather_pd(gPlusLog2Table, indexes,
    /*number of bytes per item*/ 8);
  // Compute A as z/exp2(y)
  const __m256d exp2_Y = _mm256_or_pd(
    cPlusBit, _mm256_and_pd(z, _mm256_castsi256_pd(cAvxExp2YMask)));

  // Calculate t=(A-1)/(A+1). Both numerator and denominator would be divided by exp2_Y
  const __m256d tNum = _mm256_sub_pd(z, exp2_Y);
  const __m256d tDen = _mm256_add_pd(z, exp2_Y);

  // Compute the first polynomial term from "More efficient series" of https://en.wikipedia.org/wiki/Logarithm#Power_series
  const __m256d t = _mm256_div_pd(tNum, tDen);

  const __m256d log2_z = _mm256_fmadd_pd(t, gCommMul1, y);

  // Leading integer part for the logarithm
  const __m256d leading = _mm256_cvtepi32_pd(normExps);

  const __m256d log2_x = _mm256_add_pd(log2_z, leading);
  return log2_x;
}

It uses a combination of a lookup table and a 1st-degree polynomial, mostly described on Wikipedia (the link is in the code comments). I can afford to allocate 8KB of L1 cache here (which is half of the 16KB of L1 cache available per logical core), because logarithm computation is really the bottleneck for me and not much else needs L1 cache.
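A scalar, one-lane sketch of the same table-plus-first-term idea may make the structure of Log2TblPlus() easier to follow. Everything in namespace sketch is an illustrative name rather than part of the code above, the table is filled lazily with std::log2, and the input is assumed to be positive, finite and normal.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

namespace sketch {
constexpr int kTblBits = 10;  // 1024 entries, like cnLog2TblBits above

inline double from_bits(uint64_t b) { double d; std::memcpy(&d, &b, sizeof d); return d; }
inline uint64_t to_bits(double d) { uint64_t b; std::memcpy(&b, &d, sizeof b); return b; }

// Midpoint of the i-th mantissa bin in [1, 2): top kTblBits bits are i,
// and the next lower bit is set.
inline double bin_midpoint(uint64_t i) {
  return from_bits((1023ULL << 52) | (i << (52 - kTblBits)) |
                   (1ULL << (52 - kTblBits - 1)));
}

// table[i] = log2(midpoint of bin i), built once on first use.
inline const std::vector<double>& table() {
  static const std::vector<double> t = [] {
    std::vector<double> v(1u << kTblBits);
    for (uint64_t i = 0; i < v.size(); ++i) v[i] = std::log2(bin_midpoint(i));
    return v;
  }();
  return t;
}

// log2(x) = exponent + log2(c) + log2(z / c), where z is the mantissa in [1, 2)
// and c is the midpoint of z's bin. Since z / c is very close to 1, the first
// term of the series, (2 / ln 2) * (z - c) / (z + c), is already a good estimate.
inline double log2_table_scalar(double x) {
  const uint64_t bits = to_bits(x);
  const int exponent = static_cast<int>(bits >> 52) - 1023;  // sign bit is 0
  const uint64_t mant = bits & ((1ULL << 52) - 1);
  const double z = from_bits((1023ULL << 52) | mant);
  const uint64_t index = mant >> (52 - kTblBits);
  const double c = bin_midpoint(index);
  const double t = (z - c) / (z + c);
  return exponent + table()[index] + t * (2.0 / 0.693147180559945309417);
}
}  // namespace sketch
```

With 10 table bits, |t| stays below roughly 2^-12, so the first dropped series term is on the order of 1e-11; a smaller table makes t larger, which is why fewer table bits call for more polynomial terms.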

However, if you need more L1 cache for other needs, you can decrease the amount of cache used by the logarithm algorithm by reducing cnLog2TblBits to e.g. 5, at the expense of decreasing the accuracy of the logarithm computation.

Or to keep the accuracy high, you can increase the number of polynomial terms by adding:

namespace {
  // ...
  const __m256d gCoeff1 = _mm256_set1_pd(1.0 / 3);
  const __m256d gCoeff2 = _mm256_set1_pd(1.0 / 5);
  const __m256d gCoeff3 = _mm256_set1_pd(1.0 / 7);
  const __m256d gCoeff4 = _mm256_set1_pd(1.0 / 9);
  const __m256d gCoeff5 = _mm256_set1_pd(1.0 / 11);
}

And then changing the tail of Log2TblPlus() after line const __m256d t = _mm256_div_pd(tNum, tDen);:

  const __m256d t2 = _mm256_mul_pd(t, t); // t**2

  const __m256d t3 = _mm256_mul_pd(t, t2); // t**3
  const __m256d terms01 = _mm256_fmadd_pd(gCoeff1, t3, t);
  const __m256d t5 = _mm256_mul_pd(t3, t2); // t**5
  const __m256d terms012 = _mm256_fmadd_pd(gCoeff2, t5, terms01);
  const __m256d t7 = _mm256_mul_pd(t5, t2); // t**7
  const __m256d terms0123 = _mm256_fmadd_pd(gCoeff3, t7, terms012);
  const __m256d t9 = _mm256_mul_pd(t7, t2); // t**9
  const __m256d terms01234 = _mm256_fmadd_pd(gCoeff4, t9, terms0123);
  const __m256d t11 = _mm256_mul_pd(t9, t2); // t**11
  const __m256d terms012345 = _mm256_fmadd_pd(gCoeff5, t11, terms01234);

  const __m256d log2_z = _mm256_fmadd_pd(terms012345, gCommMul1, y);

The comment // Leading integer part for the logarithm and the rest of the function then follow unchanged.

Normally you don't need that many terms, even for a few-bit table; I just provided the coefficients and computations for reference. It's likely that if cnLog2TblBits==5, you won't need anything beyond terms012. But I haven't done such measurements, so you need to experiment to find what suits your needs.

The fewer polynomial terms you compute, obviously, the faster the computations are.

EDIT: the question "In what situation would the AVX2 gather instructions be faster than individually loading the data?" suggests that you may get a performance improvement if

const __m256d y = _mm256_i32gather_pd(gPlusLog2Table, indexes,
  /*number of bytes per item*/ 8);

is replaced by

const __m256d y = _mm256_set_pd(gPlusLog2Table[indexes.m128i_u32[3]],
  gPlusLog2Table[indexes.m128i_u32[2]],
  gPlusLog2Table[indexes.m128i_u32[1]],
  gPlusLog2Table[indexes.m128i_u32[0]]);

For my implementation it saves about 1.5 cycles, reducing the total cycle count to compute 4 logarithms from 18 to 16.5, so the performance rises to 0.87 billion logarithms per second. I'm leaving the current implementation as is because it's more idiomatic and should be faster once the CPUs start doing gather operations right (with coalescing like GPUs do).

EDIT2: on Ryzen CPU (but not on Intel) you can get a little more speedup (about 0.5 cycle) by replacing

const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(
  _mm256_castpd_si256(x), gHigh32Permute));

with

  const __m128 hiLane = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));
  const __m128 loLane = _mm_castpd_ps(_mm256_castpd256_pd128(x));
  const __m128i high32 = _mm_castps_si128(_mm_shuffle_ps(loLane, hiLane,
    _MM_SHUFFLE(3, 1, 3, 1)));

Skylake already has efficient gathers (one `vgatherdpd ymm` per 4 cycles, vs. Ryzen's 12c throughput). I'm surprised that a table lookup was better than a bigger polynomial. I guess that's only in a microbenchmark, or in a loop that does mostly `log()` calculations for a very long time. And `double` does need bigger polynomials, so maybe it's not so crazy. Also, Ryzen has about half the FMA throughput of Intel. — Peter Cordes, Aug 27 '17 at 03:02
For Intel CPUs, it would be more efficient to use `i64gather_pd` and 256b vectors, instead of packing down to a `__m128i` for `i32gather_pd`. e.g. use `exps32 = _mm_srli_epi64(x, 20 + 32);`. (Why were you using an arithmetic shift, instead of logical? Did you need the sign-bit broadcast)? Extracting high32 is maybe good for Ryzen, though, because 256b vector instructions are 2 uops instead of 1. So you spend 2 uops extracting to save 4 uops on `__m128i` instructions instead of `__m256i` — Peter Cordes, Aug 27 '17 at 03:10
Using global `const __m256d` constants is probably not helpful. You'd think it would be better this way, but actually you end up with a "constructor" that copies from `.rodata` into the storage for those `const __m256d` variables. i.e. they have non-constant initializers, because `_mm_set` doesn't optimize away at global scope :( See _GLOBAL__sub_I__Z13InitLog2Tablev in https://godbolt.org/g/x8aW62. It's usually best to write your vector constants inside your function, and let the compiler deal with them the same way it deals with string literals like `"hello"` across multiple functions. — Peter Cordes, Aug 27 '17 at 03:47
OTOH, maybe that's what you want, if it puts the constants in memory next to your LUT for better locality. — Peter Cordes, Aug 27 '17 at 04:07
@PeterCordes, I need 32-bit numbers anyway for `_mm256_cvtepi32_pd(normExps)` (because there doesn't seem to be a 64-bit version of it), so I thought 128-bit vector operations would be faster and used them where possible. I used the arithmetic shift (with sign broadcast) to get what suits me better for negative `x`: a large negative logarithm, as if `x` was positive and very close to 0. Is arithmetic shift slower than logical? Where are you looking for latencies and throughputs? And yes, I want better locality for constants. Inline `_set1_` calls seem to give the same performance. — Serge Rogatch, Aug 27 '17 at 10:13
Arithmetic and logical are the same perf (according to [Agner Fog's tables](http://agner.org/optimize/) of course). But there is no `VPSRAQ`, only W/D sizes for arithmetic. On Intel CPUs, 256b vector ops are the same speed as 128b. But you're right that 128b is faster on Ryzen. Anyway, you're right that `_mm256_cvtepi32_pd` requires packing to 128b at some point. With AVX512 you could keep it in-lane and use `cvtepi64_pd`. — Peter Cordes, Aug 27 '17 at 13:57
If the gather is part of the critical path, it would reduce latency to get the indices ready sooner by using an i64gather (no shuffle, and just a larger shift count. Or if you really want an arithmetic shift, you might do arithmetic >>20 and then logical >>32) — Peter Cordes, Aug 27 '17 at 13:59
re: the constants: if you test in a loop, a good compiler will hoist the loads of most of them (into registers outside the loop) so it doesn't matter if they're near the table or not. If they don't all fit, then some will need to stay hot in cache. — Peter Cordes, Aug 27 '17 at 14:00
@PeterCordes, yes, at least in the benchmarking code the disassembly shows that the constants simply stay in registers. It's the production code that will likely not be able to keep them in the registers. The production code will compute entropies of thousands of arrays, each containing millions of items. So it will do billions of `H += P[i]*log2(P[i])` computations, that's why this implementation is good for me if LUT and constants fit into L1 cache. — Serge Rogatch, Aug 27 '17 at 19:38
