7

I need a AVX512 double pow(double, int n) function (I need it for a binomial distribution calculation which needs to be exact). In particular I would like this for Knights Landing which has AVX512ER. One way to get this is

x^n = exp2(log2(x)*n)

Knights Corner has the vlog2ps instruction (_mm512_log2_ps intrinsic) and the vexp223ps instruction (_mm512_exp223_ps intrinsic) so at least I could do float pow(float, float) with those two instructions.

However, with Knights Landing I don't find a log2 instruction. I do find a vexp2pd instruction (_mm512_exp2a23_pd intrinsic) in AVX512ER. I find it strange that Knights Corner has a log2 instruction but Knights Landing which is newer and better does not.

For now I have implemented pow(double, n) using repeated squaring but I think it would be more efficient if I had a log2 instruction.

//AVX2 but easy to convert to AVX512 with mask registers
static __m256d pown_AVX2(__m256d base, __m256i exp) {
  __m256d result = _mm256_set1_pd(1.0);
  int mask = _mm256_testz_si256(exp, exp);
  __m256i onei = _mm256_set1_epi64x(1);
  __m256d onef = _mm256_set1_pd(1.0);
  while(!mask) {
    __m256i t1 = _mm256_and_si256(exp, onei);
    __m256i t2 = _mm256_cmpeq_epi64(t1, _mm256_setzero_si256());
    __m256d t3 = _mm256_blendv_pd(base, onef, _mm256_castsi256_pd(t2));
    result = _mm256_mul_pd(result, t3);
    exp = _mm256_srli_epi64(exp, 1);
    base = _mm256_mul_pd(base,base);
    mask = _mm256_testz_si256(exp, exp);
  }
  return result;
}

Is there a more efficient algorithm to get double pow(double, int n) with AVX512 and AVX512ER than repeated squaring? Is there an easy method (e.g. with a few instructions) to get log2?


Here is the AVX512F version using repeated squaring

static  __m512d pown_AVX512(__m512d base, __m512i pexp) {
  __m512d result = _mm512_set1_pd(1.0);
  __m512i onei = _mm512_set1_epi32(1);
  __mmask8 mask;
  do {
    __m512i t1 = _mm512_and_epi32(pexp, onei);
    __mmask8 mask2 = _mm512_cmp_epi32_mask(onei, t1, 0);
    result = _mm512_mask_mul_pd(result, mask2, result, base);
    pexp = _mm512_srli_epi32(pexp, 1);
    base = _mm512_mul_pd(base,base);
    mask = _mm512_test_epi32_mask(pexp, pexp);
  } while(mask);
  return result;
}

The exponents are int32 not int64. Ideally I would use __m256i for the eight integers. However, this requires AVX512VL which extends the 512b operations to 256b and 128b but KNL does not have AVX512VL. Instead I use the 512b operations on 32-bit integers and I cast the 16b mask to 8b.

Community
  • 1
  • 1
Z boson
  • 32,619
  • 11
  • 123
  • 226
  • `x^n = log2(x)*2^n` is not an identity: 4^3 = 64 != log2(4)*2^3 = 16 – Margaret Bloom Feb 07 '17 at 11:41
  • 1
    mmm I found the `VGETEXPPD` that computes, according to Intel, floor(log2(|x|)). It should be in the AVX2 Foundation class. Can that be useful? – Margaret Bloom Feb 07 '17 at 13:47
  • @MargaretBloom, good fine. I actually saw those yesterday and forgot to mention them. I'm not sure if they are sufficient. I need `exp2(log2(x)*n)` not `exp2(floor(log2(x)*n))` but maybe another step would fix it. – Z boson Feb 07 '17 at 14:09
  • 1
    If you pre-compute the bit-length of the largest `n`, you can use `bsr` to loop over an integer which eliminates the `_mm256_testz_si256(exp, exp)`. – Mysticial Feb 07 '17 at 16:38
  • 1
    If you reverse the direction of the iteration and pre-shift `exp` so that you're testing the sign bit, you can eliminate both the `_mm256_and_si256(exp, onei)` and the `_mm256_cmpeq_epi64(t1, _mm256_setzero_si256())`. But that puts the squaring on the critical path. – Mysticial Feb 07 '17 at 16:40
  • @Mysticial, I understand your first comment about `bsr` I think. But how does that help? Is `vtestpd` sets the flag register. Even if I decremented an int to zero I would still need to test for zero each iteration. Maybe I'm not thinking straight right now but it seems to me I would be replacing `vtestpd` for `dec` and adding a `max` and `bsr` instruction before the loop. I am still thinking about your second comment. I will probably have to ask more tomorrow. – Z boson Feb 07 '17 at 20:39
  • If you call `bsr` on the largest exponent, you can get the exact number of iterations you need to run. That way your loop condition changes to an integer check instead of the more expensive `_mm256_testz_si256()`. Not sure if that's actually worth the savings though. – Mysticial Feb 07 '17 at 20:44
  • If `x` is fixed, you can precompute the powers of `x` and eliminate all the squares. If the exponent `n` is also constant among all the SIMD lanes, you can then skip entire iterations that correspond to zero-bits in `n`. And it's a perfect place to use `bsr` or `bsf` which also lets you branch out immediately when it reaches zero. – Mysticial Feb 07 '17 at 21:03
  • @Mysticial, neither `x` nor `n` are fixed. The range of `n` is limited to `n<=100` though. We might get more benefit from sorting `n` first. I will try your `bsr` suggestion. – Z boson Feb 08 '17 at 09:58
  • @Mysticial how do I get the maximum `n` for `bsr` efficiently? Horizontal max [is hideous](https://godbolt.org/g/7SYYq3). – Z boson Feb 09 '17 at 09:47
  • @Mysticial, I updated my answer with a AVX512 version which I have tested. If I could get maximum bit I could unroll the loop into a switch statement with 7 cases since `n<=100` the maximum bit is 7. – Z boson Feb 09 '17 at 09:58
  • I wonder what the down vote was for. I'm probably doing something silly with AVX512 which could be done better. – Z boson Feb 09 '17 at 12:42
  • It will depend on whether you have enough iterations to offset the cost of the horizontal max. The AVX512 version doesn't lend itself to abusing the sign-bit for the mask. If you have a lot of iterations (very large exponent), I wonder if the windowing algorithm is suitable using a permute-based lookup table. – Mysticial Feb 09 '17 at 16:29
  • 3
    @Zboson "Is there a more efficient algorithm to get double pow(double, int n) with AVX512 and AVX512ER than repeated squaring?" For small exponents like in your case (n<=100), you could probably use [addition-chain exponentiation](http://math.stackexchange.com/questions/127685/list-of-the-minimal-addition-chains) which will be slightly faster, but requires more memory. However, not sure if it's worth it. – n0p Feb 20 '17 at 11:21
  • An easy solution is to use the Vector Class Library. It has optimized power functions with integer and floating point powers. https://github.com/vectorclass/version2 – A Fog Oct 06 '20 at 11:34

0 Answers0