19

As the title says, I need to do a lot of calculations of the form:

re = (x^a + y^a + z^a)^(1/a)

where x, y, z >= 0. More specifically, a is a positive floating-point constant, and x, y, z are floating-point numbers. ^ is the exponentiation operator.

For now I'd prefer not to use SIMD, and hope for some other trick to speed it up.

#include <math.h>

static void heavy_load(void) {
  struct xyz_t {
    float x, y, z;
  };
  struct xyz_t xyzs[10000];
  float re[10000] = {0.0f};
  const float a = 0.2f;

  /* here fill xyzs with some random positive floating-point values */

  for (int i = 0; i < 10000; ++i) {
    /* here is what we need to optimize */
    re[i] = pow(pow(xyzs[i].x, a) + pow(xyzs[i].y, a) + pow(xyzs[i].z, a), 1.0 / a);
  }
}
phuclv
blackball
  • Do you actually need to know `re`? Often you can get by computing `re^a`, which saves you an exponentiation. Can you provide more context? – templatetypedef Sep 26 '13 at 02:52
  • I agree your question is ill-posed. But it's likely the only optimization possible is if `a` happens to be an integer, when you can optimize the inner power ops by repeated factors and the root with `sqrt` when `a` == 2. In the general case you'll be using the `pow` library function 4 times. – Gene Sep 26 '13 at 03:00
  • Have you already exhausted your options for parallel optimization and caching of intermediate values? – paddy Sep 26 '13 at 03:01
  • @Gene, I agree with your point, but my *a* here is always 0.1 or 0.2 – blackball Sep 26 '13 at 03:52
  • If your a is always 0.1 or 0.2, you can at least elide one call to `pow()` by coding the last exponentiation as a multiplication, because that last exponentiation would have an exponent of 10 or 5, respectively. That should bring your time down by around 20%. – cmaster - reinstate monica Sep 26 '13 at 04:10
  • Well, you can't mathematically simplify it, that's for sure. You should run it through profiler and see what takes the most time. – SigTerm Sep 26 '13 at 04:24
  • What are the numeric limits of your variables? I.e., you've said they're positive, but what is their range? – phonetagger Sep 26 '13 at 04:43
  • @cmaster: How can you replace (x^10) or (x^5) with multiplication? – Kuba hasn't forgotten Monica Sep 26 '13 at 04:47
  • @KubaOber: If you enable the -ffast-math option then gcc will automatically optimize it for you. Look at the answer [here](http://stackoverflow.com/questions/6430448/why-doesnt-gcc-optimize-aaaaaa-to-aaaaaa?rq=1) – phuclv Sep 26 '13 at 05:06
  • @cmaster: Well, you replace it with multiple multiplications, so whether it's a net gain you'd have to measure :) Of course if you could get it done using SSE, then it'd be likely worth it. – Kuba hasn't forgotten Monica Sep 26 '13 at 05:09
  • @user1786323: What compiler (including version) do you use to compile this code? What platform does it run on? Using SIMD has a real chance of speeding it up, but we'd need to know your compiler, as alignment attributes will vary. – Kuba hasn't forgotten Monica Sep 26 '13 at 05:21
  • @Kuba I think it's not difficult to use SIMD to speed this tiny code up. That's why I said I'd prefer not to use SIMD in this discussion. – blackball Sep 26 '13 at 05:29
  • Do you want to speed it up or not? :) – Kuba hasn't forgotten Monica Sep 26 '13 at 05:37
  • @KubaOber `pow()` is an expensive function, much more expensive than the three to four cycles you need to do the floating point multiplication. As a matter of fact, even the overhead involved in calling a function consumes more cycles than the multiplications. So, no, I don't need to start profiling to know which one is faster... – cmaster - reinstate monica Sep 26 '13 at 05:57
  • @KubaOber you do this: `a = x*x; b = a*a; result = b*x;` for the `x^5` case. For the `x^10` case, you need to add a third variable `c = b*b` to get the intermediate `x^8` result. – cmaster - reinstate monica Sep 26 '13 at 06:16
  • @user1786323: Do you intend to use arbitrary values of `a`, or one from a small set; if so - what's in that set? – Kuba hasn't forgotten Monica Sep 26 '13 at 06:51
  • @Kuba a is either 0.1 or 0.2. – blackball Sep 26 '13 at 07:04
  • @user1786323: There's no need for that attitude. You haven't explained *why* you want to avoid SIMD. What is the goal here? Are you running on some weird architecture? Improving your question may help focus/improve the answers. – Oliver Charlesworth Sep 26 '13 at 11:01
  • @jthill It's my first time hearing about Runge-Kutta. Could you give some details? – blackball Sep 26 '13 at 16:44
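The repeated-multiplication trick suggested in the comments above (replacing the final `pow()` with a handful of multiplications, since 1/a is 5 or 10) can be sketched as follows; the function names `pow5f`/`pow10f` are made up for illustration:

```c
/* x^5 and x^10 by repeated squaring, as suggested in the comments:
   x^5 = (x^2)^2 * x, and x^10 = (x^5)^2. */
static float pow5f(float x) {
    float x2 = x * x;    /* x^2 */
    float x4 = x2 * x2;  /* x^4 */
    return x4 * x;       /* x^5 */
}

static float pow10f(float x) {
    float x5 = pow5f(x); /* x^5 */
    return x5 * x5;      /* x^10 */
}
```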

3 Answers

14

The first thing to do is to get rid of one of those exponentiations by factoring out one of the terms. The code below uses

(x^a + y^a + z^a)^(1/a) = x * ((1 + (y/x)^a + (z/x)^a)^(1/a))

Factoring out the largest of the three would be a bit safer, and perhaps more accurate.

Another optimization is to take advantage of the fact that a is either 0.1 or 0.2. Use a Chebyshev polynomial approximation to approximate x^a. The code below only has an approximation for x^0.1; x^0.2 is simply that squared. Finally, since 1/a is a small integer (5 or 10), the final exponentiation can be replaced with a small number of multiplications.

To see what's going on in the functions powtenth and powtenthnorm see this answer: Optimizations for pow() with const non-integer exponent? .

#include <stdlib.h>
#include <math.h>


float powfive (float x);
float powtenth (float x);
float powtenthnorm (float x);

// Returns (x^0.2 + y^0.2 + z^0.2)^5
float pnormfifth (float x, float y, float z) {
   float u = powtenth(y/x);
   float v = powtenth(z/x);
   return x * powfive (1.0f + u*u + v*v);
}

// Returns (x^0.1 + y^0.1 + z^0.1)^10
float pnormtenth (float x, float y, float z) {
   float u = powtenth(y/x);
   float v = powtenth(z/x);
   float w = powfive (1.0f + u + v);
   return x * w * w;
}

// Returns x^5
float powfive (float x) {
   float xsq = x*x;
   return xsq*xsq*x;
}

// Returns x^0.1.
float powtenth (float x) {
   // pow2_tenth[k] is 2^(k/10).  (Calls to pow() are not valid in a
   // static initializer in portable C, so the values are written out.)
   static const float pow2_tenth[10] = {
      1.0f,                 // 2^0.0
      1.0717734625362931f,  // 2^0.1
      1.1486983549970350f,  // 2^0.2
      1.2311444133449163f,  // 2^0.3
      1.3195079107728942f,  // 2^0.4
      1.4142135623730951f,  // 2^0.5
      1.5157165665103982f,  // 2^0.6
      1.6245047927124710f,  // 2^0.7
      1.7411011265922482f,  // 2^0.8
      1.8660659830736150f   // 2^0.9
   };

   float s;
   int iexp;

   s = frexpf (x, &iexp);
   s *= 2.0;
   iexp -= 1;

   div_t qr = div (iexp, 10);
   if (qr.rem < 0) {
      qr.quot -= 1;
      qr.rem += 10;
   }

   return ldexpf (powtenthnorm(s)*pow2_tenth[qr.rem], qr.quot);
}

// Returns x^0.1 for x in [1,2), to within 1.2e-7 (relative error).
// Want more precision? Add more Chebyshev polynomial coefs.
float powtenthnorm (float x) {
   static const int N = 8;

   // Chebyshev polynomial terms.
   // Non-zero terms calculated via
   //   integrate (2/pi)*ChebyshevT[n,u]/sqrt(1-u^2)*((u+3)/2)^0.1
   //   from -1 to 1
   // Zeroth term is similar except it uses 1/pi rather than 2/pi.
   static const float Cn[N] = {
       1.0386703502389972,
       3.55833786872637476e-2,
      -2.7458105122604368629e-3,
       2.9828558990819401155e-4,
      -3.70977182883591745389e-5,
       4.96412322412154169407e-6,
      -6.9550743747592849e-7,
       1.00572368333558264698e-7};

   float Tn[N];

   float u = 2.0*x - 3.0;


   Tn[0] = 1.0;
   Tn[1] = u;
   for (int ii = 2; ii < N; ++ii) {
      Tn[ii] = 2*u*Tn[ii-1] - Tn[ii-2];
   }

   float y = 0.0;
   for (int ii = N-1; ii > 0; --ii) {
      y += Cn[ii]*Tn[ii];
   }

   return y + Cn[0];
}
David Hammen
  • Your answer is really inspiring. But I'm thinking in another direction: let *lg(x) = log_2(x)*. Then *x^a = 2^(a * lg(x)) = 2^a * 2^(lg(x))*. I wonder if it's easier to optimize *lg* and *2^x*. – blackball Sep 27 '13 at 08:38
  • You are correct that *x^a=2^(a*log2(x))*. The next step is incorrect. *2^a* * *2^(log2(x))* is *2^(a+log2(x))*, which is not equal to *x^a*. – David Hammen Sep 27 '13 at 12:01
4

A generic optimization would be to improve cache locality by keeping the result together with the arguments. Using powf instead of pow avoids a wasted float-double round trip. Any reasonable compiler will hoist 1/a out of the loop, so that isn't a problem.

You should also profile whether the compiler's auto-vectorization is fooled by walking a pointer rather than indexing the array explicitly.

typedef struct {
  float x,y,z, re;
} element_t;

void heavy_load(element_t * const elements, const int num_elts, const float a) {
  element_t * elts = elements;
  for (int i = 0; i < num_elts; ++i) {
    elts->re =
      powf(powf(elts->x, a) + powf(elts->y, a) + powf(elts->z, a), 1.0f / a);
    elts++;
  }
}

Using SSE2 SIMD, it would look like below. I'm using and including an excerpt from Julien Pommier's sse_mathfun.h for the exponentiation; the license is appended below.

#include <emmintrin.h>
#include <math.h>
#include <assert.h>

#ifdef _MSC_VER /* visual c++ */
# define ALIGN16_BEG __declspec(align(16))
# define ALIGN16_END
#else /* gcc or icc */
# define ALIGN16_BEG
# define ALIGN16_END __attribute__((aligned(16)))
#endif

typedef __m128 v4sf;  // vector of 4 float (sse1)
typedef __m128i v4si; // vector of 4 int (sse2)

#define _PS_CONST(Name, Val)                                            \
    static const ALIGN16_BEG float _ps_##Name[4] ALIGN16_END = { Val, Val, Val, Val }
#define _PI32_CONST(Name, Val)                                            \
    static const ALIGN16_BEG int _pi32_##Name[4] ALIGN16_END = { Val, Val, Val, Val }

_PS_CONST(1  , 1.0f);
_PS_CONST(0p5, 0.5f);
_PI32_CONST(0x7f, 0x7f);
_PS_CONST(exp_hi,   88.3762626647949f);
_PS_CONST(exp_lo,   -88.3762626647949f);
_PS_CONST(cephes_LOG2EF, 1.44269504088896341);
_PS_CONST(cephes_exp_C1, 0.693359375);
_PS_CONST(cephes_exp_C2, -2.12194440e-4);
_PS_CONST(cephes_exp_p0, 1.9875691500E-4);
_PS_CONST(cephes_exp_p1, 1.3981999507E-3);
_PS_CONST(cephes_exp_p2, 8.3334519073E-3);
_PS_CONST(cephes_exp_p3, 4.1665795894E-2);
_PS_CONST(cephes_exp_p4, 1.6666665459E-1);
_PS_CONST(cephes_exp_p5, 5.0000001201E-1);

v4sf exp_ps(v4sf x) {
    v4sf tmp = _mm_setzero_ps(), fx;
    v4si emm0;
    v4sf one = *(v4sf*)_ps_1;

    x = _mm_min_ps(x, *(v4sf*)_ps_exp_hi);
    x = _mm_max_ps(x, *(v4sf*)_ps_exp_lo);

    fx = _mm_mul_ps(x, *(v4sf*)_ps_cephes_LOG2EF);
    fx = _mm_add_ps(fx, *(v4sf*)_ps_0p5);

    emm0 = _mm_cvttps_epi32(fx);
    tmp  = _mm_cvtepi32_ps(emm0);

    v4sf mask = _mm_cmpgt_ps(tmp, fx);
    mask = _mm_and_ps(mask, one);
    fx = _mm_sub_ps(tmp, mask);

    tmp = _mm_mul_ps(fx, *(v4sf*)_ps_cephes_exp_C1);
    v4sf z = _mm_mul_ps(fx, *(v4sf*)_ps_cephes_exp_C2);
    x = _mm_sub_ps(x, tmp);
    x = _mm_sub_ps(x, z);

    z = _mm_mul_ps(x,x);

    v4sf y = *(v4sf*)_ps_cephes_exp_p0;
    y = _mm_mul_ps(y, x);
    y = _mm_add_ps(y, *(v4sf*)_ps_cephes_exp_p1);
    y = _mm_mul_ps(y, x);
    y = _mm_add_ps(y, *(v4sf*)_ps_cephes_exp_p2);
    y = _mm_mul_ps(y, x);
    y = _mm_add_ps(y, *(v4sf*)_ps_cephes_exp_p3);
    y = _mm_mul_ps(y, x);
    y = _mm_add_ps(y, *(v4sf*)_ps_cephes_exp_p4);
    y = _mm_mul_ps(y, x);
    y = _mm_add_ps(y, *(v4sf*)_ps_cephes_exp_p5);
    y = _mm_mul_ps(y, z);
    y = _mm_add_ps(y, x);
    y = _mm_add_ps(y, one);

    emm0 = _mm_cvttps_epi32(fx);
    emm0 = _mm_add_epi32(emm0, *(v4si*)_pi32_0x7f);
    emm0 = _mm_slli_epi32(emm0, 23);
    v4sf pow2n = _mm_castsi128_ps(emm0);
    y = _mm_mul_ps(y, pow2n);
    return y;
}

typedef struct {
    float x,y,z, re;
} element_t;   

// log_ps is also part of sse_mathfun.h (not reproduced here); like exp_ps,
// it computes the natural logarithm of each of the 4 lanes.
v4sf log_ps(v4sf x);

void fast(element_t * const elements, const int num_elts, const float a) {
    element_t * elts = elements;
    v4sf va = _mm_set1_ps(a);
    v4sf vinv_a = _mm_set1_ps(1.0f / a);
    assert(num_elts % 3 == 0); // operates on 3 elements at a time

    // elts->re = powf((powf(elts->x, a) + powf(elts->y, a) + powf(elts->z, a)), 1.0f/a);
    // using the identity pow(x, a) = exp(a * log(x)) for x > 0
    for (int i = 0; i < num_elts; i += 3) {
        // transpose
        // we save one operation over _MM_TRANSPOSE4_PS by skipping the last row of output
        v4sf r0 = _mm_load_ps(&elts[0].x); // x1,y1,z1,re1
        v4sf r1 = _mm_load_ps(&elts[1].x); // x2,y2,z2,re2
        v4sf r2 = _mm_load_ps(&elts[2].x); // x3,y3,z3,re3
        v4sf r3 = _mm_setzero_ps();        // 0, 0, 0, 0
        v4sf t0 = _mm_unpacklo_ps(r0, r1); //  x1,x2,y1,y2
        v4sf t1 = _mm_unpacklo_ps(r2, r3); //  x3,0, y3,0
        v4sf t2 = _mm_unpackhi_ps(r0, r1); //  z1,z2,re1,re2
        v4sf t3 = _mm_unpackhi_ps(r2, r3); //  z3,0, re3,0
        r0 = _mm_movelh_ps(t0, t1);        // x1,x2,x3,0
        r1 = _mm_movehl_ps(t1, t0);        // y1,y2,y3,0
        r2 = _mm_movelh_ps(t2, t3);        // z1,z2,z3,0
        // pow(x, a) = exp(a * log(x)); the 4th lane holds junk and is discarded
        v4sf ex0 = exp_ps(_mm_mul_ps(va, log_ps(r0))); // pow(x1, a), ...
        v4sf ex1 = exp_ps(_mm_mul_ps(va, log_ps(r1))); // pow(y1, a), ...
        v4sf ex2 = exp_ps(_mm_mul_ps(va, log_ps(r2))); // pow(z1, a), ...
        // sum
        v4sf s = _mm_add_ps(_mm_add_ps(ex0, ex1), ex2);
        // pow(sum, 1/a) = exp(log(sum) / a)
        v4sf es = exp_ps(_mm_mul_ps(vinv_a, log_ps(s)));
        ALIGN16_BEG float re[4] ALIGN16_END;
        _mm_store_ps(re, es);
        elts[0].re = re[0];
        elts[1].re = re[1];
        elts[2].re = re[2];
        elts += 3;
    }
}


/* Copyright (C) 2007  Julien Pommier
  This software is provided 'as-is', without any express or implied
  warranty.  In no event will the authors be held liable for any damages
  arising from the use of this software.
  Permission is granted to anyone to use this software for any purpose,
  including commercial applications, and to alter it and redistribute it
  freely, subject to the following restrictions:
  1. The origin of this software must not be misrepresented; you must not
     claim that you wrote the original software. If you use this software
     in a product, an acknowledgment in the product documentation would be
     appreciated but is not required.
  2. Altered source versions must be plainly marked as such, and must not be
     misrepresented as being the original software.
  3. This notice may not be removed or altered from any source distribution. */
Kuba hasn't forgotten Monica
  • Whenever functions like `pow()` are involved, cache coherency is a minor issue. Also, I really don't buy your assertion that you should keep arguments and output together - while it is possible that it helps a little bit, I think it could just as well be counterproductive. Can you put numbers behind that claim? – cmaster - reinstate monica Sep 26 '13 at 06:06
  • I basically agree; this is likely to be heavily compute-bound, so memory efficiency issues are likely to be relatively minor in comparison. – Oliver Charlesworth Sep 26 '13 at 10:13
3

Some C optimizations - not much, but something.

[Edit] Sorry - back to annoy: if the FP values vary widely enough, then quite often re ≈ x (or y or z). Doing a test for large magnitude differences will negate the need to call pow() on some of x, y, or z. This helps the average time, but not the worst-case time.

Replace 1.0/a with a_inverse, computed once before the loop.

Replace pow() with powf(); otherwise you are invoking the double version of pow().

Minor: the initialization float re[10000] = {0.0f} is not needed, other than to warm a memory cache.

Minor: using a pointer rather than indexing the arrays may save a bit.

Minor: for select platforms, using type double may run faster - reversing my pow()/powf() comment above.

Minor: try 3 separate arrays of x, y, z, each a float *.

Obviously profiling helps, but let's assume that is well known here.

chux - Reinstate Monica
  • Separation of the arrays is *bad* advice. It breaks cache coherency. – Kuba hasn't forgotten Monica Sep 26 '13 at 04:26
  • @Kuba Ober Agree with the cache coherency issue in general, but I don't even know whether OP's system has a cache (maybe it's stated somewhere and I missed it). I maintain it is not bad advice to try it - just maybe unlikely to help. Profiling will be the judge. – chux - Reinstate Monica Sep 26 '13 at 04:36
  • I don't think there's any reasonable hardware out there with a floating point unit that runs C and doesn't have a cache. Cache coherency is like the most basic thing to look for, ever, so suggesting otherwise is a premature pessimization. You should not have suggested it. It will **never** help. On any hardware that's currently sold it will **always** make things worse. I mean it. – Kuba hasn't forgotten Monica Sep 26 '13 at 04:37
  • @Kuba Ober I do a fair amount of embedded work in C - pretty much a state machine in many ways - no cache, and these small computers are used a lot. – chux - Reinstate Monica Sep 26 '13 at 04:39
  • This is not embedded C running on some puny 8 bit microcontroller. If it were, the "optimization" would be to code the whole thing in assembly since typical vendor floating point libraries on micros suck big time. – Kuba hasn't forgotten Monica Sep 26 '13 at 04:40
  • Almost none of "optimizations" that matter on small microcontrollers make any sense on any other hardware. In fact, many of them are pessimizations. Any reasonable compiler for a PC or ARM platform will completely ignore some of your advice anyway. – Kuba hasn't forgotten Monica Sep 26 '13 at 04:42
  • @Kuba Ober referring to embedded processors as puny 8-bit microcontrollers is not up to date with currently popular 16- and 32-bit processors. OP's question does not assert a platform. Questions about performance of FP math occur often on PICs as well as vector processors. Coding the whole thing in assembly is only an option for a shrinking portion of the growing embedded market segment. – chux - Reinstate Monica Sep 26 '13 at 05:09
  • @Kuba Ober Microchip Technology ships over a billion processors every year. Embedded is not insignificant in 2013. Optimized calculations are important here too. – chux - Reinstate Monica Sep 26 '13 at 05:18
  • 1. You missed the reference to SIMD. This is for a PC or similar "big" platform. 2. Anything 32 bit will likely have caches, probably tiny caches that will be killed by your multiple-arrays pessimization. 3. If your embedded platform has a compiler so poor that it can't deal with hoisting `1/a` out of the loop or to which array indexing vs. pointer + increment matters, it's time to switch platforms. I've realized that a while ago. Life's too short :) – Kuba hasn't forgotten Monica Sep 26 '13 at 05:18
  • @Kuba Ober Wrong again, as I did not miss the reference. OP said "... prefer not to use SIMD". I took this to mean OP wanted, for whatever reason, to avoid using that feature available in that class of parallel computers. Maybe to make code fast on a machine that did not have that capability. – chux - Reinstate Monica Sep 26 '13 at 05:26
  • @Kuba Ober You do have a good point in "switch platforms". I have also found alternate compilers, on all levels of processors, can be an easy way to boost performance. We agree on at least something :) – chux - Reinstate Monica Sep 26 '13 at 05:32
  • Array indexing _is_ pointer arithmetic, and compilers are smart enough to optimize the multiplication with the element size away. So, from a performance perspective, both approaches are exactly equivalent. – cmaster - reinstate monica Sep 26 '13 at 06:13