
I want to implement SIMD minmag and maxmag functions. As far as I understand, these functions are

minmag(a,b) = |a|<|b| ? a : b
maxmag(a,b) = |a|>|b| ? a : b

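For reference, the same semantics as a scalar sketch (the helper names are my own, not from any library; note that with these definitions a tie |a| == |b| returns b in both cases):

```cpp
#include <cmath>

// Scalar reference versions of the definitions above (illustrative names).
// Ties (|a| == |b|) fall through to b in both functions.
static inline double minmag(double a, double b) {
    return std::fabs(a) < std::fabs(b) ? a : b;
}
static inline double maxmag(double a, double b) {
    return std::fabs(a) > std::fabs(b) ? a : b;
}
```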
I want these for float and double, and my target hardware is Haswell. What I really need is code that calculates both at once. Here is what I have for SSE4.1 for double (the AVX code is almost identical):

static inline void maxminmag(__m128d & a, __m128d & b) {
    __m128d mask    = _mm_castsi128_pd(_mm_setr_epi32(-1,0x7FFFFFFF,-1,0x7FFFFFFF));
    __m128d aa      = _mm_and_pd(a,mask);
    __m128d ab      = _mm_and_pd(b,mask);
    __m128d cmp     = _mm_cmple_pd(ab,aa);
    __m128d cmpi    = _mm_xor_pd(cmp, _mm_castsi128_pd(_mm_set1_epi32(-1)));
    __m128d minmag  = _mm_blendv_pd(a, b, cmp);
    __m128d maxmag  = _mm_blendv_pd(a, b, cmpi);
    a = maxmag, b = minmag;
}

However, this is not as efficient as I would like. Is there a better method, or at least an alternative worth considering? I would like to avoid port 1 since I already have many additions/subtractions using that port. The _mm_cmple_pd intrinsic goes to port 1.

The main function I am interested in is this:

//given |a| > |b|
static inline doubledouble4 quick_two_sum(const double4 & a, const double4 & b)  {
    double4 s = a + b;
    double4 e = b - (s - a);
    return (doubledouble4){s, e};
}
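To make the precondition concrete, here is a scalar sketch of the same transformation (names are illustrative, not from my actual code): when |a| >= |b|, s - a is computed exactly, so e recovers the exact rounding error of the sum.

```cpp
// Scalar sketch of quick_two_sum: given |a| >= |b|, s + e equals a + b
// exactly, with s the rounded sum and e the rounding error.
struct dd { double hi, lo; };

static inline dd quick_two_sum_scalar(double a, double b) {
    double s = a + b;          // rounded sum
    double e = b - (s - a);    // exact rounding error, valid when |a| >= |b|
    return {s, e};
}
```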

So what I am really after is this

static inline doubledouble4 two_sum_MinMax(double4 a, double4 b) {
    maxminmag(a, b);    // a and b are taken by value since maxminmag modifies its arguments
    return quick_two_sum(a, b);
}

Edit: My goal is for two_sum_MinMax to be faster than two_sum below:

static inline doubledouble4 two_sum(const double4 &a, const double4 &b) {
        double4 s = a + b;
        double4 v = s - a;
        double4 e = (a - (s - v)) + (b - v);
        return (doubledouble4){s, e};
}
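For illustration, the scalar form of this branch-free two_sum (the standard 6-operation 2Sum; names are mine) shows what the extra three operations buy: no precondition on the magnitudes of a and b.

```cpp
// Scalar sketch of the branch-free two_sum: works for any a, b,
// at the cost of six add/sub operations instead of three.
struct dd2 { double hi, lo; };  // illustrative name

static inline dd2 two_sum_scalar(double a, double b) {
    double s = a + b;
    double v = s - a;
    double e = (a - (s - v)) + (b - v);
    return {s, e};
}
```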

Edit: Here is the ultimate function I'm after. It does 20 add/subs, all of which go to port 1 on Haswell. Using my implementation of two_sum_MinMax from this question gets it down to 16 add/subs on port 1, but it has worse latency and is still slower. You can see the assembly for this function and read more about why I care about this at optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble

static inline doubledouble4 adddd(const doubledouble4 &a, const doubledouble4 &b) {
        doubledouble4 s, t;
        s = two_sum(a.hi, b.hi);
        t = two_sum(a.lo, b.lo);
        s.lo += t.hi;
        s = quick_two_sum(s.hi, s.lo);
        s.lo += t.lo;
        s = quick_two_sum(s.hi, s.lo);
        return s;
        // 2*two_sum, 2 add, 2*quick_two_sum = 2*6 + 2 + 2*3 = 20 add
}
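For anyone who wants to experiment with this outside SIMD, here is a self-contained scalar sketch of the same doubledouble addition (all names are illustrative): each half is expanded with two_sum, then the result is renormalized twice with quick_two_sum.

```cpp
// Self-contained scalar sketch of doubledouble + doubledouble addition.
struct ddnum { double hi, lo; };

static inline ddnum qts(double a, double b) {   // quick_two_sum, requires |a| >= |b|
    double s = a + b, e = b - (s - a);
    return {s, e};
}
static inline ddnum ts(double a, double b) {    // branch-free two_sum, no precondition
    double s = a + b, v = s - a;
    return {s, (a - (s - v)) + (b - v)};
}
static inline ddnum add_dd(ddnum a, ddnum b) {
    ddnum s = ts(a.hi, b.hi);   // expand the high parts
    ddnum t = ts(a.lo, b.lo);   // expand the low parts
    s.lo += t.hi;
    s = qts(s.hi, s.lo);        // first renormalization
    s.lo += t.lo;
    return qts(s.hi, s.lo);     // second renormalization
}
```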
Z boson
  • Correct me if I'm wrong, but wouldn't `minmag = blendv(a, b, cmp); maxmag = blendv(b, a, cmp);` do the same as your code while reusing the same mask? – EOF Jun 03 '15 at 12:12
  • @EOF, it's `maxmag = _mm_blendv_pd(a, b, cmpi);` maybe I should have called it `icmp` instead of `cmpi`. The `i` for invert. – Z boson Jun 03 '15 at 12:18
  • Yes, I'm aware. But you can *also* invert the blend by reversing the arguments... – EOF Jun 03 '15 at 12:19
  • @EOF, oh, good point! I read your comment wrong, not you mine. – Z boson Jun 03 '15 at 12:26
  • Can you give a brief explanation of what `two_sum` is supposed to do ? It doesn't make much sense to me at first glance. Is it for [Kahan summation](http://en.wikipedia.org/wiki/Kahan_summation_algorithm), or something like that ? – Paul R Jun 03 '15 at 13:15
  • @PaulR, it's not Kahan summation. It's doing (double + double) to [doubledouble](https://en.wikipedia.org/wiki/Quadruple-precision_floating-point_format#Double-double_arithmetic) addition. It's like 64-bit + 64-bit to 128-bit integer addition but for floating point instead of integer: `s` means sum and `e` means error (I assume). See [this question](https://stackoverflow.com/questions/30573443/optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble) and read the comments for why I'm interested in this. – Z boson Jun 03 '15 at 13:42
  • OK - thanks - that makes more sense now - I guess I wasn't sure what the `double4` and `doubledouble4` data types were (I'm more of an integer/fixed-point guy myself ;-)). – Paul R Jun 03 '15 at 14:07
  • @Zboson [This paper](http://web.mit.edu/tabbott/Public/quaddouble-debian/qd-2.3.4-old/docs/qd.pdf) describes a method to do doubledouble addition with only 3 additional instructions when you don't know which number is greater in magnitude. I can't say whether it's faster than the solutions here since the 3 extra instructions are adds/subs which would contend for the same execution units. – Mysticial Jun 03 '15 at 14:27
  • @Mysticial, thanks, that's one of the papers I'm using. The `two_sum` function I mentioned uses six adds/subs. When you know |a|>|b| it can be done in three. – Z boson Jun 03 '15 at 14:34
  • You can do an FP add on port 0 by using an FMA (multiplying by `1.0`). Latency=5, so don't do it on the critical path. – Peter Cordes Jul 02 '15 at 13:23
  • @PeterCordes, I already tried that, it was not better. See Evgeny Kluev's comment in Paul R's answer. – Z boson Jul 02 '15 at 13:59
  • You said you only tried replacing ALL your adds with FMAs. If dependency chains were an issue at all, then 5 cycle latency instead of 3 would be a problem. Only about half of the adds need to be FMA to balance port0/port1 (depending on what ports the other insns use). – Peter Cordes Jul 02 '15 at 14:20
  • @PeterCordes, you're correct. I could tune it by using FMA for some additions instead of all. I have not done this. I might look into it again later. – Z boson Jul 02 '15 at 14:23
  • Try IACA to find adds that aren't on the critical path, and turn those into FMA, if you do get back to it. – Peter Cordes Jul 02 '15 at 14:29
  • @PeterCordes, good point. I was using IACA but I did not think to only turn the adds not on the critical path to FMA. – Z boson Jul 02 '15 at 14:31

1 Answer


Here's an alternate implementation which uses fewer instructions:

static inline void maxminmag_test(__m128d & a, __m128d & b) {
    __m128d cmp     = _mm_add_pd(a, b); // test for mean(a, b) >= 0
    __m128d amin    = _mm_min_pd(a, b);
    __m128d amax    = _mm_max_pd(a, b);
    __m128d minmag  = _mm_blendv_pd(amin, amax, cmp);
    __m128d maxmag  = _mm_blendv_pd(amax, amin, cmp);
    a = maxmag, b = minmag;
}

It uses a somewhat subtle algorithm (see below), combined with the fact that we can use the sign bit as a selection mask.

It also uses @EOF's suggestion of using only one mask and switching the operand order, which saves an instruction.

I've tested it with a small number of cases and it seems to match your original implementation.


Algorithm:

 if (mean(a, b) >= 0)       // this can just be reduced to (a + b) >= 0
 {
     minmag = min(a, b);
     maxmag = max(a, b);
 }
 else
 {
     minmag = max(a, b);
     maxmag = min(a, b);
 }
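The same algorithm as a scalar sketch (my own illustration, not part of the vector code): the key observation is that when a + b >= 0 the larger value is also the larger in magnitude, and when a + b < 0 the roles swap.

```cpp
// Scalar sketch of the sign-of-sum trick: plain min/max plus one
// selection on the sign of (a + b). Ties (a == -b) follow min/max order.
static inline void maxminmag_scalar(double & a, double & b) {
    double lo = (a < b) ? a : b;
    double hi = (a < b) ? b : a;
    bool nonneg = (a + b) >= 0.0;       // sign of mean(a, b)
    double maxmag = nonneg ? hi : lo;
    double minmag = nonneg ? lo : hi;
    a = maxmag, b = minmag;
}
```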
Paul R
  • That's a clever solution. I wanted to use the min/max instructions but did not think of this. Thanks. I'll give it a try and get back to you. – Z boson Jun 03 '15 at 12:29
  • Do you know what ports min and max use? I need to beat three add/sub instructions. Your solution already uses one add so if min and max still use port 1 I'm not sure it will be any better (but it's certainly worth a try). – Z boson Jun 03 '15 at 12:32
  • I added to the end of my question the function `two_sum` which I want to replace with `two_sum_MaxMin` if it's faster. – Z boson Jun 03 '15 at 12:40
  • @Zboson: According to Agner Fog's instruction tables, `MAXPD` on Haswell is `port 1, 3 cycles latency`. You could shift some work to `port 5` by `amin = _mm_min_pd(a,b); amax = a^b^amin`. – EOF Jun 03 '15 at 12:41
  • @Zboson: and you could shift some work to port 0 by using FMA instead of "add". – Evgeny Kluev Jun 03 '15 at 13:28
  • @EvgenyKluev, good point, I tried that last night using `fma(a,1.0,b)` but it gave worse performance. But I used it everywhere (for every addition and subtraction). Maybe if I use fma in some cases instead of for every addition and subtraction it will help. – Z boson Jun 03 '15 at 13:37
  • I made one more edit (sorry) with the ultimate function I'm after. It's `doubledouble + doubledouble` to `doubledouble` addition. It needs 20 additions/subtractions. Using my implementation of minmag and maxmag gets it down to 16. – Z boson Jun 03 '15 at 14:42
  • You provided a clever alternative to my solution. However, your solution is slower than the `two_sum` function. Also the bit-wise and instructions I think have a latency of 1 and don't use port 1. Additionally, your solution is a bit less accurate. I think that's because of the addition you do which can "carry" but I'm not sure. Nevertheless, I really appreciate your answer. – Z boson Jun 04 '15 at 07:02
  • Re accuracy - I guess there might be an edge case when `a == -b` - I think you're effectively using `|a| > |b|` in your test whereas I'm using `|a| >= |b|` - it didn't show up in my limited testing but it might be worth looking at in case it affects the final result. – Paul R Jun 04 '15 at 07:08
  • @PaulR, I found a quicker solution to my double-double addition using these minmaxmag functions. See my answer https://stackoverflow.com/questions/30573443/optimize-for-fast-multiplication-but-slow-addition-fma-and-doubledouble/30643684#30643684 – Z boson Jun 04 '15 at 12:26
  • I think I am doing `|a| >= |b|` (using `_mm_cmple_pd(ab,aa)`) – Z boson Jun 04 '15 at 12:28
  • I just realized in my definition above I did not define what happens when |a| == |b|. There appear to be different definitions. The one in OpenCL returns max(a,b) or min(a,b) when |a| == |b|; another definition returns the first argument. I guess I should ask a question about this. My code does the second case. – Z boson Jun 05 '15 at 08:22
  • E.g. For a = -3, b = 3 should maxmag return a=-3 or max(a,b)=3? – Z boson Jun 05 '15 at 08:25
  • It appears even the max and min functions in IEEE 754 2008 can be user defined e.g max(-0,+0) can return the first argument https://en.wikipedia.org/wiki/IEEE_754_revision#min_and_max – Z boson Jun 05 '15 at 10:00