
This question is really a curiosity.

I was converting a routine into SIMD instructions (and I am quite new to SIMD programming), and had trouble with the following bit of code:

// args:
uint32_t phase_current;
uint32_t phase_increment;
uint32_t phase_increment_step;

for (int i = 0; i < blockSize; ++i)
{
    USEFUL_FUNC(phase_current);
    phase_increment += phase_increment_step;
    phase_current += phase_increment;
}

The Question: Assuming that USEFUL_FUNC has a SIMD implementation and I am simply trying to compute a correct vector of phase_current for processing, what is the right way of dealing with the phase_current being dependent on its previous value?

In turn, a functional-programming fold-like implementation would be similarly useful, since I'm trying to understand how to hoist out a data dependency more than I am trying to optimize for the sake of optimizing.

Lastly, if you can recommend some literature, please do. Not sure how to Google for this topic.

Paul R
ilzxc
  • I do not understand the question. – Jive Dadson Dec 26 '17 at 21:16
  • More concisely, how do I deal with the pi += pis; pc += pi; using SIMD vector instructions? (Specifically, is it possible to vectorize the phase_current and its increment, since phase_increment depends on its previous state for all values to be correct.) – ilzxc Dec 26 '17 at 21:45
  • I'm not sure if this is even possible for general USEFUL_FUNC, since phase_current depends on its previous value. If however you are actually asking about computing a second antiderivative of a function, as your title suggests, your code might not reflect what you mean. Maybe you want USEFUL_FUNC to compute something and store the result in an array? – Tobias Ribizel Dec 26 '17 at 21:54
  • @TobiasRibizel I need to figure out how to rewrite the question. The problem here is to vectorize the `phase_current`, preferably with SIMD instructions. I'm asking if there's a way to avoid `phase_current`'s dependency on its previous value in computing the same thing. (Clearly, the `phase_increment` slope is `phase_increment_step` so we can vectorize the `phase_increment` values.) – ilzxc Dec 26 '17 at 21:57
  • Then my answer would be that for general USEFUL_FUNC, this is not possible. If you call pc0 the initial value of pc, pc1 depends on pc0 and pi, pc2 depends on pc1 and pi and thus you need to finish computing pc1 before you can compute pc2, which makes any parallel computation of pc1 and pc2 (especially via SIMD) impossible. – Tobias Ribizel Dec 26 '17 at 22:08
  • I agreed -- please disregard USEFUL_FUNC. All I'm asking about is how to remove dependency between pc1 and pc2 in computing a vector of pc. It's not going to be written the same way, but there are tricks in eliminating previous state in vectorized loops (e.g. http://reanimator-web.appspot.com/articles/simdiir) so, with no disrespect, I don't think it's impossible. – ilzxc Dec 26 '17 at 22:24
  • @TobiasRibizel: SIMD prefix-sums are possible using log2(vector_width) shuffles, so this pattern of data dependency sucks but doesn't prevent vectorization, only makes it somewhat more expensive. See links in my answer. But in this case we can get rid of the dependency with math. – Peter Cordes Dec 27 '17 at 10:04
  • @PeterCordes My understanding of the OP was that USEFUL_FUNC reads from and writes to phase_current, thus making the data dependency impossible to break (this looked to me somewhat like an integration along the yet to be computed solution of a differential equation). I know about parallel prefix sums already, but had simply misunderstood the question ;) – Tobias Ribizel Dec 27 '17 at 14:44

2 Answers


The only thing I can think of is horizontal add. Imagine you have an __m128i vector with the contents {pc, 0, pi, pis}. Then the first HADD will turn it into {pc + 0, pi + pis, ...}, and a second HADD will produce pc + pi + pis.

HADD operates on two __m128i at once, so some speed-up is possible.

But interleaving instructions such that the pipeline is always full won't be a trivial exercise. Link to HADD: https://msdn.microsoft.com/en-us/library/bb531452(v=vs.120).aspx

Let me add a link to a very useful discussion of HADD for floats. A lot of the code and conclusions can be applied directly to integer HADD: Fastest way to do horizontal float vector sum on x86

Severin Pappadeux
  • Thanks, very grateful for this, because I hadn't considered it. The problem of two increments is not a trivial thing to vectorize / parallelize. – ilzxc Dec 26 '17 at 22:12
  • The HADD for floats discussion is crazy useful. Thanks! – ilzxc Dec 26 '17 at 22:15
  • 1
    The biggest point in my answer on the horizontal sum question is that you should *not* use the `hadd` / `phadd` instructions (`_mm_hadd_epi32`) if you need it to run fast (like here where it's in an inner loop, @ilzxc). You should use regular shuffles like `pshufd`, or `psrldq` (`_mm_srli_si128`) if you need to shift in zeros. I'd recommend *not* linking to MSVC's hadd intrinsic. This problem looks more like a prefix-sum. See https://stackoverflow.com/questions/10587598/simd-prefix-sum-on-intel-cpu for a version using only 2 shuffles per vector of 4x 32-bit elements. – Peter Cordes Dec 27 '17 at 07:25
  • Oh wait, it's not as hard as a prefix sum because the `increment_step` is constant.. With a constant 2nd-derivative, the math allows striding through the sequence with no extra work using only vertical add. See [my answer](https://stackoverflow.com/questions/47983660/simd-optimization-of-a-curve-computed-from-the-second-derivative/47989486#47989486) – Peter Cordes Dec 27 '17 at 09:24

So you're just looking for a way to generate vectors of 4 phase_current values, which you can pass as an arg to an arbitrary function.

TL:DR: set up the initial vectors for increment and step so each vector element strides through the sequence by 4, giving you vectors of phase_current[i+0..i+3] with still only two vector ADD operations (vertical, not horizontal). This serial dependency is one you can factor out with algebra / math.


This is a bit like a prefix sum (which you can SIMD with log2(vector_width) shuffle+add operations for vectors with vector_width elements). You can also parallelize prefix sums across multiple threads with a two-step calculation: each thread prefix-sums one region of the array, then the results are combined and each thread offsets its region of the destination array by a constant (the sum total up to the first element of that region). See the linked question for the multi-threaded version, too.
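For reference, a 4-element inclusive prefix sum in that style (log2(4) = 2 shift+add steps per vector of 4x 32-bit elements) looks roughly like this:

```c
#include <stdint.h>
#include <emmintrin.h>  // SSE2: _mm_add_epi32, _mm_slli_si128

// Inclusive prefix sum of [a, b, c, d] -> [a, a+b, a+b+c, a+b+c+d].
// _mm_slli_si128 shifts the whole register left by bytes, shifting in zeros.
__m128i prefix_sum_4x32(__m128i x)
{
    x = _mm_add_epi32(x, _mm_slli_si128(x, 4));  // [a, a+b, b+c, c+d]
    x = _mm_add_epi32(x, _mm_slli_si128(x, 8));  // [a, a+b, a+b+c, a+b+c+d]
    return x;
}
```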


But you have the huge simplification that phase_increment_step (the 2nd derivative of the value you want) is constant. I'm assuming that USEFUL_FUNC(phase_current) takes its arg by value, not by non-const reference, so the only modification to phase_current is by the += in the loop, and that USEFUL_FUNC can't somehow mutate the increment or increment_step.

One option to implement this is just to run the scalar algorithm independently in 4 separate elements of SIMD vectors, offset by 1 iteration each time. With integer adds, especially on Intel CPUs where vector-integer add latency is only 1 cycle, running 4 iterations of the running-total is cheap, and we could just do that between calls to USEFUL_FUNC. That would be a way to generate vector inputs to USEFUL_FUNC doing exactly as much work as scalar code (assuming SIMD integer add is as cheap as scalar integer add, which is mostly true if we're limited by the data dependency to 2 adds per clock).

The above method is somewhat more general and could be useful for variations on this problem where there's a true serial dependency that we can't eliminate cheaply with simple math.
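A minimal sketch of that brute-force option: keep 4 consecutive sequence positions in the lanes of a vector, and run 4 dependent vector-add steps between uses, so the total add count matches the scalar loop. Here a store into an output buffer stands in for the hypothetical SIMD USEFUL_FUNC (an assumption for illustration), and blockSize is assumed to be a multiple of 4:

```c
#include <stdint.h>
#include <emmintrin.h>  // SSE2

// Lane j runs the scalar recurrence starting j iterations into the sequence.
// Four dependent vector steps advance every lane by 4 scalar iterations.
void gen_phases_bruteforce(uint32_t *out, unsigned blockSize,
                           uint32_t p0, uint32_t inc0, uint32_t S)
{
    __m128i pc = _mm_setr_epi32((int)p0,
                                (int)(p0 + inc0 + S),
                                (int)(p0 + 2*inc0 + 3*S),
                                (int)(p0 + 3*inc0 + 6*S));
    __m128i pi = _mm_setr_epi32((int)inc0, (int)(inc0 + S),
                                (int)(inc0 + 2*S), (int)(inc0 + 3*S));
    __m128i step = _mm_set1_epi32((int)S);

    for (unsigned i = 0; i < blockSize; i += 4) {
        _mm_storeu_si128((__m128i *)(out + i), pc);  // stand-in for USEFUL_FUNC
        for (int k = 0; k < 4; ++k) {                // 4 dependent vector steps
            pi = _mm_add_epi32(pi, step);
            pc = _mm_add_epi32(pc, pi);
        }
    }
}
```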


If we're clever, we can do even better than a prefix sum or brute-force running 4 sequences one step at a time. Ideally we can derive a closed-form way to step by 4 in the sequence of values (or whatever the SIMD vector width is, times whatever unroll factor you want for multiple accumulators for USEFUL_FUNC).

Summing a sequence of step, step*2, step*3, ... will give us a constant times Gauss's closed-form formula for the sum of integers up to n: sum(1..n) = n*(n+1)/2. This sequence goes 0, 1, 3, 6, 10, 15, 21, 28, ... (https://oeis.org/A000217). (I've factored out the initial phase_increment).

The trick is going by 4 in this sequence. (n+4)*(n+5)/2 - n*(n+1)/2 simplifies down to 4*n + 10. Taking the derivative of that again, we get 4. But to go 4 steps in the 2nd integral, we have 4*4 = 16. So we can maintain a vector phase_increment which we increment with a SIMD add with a vector of 16*phase_increment_step.

I'm not totally sure I have the step-counting reasoning right (the extra factor of 4 to give 16 is a bit surprising). Working out the right formulas and taking first and second-differences in the sequence of vectors makes it very clear how this works out:

 // design notes, working through the first couple vectors
 // to prove this works correctly.

S = increment_step (constant)
inc0 = increment initial value
p0 = phase_current initial value

// first 8 step-increases:
[ 0*S,  1*S,   2*S,  3*S ]
[ 4*S,  5*S,   6*S,  7*S ]

// first vector of 4 values:
[ p0,  p0+(inc0+S),  p0+(inc0+S)+(inc0+2*S),  p0+(inc0+S)+(inc0+2*S)+(inc0+3*S) ]
[ p0,  p0+inc0+S,  p0+2*inc0+3*S,  p0+3*inc0+6*S ]  // simplified

// next 4 values:
[ p0+4*inc0+10*S,  p0+5*inc0+15*S,  p0+6*inc0+21*S,  p0+7*inc0+28*S ]

Using this and the earlier 4*n + 10 formula:

// first 4 vectors of phase_current
[ p0,              p0+1*inc0+ 1*S,  p0+2*inc0+3*S,   p0+ 3*inc0+ 6*S ]
[ p0+4*inc0+10*S,  p0+5*inc0+15*S,  p0+6*inc0+21*S,  p0+ 7*inc0+28*S ]
[ p0+8*inc0+36*S,  p0+9*inc0+45*S,  p0+10*inc0+55*S, p0+11*inc0+66*S ]
[ p0+12*inc0+78*S,  p0+13*inc0+91*S,  p0+14*inc0+105*S, p0+15*inc0+120*S ]

// first 3 vectors of phase_increment (subtract consecutive phase_current vectors):
[ 4*inc0+10*S,     4*inc0 + 14*S,   4*inc0 + 18*S,   4*inc0 + 22*S  ]
[ 4*inc0+26*S,     4*inc0 + 30*S,   4*inc0 + 34*S,   4*inc0 + 38*S  ]
[ 4*inc0+42*S,     4*inc0 + 46*S,   4*inc0 + 50*S,   4*inc0 + 54*S  ]

// first 2 vectors of phase_increment_step:
[        16*S,              16*S,            16*S,            16*S  ]
[        16*S,              16*S,            16*S,            16*S  ]
// Yes, as expected, a constant vector works for phase_increment_step
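The tables above amount to the closed form phase(n) = p0 + n*inc0 + T(n)*S, with T(n) = n*(n+1)/2 the triangular numbers, which is easy to check against the scalar recurrence:

```c
#include <stdint.h>

// Closed form read off the tables above:
// phase(n) = p0 + n*inc0 + T(n)*S, where T(n) = n*(n+1)/2.
// Unsigned arithmetic wraps mod 2^32, matching the original uint32_t loop.
uint32_t phase_closed_form(uint32_t p0, uint32_t inc0, uint32_t S, uint32_t n)
{
    return p0 + n*inc0 + (n*(n+1)/2)*S;
}
```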

So we can write code like this with Intel's SSE/AVX intrinsics:

#include <stdint.h>
#include <immintrin.h>

void USEFUL_FUNC(__m128i);

// TODO: more efficient generation of initial vector values
void double_integral(uint32_t phase_start, uint32_t phase_increment_start, uint32_t phase_increment_step, unsigned blockSize)
{

    __m128i pstep1 = _mm_set1_epi32(phase_increment_step);

    // each vector element steps by 4
    uint32_t inc0=phase_increment_start, S=phase_increment_step;
    __m128i pincr  = _mm_setr_epi32(4*inc0 + 10*S,  4*inc0 + 14*S,   4*inc0 + 18*S,   4*inc0 + 22*S);

    __m128i phase = _mm_setr_epi32(phase_start,  phase_start+1*inc0+ 1*S,  phase_start+2*inc0+3*S,   phase_start + 3*inc0+ 6*S );
     //_mm_set1_epi32(phase_start); and add.
     // shuffle to do a prefix-sum initializer for the first vector?  Or SSE4.1 pmullo by a vector constant?

    __m128i pstep_stride = _mm_slli_epi32(pstep1, 4);  // stride by pstep * 16
    for (unsigned i = 0; i < blockSize; ++i)  {  // blockSize counts vectors of 4 elements
        USEFUL_FUNC(phase);
        phase = _mm_add_epi32(phase, pincr);        // advance each lane by 4 sequence steps
        pincr = _mm_add_epi32(pincr, pstep_stride); // increments for the next stride of 4
    }
}
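To sanity-check the vector setup, you can substitute a store for USEFUL_FUNC and compare against the scalar recurrence (a hypothetical self-contained harness; note that phase must be advanced by pincr before pincr is stepped by 16*S, since pincr starts at the first difference vector):

```c
#include <stdint.h>
#include <emmintrin.h>  // SSE2

// Same vector setup as double_integral, but each phase vector is stored to
// `out` instead of being passed to USEFUL_FUNC, so the sequence can be
// verified. blockSize counts vectors of 4 elements.
void double_integral_store(uint32_t *out, uint32_t p0, uint32_t inc0,
                           uint32_t S, unsigned blockSize)
{
    __m128i pstep_stride = _mm_slli_epi32(_mm_set1_epi32((int)S), 4); // 16*S
    __m128i pincr = _mm_setr_epi32((int)(4*inc0 + 10*S), (int)(4*inc0 + 14*S),
                                   (int)(4*inc0 + 18*S), (int)(4*inc0 + 22*S));
    __m128i phase = _mm_setr_epi32((int)p0, (int)(p0 + inc0 + S),
                                   (int)(p0 + 2*inc0 + 3*S),
                                   (int)(p0 + 3*inc0 + 6*S));
    for (unsigned i = 0; i < blockSize; ++i) {
        _mm_storeu_si128((__m128i *)(out + 4*i), phase);
        phase = _mm_add_epi32(phase, pincr);          // step all 4 lanes by 4
        pincr = _mm_add_epi32(pincr, pstep_stride);   // next stride's increments
    }
}
```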

Further reading: for more about SIMD in general, but mostly x86 SSE/AVX, see https://stackoverflow.com/tags/sse/info, especially slides from SIMD at Insomniac Games (GDC 2015) which have some good stuff about how to think about SIMD in general, and how to lay out your data so you can use it.

Peter Cordes