Intel has a complete implementation as an AVX example. See below.
What makes Mandelbrot tricky is that the early-out condition for each point in the set (i.e. pixel) is different. You could keep a pair or quad of pixels iterating until the magnitude of both exceeds 2.0 (or you hit max iterations). To do otherwise would require tracking which pixel's points were in which vector element.
Anyway, a simplistic implementation to operate on a vector of 2 (or 4 with AVX) doubles at a time would have its throughput limited by the latency of the dependency chains. You'd need to do multiple dependency chains in parallel to keep both of Haswell's FMA units fed. So you'd duplicate your variables, and interleave operations for two iterations of the outer loop inside the inner loop.
Keeping track of which pixels are being calculated would be a little tricky. I think it might take less overhead to use one set of registers for one row of pixels, and another set of registers for another row. (So you can always just move 4 pixels to the right, rather than checking whether the other dep chain is already processing that vector.)
I suspect that only checking the loop exit condition every 4 iterations or so might be a win. Getting code to branch based on a packed vector comparison, is slightly more expensive than in the scalar case. The extra FP add required is also expensive. (Haswell can do two FMAs per cycle, (latency = 5). The lone FP add unit is one the same port as one of the FMA units. The two FP mul units are on the same ports that can run FMA.)
The loop condition can be checked with a packed-compare to generate a mask of zeros and ones, and a (V)PTEST
of that register with itself to see if it's all zero. (edit: movmskps
then test+jcc
is fewer uops, but maybe higher latency.) Then obviously je
or jne
as appropriate, depending on whether you did a FP compare that leaves zeros when you should exit, or zeros when you shouldn't. NAN shouldn't be possible, but there's no reason not to choose your comparison op such that a NAN will result in the exit condition being true.
const __mm256d const_four = _mm256_set1_pd(4.0); // outside the loop
__m256i cmp_result = _mm256_cmp_pd(mag_squared, const_four, _CMP_LE_OQ); // vcmppd. result is non-zero if at least one element < 4.0
if (_mm256_testz_si256(cmp_result, cmp_result))
break;
There MIGHT be some way to use PTEST
directly on a packed-double, with some bit-hack AND-mask that will pick out bits that will be set iff the FP value is > 4.0. Like maybe some bits in the exponent? Maybe worth considering. I found a forum post about it, but didn't try it out.
Hmm, oh crap, this doesn't record WHEN the loop condition failed, for each vector element separately, for the purpose of coloring the points outside the Mandelbrot set. Maybe test for any element hitting the condition (instead of all), record the result, and then set that element (and c
for that element) to 0.0 so it won't trigger the exit condition again. Or maybe scheduling pixels into vector elements is the way to go after all. This code might do fairly well on a hyperthreaded CPU, since there will be a lot of branch mispredicts with every element separately triggering the early-out condition.
That might waste a lot of your throughput, and given that 4 uops per cycle is doable, but only 2 of them can be FP mul/add/FMA, there's room for a significant amount of integer code to schedule points into vector elements. (On Sandybridge/Ivybrideg, without FMA, FP throughput is lower. But there are only 3 ports that can handle integer ops, and 2 of those are the ports for the FP mul and FP add units.)
Since you don't have to read any source data, there's only 1 memory access stream for each dep chain, and it's a write stream. (And it's low bandwidth, since most points take a lot of iterations before you're ready to write a single pixel value.) So the number of hardware prefetch streams isn't a limiting factor for the number of dep chains to run in parallel. Cache misses latency should be hidden by write buffers.
I can write some code if anyone's still interested in this (just post a comment). I stopped at the high-level design stage since this is an old question, though.
==============
I also found that Intel already used the Mandelbrot set as an example for one of their AVX tutorials. They use the mask-off-vector-elements method for the loop condition. (using the mask generated directly by vcmpps
to AND). Their results indicate that AVX (single-precision) gave a 7x speedup over scalar float, so apparently it's not common for neighbouring pixels to hit the early-out condition at very different numbers of iterations. (at least for the zoom / pan they tested with.)
They just let the FP results keep accumulating for elements that fail the early-out condition. They just stop incrementing the counter for that element. Hopefully most systems default to having the control word set to zero out denormals, if denormals still take extra cycles.
Their code is silly in one way, though: They track the iteration count for each vector element with a floating-point vector, and then convert it to int at the end before use. It'd be faster, and not occupy an FP execution unit, to use packed-integers for that. Oh, I know why they do that: AVX (without AVX2) doesn't support 256bit integer vector ops. They could have used packed 16bit int loop counters, but that could overflow. (And they'd have to compress the mask down from 256b to 128b).
They also test for all elements being > 4.0 with movmskps
and then test that, instead of using ptest
. I guess the test / jcc
can macro-fuse, and run on a different execution unit than FP vector ops, so it's maybe not even slower. Oh, and of course AVX (without AVX2) doesn't have 256bit PTEST
. Also, PTEST
is 2 uops, so actually movmskps
+ test / jcc
is fewer uops than ptest + jcc
. (PTEST
is 1 fused-domain uop on SnB, but still 2 unfused uops for the execution ports. On IvB/HSW, 2 uops even in the fused domain.) So it looks like movmskps
is the optimal way, unless you can take advantage of the bitwise AND that's part of PTEST
, or need to test more than just the high bit of each element. If a branch is unpredictable, ptest
might be lower latency, and thus be worth it by catching mispredicts a cycle sooner.