
I want to rewrite this simple routine as SSE2 code (preferably in NASM), and I am not totally sure how to do it. Two things are unclear: how to express the calculations (the inner loop, and those from the outer loop too), and how to call the C function `SetPixelInDibInt(i, j, palette[n]);` from statically linked asm code.

    void DrawMandelbrotD(double ox, double oy, double lx, int N_ITER)
    {
        double ly = lx * double(CLIENT_Y) / double(CLIENT_X);
        double dx = lx / CLIENT_X;
        double dy = ly / CLIENT_Y;
        double ax = ox - lx * 0.5 + dx * 0.5;
        double ay = oy - ly * 0.5 + dy * 0.5;
        static double re, im, c_re, c_im, rere, imim;
        static int n;

        for (int j = 0; j < CLIENT_Y; j += 1)
        {
            for (int i = 0; i < CLIENT_X; i += 1)
            {
                c_re = ax + i * dx;
                c_im = ay + j * dy;
                re = c_re;
                im = c_im;
                rere = re * re;
                imim = im * im;
                n = 1;

                for (int k = 0; k < N_ITER; k++)
                {
                    im   = (re + re) * im + c_im;
                    re   = rere - imim + c_re;
                    rere = re * re;
                    imim = im * im;
                    if ((rere + imim) > 4.0) break;
                    n++;
                }
                SetPixelInDibInt(i, j, palette[n]);
            }
        }
    }

Could someone help? I would prefer not to see other code implementations - just a NASM SSE translation of the code above. That would be most helpful in my case.

user2214913
  • Have you tried inspecting the compiler's assembly output as a starting point? (See http://stackoverflow.com/questions/1289881/using-gcc-to-produce-readable-assembly) – EGOrecords Apr 13 '13 at 09:51
  • You should write a C implementation using SSE intrinsics to begin with, and look at the assembly output. Note that you only get 2 `double` values with an SSE register, so consider AVX if you have it. – Brett Hale Apr 13 '13 at 10:03
  • Further to what @Brett says, if you're going to want to keep the calculations in double precision then there is really no point in converting this to SSE, as most modern CPUs have two scalar FP units. – Paul R Apr 13 '13 at 10:31
  • @PaulR which one? Bulldozer, Bobcat, Merom, Nehalem, and Sandy Bridge all have a throughput of 1 for `fadd` and `fmul`. SSE helps on most of them, except Bobcat which splits `addpd` in two ops. – harold Apr 13 '13 at 18:03
  • @harold: Nehalem, for example, has an FP multiply unit on port 0 and an FP add unit on port 1, so you can have a throughput of 2 FLOPs per clock. – Paul R Apr 14 '13 at 07:42
  • @PaulR oh ok I get it, if you combine them, yes. You can also combine an `addpd` (port 1) and `mulpd` (port 0) though – harold Apr 14 '13 at 07:46
  • @harold: yes, and you're probably right that there can be some small gains for DP when using SSE in some cases, it's just that these gains do tend to be relatively small compared to what might be expected with single precision or integer SIMD and are often not worth the effort. YMMV of course. – Paul R Apr 14 '13 at 07:57
  • @PaulR you're right of course, I was just surprised by what looked like a claim that the gain would be guaranteed to be zero – harold Apr 14 '13 at 08:01
  • tell how to do it ;-0 – grunge fightr Apr 14 '13 at 12:36
  • If you're interested in making this run faster, convert it to integer math. – BitBank Apr 18 '13 at 15:12

1 Answer


Intel has a complete implementation as an AVX example. See below.

What makes Mandelbrot tricky is that the early-out condition for each point in the set (i.e. pixel) is different. You could keep a pair or quad of pixels iterating until the magnitude of all of them exceeds 2.0 (or you hit max iterations). To do otherwise would require tracking which pixels' points were in which vector elements.

Anyway, a simplistic implementation to operate on a vector of 2 (or 4 with AVX) doubles at a time would have its throughput limited by the latency of the dependency chains. You'd need to do multiple dependency chains in parallel to keep both of Haswell's FMA units fed. So you'd duplicate your variables, and interleave operations for two iterations of the outer loop inside the inner loop.
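A minimal sketch of what that interleaving might look like with FMA intrinsics (all names here are mine, invented for illustration; this is just the core of one inner-loop iteration for two independent 4-pixel vectors, not a complete loop):

    #include <immintrin.h>  // AVX + FMA; compile with -mfma

    // Hypothetical inner-loop body: one Mandelbrot iteration for two
    // independent vectors of 4 doubles ("a" and "b"), interleaved so the
    // second chain's FMAs can issue while the first chain's are in flight.
    static inline void iterate_two_chains(
        __m256d *re_a, __m256d *im_a, __m256d c_re_a, __m256d c_im_a,
        __m256d *re_b, __m256d *im_b, __m256d c_re_b, __m256d c_im_b)
    {
        __m256d rere_a = _mm256_mul_pd(*re_a, *re_a);
        __m256d rere_b = _mm256_mul_pd(*re_b, *re_b);
        __m256d imim_a = _mm256_mul_pd(*im_a, *im_a);
        __m256d imim_b = _mm256_mul_pd(*im_b, *im_b);

        // im = (re+re)*im + c_im : one add feeding one FMA per chain
        __m256d re2_a = _mm256_add_pd(*re_a, *re_a);
        __m256d re2_b = _mm256_add_pd(*re_b, *re_b);
        *im_a = _mm256_fmadd_pd(re2_a, *im_a, c_im_a);
        *im_b = _mm256_fmadd_pd(re2_b, *im_b, c_im_b);

        // re = re*re - im*im + c_re (using the squares of the old values)
        *re_a = _mm256_add_pd(_mm256_sub_pd(rere_a, imim_a), c_re_a);
        *re_b = _mm256_add_pd(_mm256_sub_pd(rere_b, imim_b), c_re_b);
    }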

Keeping track of which pixels are being calculated would be a little tricky. I think it might take less overhead to use one set of registers for one row of pixels, and another set of registers for another row. (So you can always just move 4 pixels to the right, rather than checking whether the other dep chain is already processing that vector.)

I suspect that only checking the loop exit condition every 4 iterations or so might be a win; there's a sketch of this after the code block below. Getting code to branch based on a packed vector comparison is slightly more expensive than in the scalar case. The extra FP add required is also expensive. (Haswell can do two FMAs per cycle, with latency = 5. The lone FP add unit is on the same port as one of the FMA units, and the two FP mul units are on the ports that can run FMA.)

The loop condition can be checked with a packed compare to generate a mask of all-zero / all-one elements, and a (V)PTEST of that register with itself to see if it's all zero. (edit: movmskps then test+jcc is fewer uops, but maybe higher latency.) Then obviously je or jne as appropriate, depending on whether you did an FP compare that leaves zeros when you should exit, or zeros when you shouldn't. NaN shouldn't be possible, but there's no reason not to choose your comparison op such that a NaN will result in the exit condition being true.

    const __m256d const_four = _mm256_set1_pd(4.0);  // hoist outside the loop

    // vcmppd: each element is all-ones if mag_squared <= 4.0 (still iterating)
    __m256d cmp_result = _mm256_cmp_pd(mag_squared, const_four, _CMP_LE_OQ);
    // vptest: all-zero means every element has escaped, so stop iterating
    if (_mm256_testz_si256(_mm256_castpd_si256(cmp_result),
                           _mm256_castpd_si256(cmp_result)))
        break;
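And here's the promised sketch of checking the exit condition only once per group of iterations. `step()` is a hypothetical helper doing one iteration on a vector of 4 doubles in place; escaped elements just keep growing toward infinity, and the ordered compare treats any NaN they produce as escaped, matching the note above:

    #include <immintrin.h>

    // One Mandelbrot iteration on 4 doubles, in place (helper for the sketch).
    static inline void step(__m256d *re, __m256d *im, __m256d c_re, __m256d c_im)
    {
        __m256d rere = _mm256_mul_pd(*re, *re);
        __m256d imim = _mm256_mul_pd(*im, *im);
        *im = _mm256_add_pd(_mm256_mul_pd(_mm256_add_pd(*re, *re), *im), c_im);
        *re = _mm256_add_pd(_mm256_sub_pd(rere, imim), c_re);
    }

    static void iterate_vector(__m256d re, __m256d im,
                               __m256d c_re, __m256d c_im, int n_iter)
    {
        const __m256d const_four = _mm256_set1_pd(4.0);
        for (int k = 0; k < n_iter; k += 4) {
            step(&re, &im, c_re, c_im);   // 4 iterations, then one check
            step(&re, &im, c_re, c_im);
            step(&re, &im, c_re, c_im);
            step(&re, &im, c_re, c_im);
            __m256d mag_sq = _mm256_add_pd(_mm256_mul_pd(re, re),
                                           _mm256_mul_pd(im, im));
            __m256d alive = _mm256_cmp_pd(mag_sq, const_four, _CMP_LE_OQ);
            if (_mm256_testz_si256(_mm256_castpd_si256(alive),
                                   _mm256_castpd_si256(alive)))
                break;  // every element has escaped
        }
        // Per-element iteration counts omitted; see the counter sketches below.
    }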

There MIGHT be some way to use PTEST directly on a packed-double, with some bit-hack AND-mask that will pick out bits that will be set iff the FP value is > 4.0. Like maybe some bits in the exponent? Maybe worth considering. I found a forum post about it, but didn't try it out.

Hmm, oh crap, this doesn't record WHEN the loop condition failed, for each vector element separately, for the purpose of coloring the points outside the Mandelbrot set. Maybe test for any element hitting the condition (instead of all), record the result, and then set that element (and c for that element) to 0.0 so it won't trigger the exit condition again. Or maybe scheduling pixels into vector elements is the way to go after all. This code might do fairly well on a hyperthreaded CPU, since there will be a lot of branch mispredicts with every element separately triggering the early-out condition.

That might waste a lot of your throughput, but given that 4 uops per cycle are doable and only 2 of them can be FP mul/add/FMA, there's room for a significant amount of integer code to schedule points into vector elements. (On Sandybridge/Ivybridge, without FMA, FP throughput is lower. But there are only 3 ports that can handle integer ops, and 2 of those are the ports for the FP mul and FP add units.)

Since you don't have to read any source data, there's only 1 memory access stream for each dep chain, and it's a write stream. (And it's low bandwidth, since most points take a lot of iterations before you're ready to write a single pixel value.) So the number of hardware prefetch streams isn't a limiting factor for the number of dep chains to run in parallel, and cache-miss latency should be hidden by write buffers.

I can write some code if anyone's still interested in this (just post a comment). I stopped at the high-level design stage since this is an old question, though.

==============

I also found that Intel already used the Mandelbrot set as an example for one of their AVX tutorials. They use the mask-off-vector-elements method for the loop condition (ANDing with the mask generated directly by vcmpps). Their results indicate that AVX (single-precision) gave a 7x speedup over scalar float, so apparently it's not common for neighbouring pixels to hit the early-out condition at very different numbers of iterations (at least for the zoom / pan they tested with).

They let the FP results keep accumulating for elements that fail the early-out condition; they just stop incrementing the counter for those elements. Hopefully most systems default to having the MXCSR control word set to flush denormals to zero, if denormals still take extra cycles.
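In intrinsics, that masked-increment trick might look like this (a sketch with my own names, not Intel's actual code; `mag_sq` holds |z|^2 for 8 single-precision points):

    #include <immintrin.h>

    // The compare leaves all-ones lanes where the point has NOT escaped yet,
    // and zero lanes where it has.  ANDing the vector of 1.0s with that mask
    // means escaped lanes add 0.0, freezing their counts.
    static inline __m256 update_count(__m256 count, __m256 mag_sq)
    {
        const __m256 ones = _mm256_set1_ps(1.0f);
        const __m256 four = _mm256_set1_ps(4.0f);
        __m256 still_going = _mm256_cmp_ps(mag_sq, four, _CMP_LE_OQ);
        return _mm256_add_ps(count, _mm256_and_ps(ones, still_going));
    }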

Their code is silly in one way, though: They track the iteration count for each vector element with a floating-point vector, and then convert it to int at the end before use. It'd be faster, and wouldn't occupy an FP execution unit, to use packed integers for that. Oh, I know why they do that: AVX (without AVX2) doesn't support 256bit integer vector ops. They could have used packed 16bit int loop counters, but that could overflow. (And they'd have to compress the mask down from 256b to 128b.)
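The packed-integer version of that counter is a classic idiom, and it works in plain SSE2 (which the question targets, with two doubles per vector): the compare mask is integer -1 in still-running lanes, so subtracting the mask increments them. A sketch with assumed names:

    #include <emmintrin.h>  // SSE2

    // cmplepd produces all-ones (-1) in lanes still <= 4.0 and zero in
    // escaped lanes; psubq of the mask increments running lanes and
    // leaves escaped lanes' counts frozen.
    static inline __m128i update_count_int(__m128i count, __m128d mag_sq)
    {
        const __m128d four = _mm_set1_pd(4.0);
        __m128i still_going = _mm_castpd_si128(_mm_cmple_pd(mag_sq, four));
        return _mm_sub_epi64(count, still_going);
    }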

They also test for all elements being > 4.0 with movmskps and then test that, instead of using ptest. I guess the test / jcc can macro-fuse, and run on a different execution unit than FP vector ops, so it's maybe not even slower. Also, PTEST is 2 uops, so movmskps + test / jcc is actually fewer uops than ptest + jcc. (PTEST is 1 fused-domain uop on SnB, but still 2 unfused uops for the execution ports. On IvB/HSW, 2 uops even in the fused domain.) So it looks like movmskps is the optimal way, unless you can take advantage of the bitwise AND that's part of PTEST, or need to test more than just the high bit of each element. If a branch is unpredictable, ptest might be lower latency, and thus be worth it by catching mispredicts a cycle sooner.
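For reference, the movmskps form of the exit test is tiny in intrinsics (a sketch; `still_going` is the vcmpps mask as in the counter sketch above):

    #include <immintrin.h>

    // vmovmskps packs each float lane's sign bit into the low bits of a GPR;
    // the compiler's test/jcc on the result can macro-fuse.  Zero means no
    // lane is still iterating.
    static inline int any_still_going(__m256 still_going)
    {
        return _mm256_movemask_ps(still_going) != 0;
    }
    // usage in the loop:  if (!any_still_going(still_going)) break;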

Peter Cordes
  • That example by Intel is interesting. I do almost the exact same thing but I came up with the idea on my own (I use ptest instead of movemask). I get a 7x speedup with float as well with AVX. Actually, I use FMA now as well and get about 15% more with FMA. I have also implemented this technique for double-double, extending how far I can zoom. – Z boson Jun 27 '15 at 20:20
  • Your comments on latency vs throughput are interesting. I don't know how to deal with this. It's not simple to unroll the loop because each pixel can run for a different number of iterations. Let me be more specific. Let's assume I operate on 8 pixels (float) at once with AVX. Let's call this a superpixel. Each superpixel finishes when the pixel with the highest number of iterations is cut (or hits max iterations). Two different superpixels will likely finish after a different number of iterations. I don't see how to unroll the loop for this. – Z boson Jun 27 '15 at 20:24
  • You said you could post some code with your idea. I would be very interested in seeing this. – Z boson Jun 27 '15 at 20:24
  • Well Intel's code does pretty much what I was thinking: stop incrementing the counter for vector elements where the magnitude^2 > 4.0 test is true. (Actually I was thinking you'd need to mask off *x* and *c* too, to avoid float denormals. But they just mask their vector-of-ones with the result of `vcmpps` every iteration before adding it to the counter. No branch mispredicts until all elements in the vector pass the early-out test. That's pretty clever.) – Peter Cordes Jun 28 '15 at 00:48
  • As far as unrolling the loop: it's not just unrolling, it's interleaving multiple iterations of the outer loop into the inner loop that's the trick. As we both noticed, this is non-trivial because the inner loop for one vector (superpixel) might hit the early-out for all elements at a different time from the other superpixel(s). Keeping track of whether you're doing the rightmost superpixel in that row or not would add some overhead... unless you have each dep chain work on a different row. (still some bookkeeping overhead when a dep chain gets to the end of a row, but that's infrequent.) – Peter Cordes Jun 28 '15 at 00:55
  • My thinking is that there's a `jcc` in the inner loop for the early-out condition for each superpixel that you're interleaving operations for. When that condition is true, you jump out to some code that saves the result, loads the data for the next vector, and does extra stuff if this was the end of a row. Then jump back in. (Or structure things so you're jumping over the outer-loop stuff with a predicted-taken, so there's no "jump back in", as long as that doesn't reduce throughput from the uop cache by having partially-filled cachelines.) – Peter Cordes Jun 28 '15 at 01:04
  • Anyway, the multiple superpixels that you're processing in parallel to hide the latency are independent. You don't wait for them all to finish before moving on. When one superpixel is all done, you load its *x*, *c*, and iteration-counter register with new data, and leave the other one(s) untouched, then continue the inner loop. I might spend some time and try to implement this, now that Intel's provided a fully-working implementation. Although their code already uses all 15 `ymm` registers. Spilling to memory is prob. too big a perf hit. – Peter Cordes Jun 28 '15 at 01:06