How to optimize an FPU routine

Question

I have this C routine:

    int n_mandelbrot(double c_im, double c_re, int N_ITER)
    {
      static double re, im, re2, im2;
      static int n;

      im2 = im = 0;
      re2 = re = 0;

      for (n = 0; n < N_ITER; n++)
      {
        im  = (re + re) * im + c_im;
        re  = re2 - im2 + c_re;
        im2 = im * im;
        re2 = re * re;
        if (re2 + im2 > 4.0) break;
      }

      return n;
    }

I want to rewrite it in assembly, and I managed to write this:

   n_mandelbrot_fpu_double: ;; (double cre, double cim, int N_ITER)

   mov   edx, dword [esp+20]  ;; N_ITER
   mov   ecx, 0

   fld        qword [esp+4+0]  ;; cre
   fld        qword [esp+12+0] ;; cim

   fld1
   fadd st0, st0
   fadd st0, st0         ;; 4.0

   fldz              ;; re = 0
   fldz              ;; im  = 0
   fldz              ;; re2 = 0
   fldz              ;; im2 = 0

   mlloopp:

   ;; here
   ;;            im =  (re+re)*im    + c_im;
   ;;            re =   re2 - im2    + c_re;
   ;;            im2=im*im;
   ;;            re2=re*re;
   ;;            if ( re2 + im2 > 4.0 ) break;

   ;; STACK:  cre cim 4.0 re im re2 im2

   fld st3
   fadd st0, st0
   fmul st3
   fadd st6
   fxch st3
   fstp st0

   fld  st1
   fsub st1
   fadd st7
   fxch st4
   fstp st0

   fld st2
   fmul st0, st0
   fxch st1
   fstp st0

   fld st3
   fmul st0, st0
   fxch st2
   fstp st0

   fld    st0
   fadd   st2
   fcomp  st5
   fnstsw ax
   sahf
   ja    mloopout

   inc    ecx
   cmp    ecx,edx
   jb     mlloopp

   mloopout:

   fstp st0
   fstp st0
   fstp st0
   fstp st0
   fstp st0
   fstp st0
   fstp st0

   mov eax, ecx

   ret

With the C routine my program's frame loop takes 150 ms; with the assembly version it dropped to 105 ms, so this is faster. (However, an unrolled C routine that calculates two pixels in the inner loop takes only 115 ms. I do not know exactly why, or how to unroll the loop in asm.)

I do not think this asm code is efficient. I tried to keep all variables on the FPU stack: before the loop I load 7 doubles onto it (cre cim 4.0 re im re2 im2). Inside the loop there is then a lot of loading values onto the top of the stack, exchanging them, and popping them back with fstp, so I suspect it is not very efficient.

Could someone help me improve it? The code outside the inner loop does not matter much, but the code inside the inner loop matters a lot here.
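The two-pixel variant mentioned above is not shown in the question, so the following is only a guess at its general shape. A plausible (unmeasured) reason it runs faster is that the two points form independent dependency chains, so the FPU can overlap the latency of one chain with work from the other. The function name `n_mandelbrot_pair`, the array parameters, and the `live_a`/`live_b` flags are illustrative assumptions, not part of the original code.

```c
/* Sketch only: iterate two points in one loop. The function name, the
   array parameters, and the "live" flags are illustrative assumptions. */
void n_mandelbrot_pair(const double c_im[2], const double c_re[2],
                       int n_iter, int out[2])
{
    double re_a = 0.0, im_a = 0.0, re2_a = 0.0, im2_a = 0.0;
    double re_b = 0.0, im_b = 0.0, re2_b = 0.0, im2_b = 0.0;
    int n_a = n_iter, n_b = n_iter;  /* result if a point never escapes */
    int live_a = 1, live_b = 1;      /* 1 while the point has not escaped */
    int n;

    for (n = 0; n < n_iter && (live_a | live_b); n++) {
        /* Same recurrence as n_mandelbrot, interleaved for two points.
           The a and b statements do not depend on each other. */
        im_a = (re_a + re_a) * im_a + c_im[0];
        im_b = (re_b + re_b) * im_b + c_im[1];
        re_a = re2_a - im2_a + c_re[0];
        re_b = re2_b - im2_b + c_re[1];
        im2_a = im_a * im_a;
        im2_b = im_b * im_b;
        re2_a = re_a * re_a;
        re2_b = re_b * re_b;

        /* Record the iteration of the first escape only. An escaped point
           keeps being iterated, but its later values are ignored. */
        if (live_a && re2_a + im2_a > 4.0) { n_a = n; live_a = 0; }
        if (live_b && re2_b + im2_b > 4.0) { n_b = n; live_b = 0; }
    }

    out[0] = n_a;
    out[1] = n_b;
}
```

Each result should equal what `n_mandelbrot` returns for the same point. In an x87 version, holding two full sets of state would not fit in the eight stack registers alongside the constants, so some values would probably have to live in memory.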

Don't worry about `fxch`, in normal circumstances it has 0 latency. I would worry about that `fcomp \ fnstsw \ sahf` thing though, that's not great. Don't you have `fcomip`? — harold, Apr 14 '13 at 12:37
fxch may be fast, but above I also use a lot of "fld stN", a calculation, and then "fxch stN / fstp st0" (to load a value from deep in the stack, calculate with it, then push it back deep into the stack again). Is that also fast? The code above seems to calculate what it should, but it also seems really stupid to me in terms of unnecessary loading and then popping of values. — grunge fightr, Apr 14 '13 at 12:41
I would be very surprised if you were able to write assembler that outperforms a modern C compiler with full optimization on for the type of function that you have. Perhaps you should first invest into decent C code? Help your compiler to help you. In particular your declaration of all of your variables as `static` is really counterproductive, this constrains the compiler a lot. — Jens Gustedt, Apr 14 '13 at 13:31
Get rid of *static* for an instant improvement. And update your compiler, modern ones generate SSE2 code for this. — Hans Passant, Apr 14 '13 at 13:43
Yes, my compiler is able to produce very dense SSE code of 15 lines for the loop. You definitely shouldn't do these things manually these days. — Jens Gustedt, Apr 14 '13 at 13:49
Could you post it (those lines) as an answer? I would like to see it. Indeed, I use an old compiler (for convenience); maybe MinGW would generate faster code, but I cannot test that quickly right now. Besides, I want to practice hand-coding assembly for learning purposes; I like it. — grunge fightr, Apr 14 '13 at 14:01
fcomip seems to work faster: frame time dropped from 105 ms to 96 ms, so it is noticeable. Can I use it without fwait, as I do? I do not understand this fwait stuff :c — grunge fightr, Apr 14 '13 at 14:57
`fwait` is almost never necessary. Leave it out and see what happens. — harold, Apr 14 '13 at 15:19
When N is large (and it will be), you could opt for calculating 2 or 4 iterations per loop without testing (re2+im2)>4 ... and recalculating the last 2 or 4 values. — Aki Suihkonen, Apr 14 '13 at 15:28
This routine of mine is altogether too long and lame, but shortening it needs some thought, and I have no knowledge in that field. — grunge fightr, Apr 14 '13 at 16:02
I posted some ideas for vectorizing mandelbrot on a similar question: http://stackoverflow.com/questions/15986390/some-mandelbrot-drawing-routine-from-c-to-sse2/31061038 — Peter Cordes, Jun 26 '15 at 00:48
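Two of the suggestions in the comments can be sketched in C. The first function drops `static` so the compiler is free to keep the state in registers. The second tests the escape condition only once per block of iterations and, when the test fires, rewinds to the saved state and replays with a test on every iteration. The names `n_mandelbrot_local`, `n_mandelbrot_blocked`, and `BLOCK` are illustrative assumptions. The blocked version also assumes that a point which has escaped stays escaped, which holds mathematically for this recurrence but has not been checked here for every floating-point edge case.

```c
/* Same recurrence as the question's routine, with plain locals. */
int n_mandelbrot_local(double c_im, double c_re, int n_iter)
{
    double re = 0.0, im = 0.0, re2 = 0.0, im2 = 0.0;
    int n;

    for (n = 0; n < n_iter; n++) {
        im  = (re + re) * im + c_im;
        re  = re2 - im2 + c_re;
        im2 = im * im;
        re2 = re * re;
        if (re2 + im2 > 4.0) break;
    }
    return n;
}

#define BLOCK 4  /* iterations per untested block; an arbitrary choice */

int n_mandelbrot_blocked(double c_im, double c_re, int n_iter)
{
    double re = 0.0, im = 0.0, re2 = 0.0, im2 = 0.0;
    int n = 0;

    while (n + BLOCK <= n_iter) {
        /* Save the state so the block can be replayed if needed. */
        double s_re = re, s_im = im, s_re2 = re2, s_im2 = im2;

        for (int k = 0; k < BLOCK; k++) {
            im  = (re + re) * im + c_im;
            re  = re2 - im2 + c_re;
            im2 = im * im;
            re2 = re * re;
        }

        /* Written as !(x <= 4.0) so that a NaN also counts as escaped. */
        if (!(re2 + im2 <= 4.0)) {
            re = s_re; im = s_im; re2 = s_re2; im2 = s_im2;
            break;
        }
        n += BLOCK;
    }

    /* Per-iteration test: replays a rewound block and handles the tail. */
    for (; n < n_iter; n++) {
        im  = (re + re) * im + c_im;
        re  = re2 - im2 + c_re;
        im2 = im * im;
        re2 = re * re;
        if (re2 + im2 > 4.0) break;
    }
    return n;
}
```

Both functions are meant to return the same count as the original routine. Whether the blocked form is actually faster depends on the compiler and CPU, so it would need to be measured.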
