I'm going to answer this on a very basic level for those people new to x87 who may be faced with a calculation that needs to be done on the FPU.
There are two things to consider. If you are given a calculation (INFIX notation) like:
root := (2.0*root + x/(root*root)) / 3.0
Is there a way to translate this into basic instructions that can be used by the x87 FPU? Yes, at a very basic level the x87 FPU is a stack that acts like a sophisticated RPN calculator. The equation in your code is INFIX notation. If you convert this to POSTFIX(RPN) notation, it can easily be implemented as a stack with operations.
This document provides some information on converting to POSTFIX notation. Following the rules your POSTFIX equivalent would look like:
2.0 root * x root root * / + 3.0 /
You could literally put that into an old RPN calculator (HPs) like the HP 15C using these keys where root=1
and x=27
:
2.0 [enter] root * x [enter] root [enter] root * / + 3.0 /
The online HP 15C should show the result of that calculation being 9.667. Translating this to basic x87:
- A number is a push to top of stack (fld)
- A variable is a push to top of stack (fld)
*
is fmulp (Multiply ST(1) by ST(0), store result in ST(1), and pop the register stack)
/
is fdivp (Divide ST(1) by ST(0), store result in ST(1), and pop the register stack)
+
is faddp (Add ST(0) to ST(1), store result in ST(1), and pop the register stack)
-
is fsubp (Subtract ST(0) from ST(1), store result in ST(1), and pop register stack)
You can literally convert 2.0 root * x root root * / + 3.0 /
to x87 instructions:
fld Two ; st(0)=2.0
fld root ; st(0)=root, st(1)=2.0
fmulp ; st(0)=(2.0 * root)
fld xx ; st(0)=x, st(1)=(2.0 * root)
fld root ; st(0)=root, st(1)=x, st(2)=(2.0 * root)
fld root ; st(0)=root, st(1)=root, st(2)=x, st(3)=(2.0 * root)
fmulp ; st(0)=(root * root), st(1)=x, st(2)=(2.0 * root)
fdivp ; st(0)=(x / (root * root)), st(1)=(2.0 * root)
faddp ; st(0)=(2.0 * root) + (x / (root * root))
fld Three ; st(0)=3.0, st(1)=(2.0 * root) + (x / (root * root))
fdivp ; st(0)=((2.0 * root) + (x / (root * root))) / 3.0
Once you have the basics, you can move on to improving efficiency.
Regarding Edit 2 / Followup question
One thing to keep in mind is that if you don't use instructions that pop values off the stack, each iteration of the loop will consume more FPU stack slots. Generally the FPU instructions ending with P
pop values off the stack. You don't use any instructions to remove items off the stack, the FPU stack keeps growing.
Unlike the program stack in user space, the FPU stack is very limited as it only has 8 slots. If you put more than 8 active values on the stack you will get overflow errors in the form of 1#IND
. If we analyze your code and view the stack after each instruction we'd find this:
fld root ; st(0)=root
repreatAgain:
fmul two ; st(0)=(2.0*root)
fld root ; st(0)=root, st(1)=(2.0*root)
fmul st(0), st(0) ; st(0)=(root*root), st(1)=(2.0*root)
fld xx ; st(0)=x, st(1)=(root*root), st(2)=(2.0*root)
fdiv st(0), st(1) ; st(0)=(x/(root*root)), st(1)=(root*root), st(2)=(2.0*root)
fadd st(0), st(2) ; st(0)=((2.0*root) + x/(root*root)), st(1)=(root*root), st(2)=(2.0*root)
fld three ; st(0)=3.0, st(1)=((2.0*root) + x/(root*root)), st(2)=(root*root), st(3)=(2.0*root)
fld st(1) ; st(0)=((2.0*root) + x/(root*root)), st(1)=3.0, st(2)=((2.0*root) + x/(root*root)), st(3)=(root*root), st(4)=(2.0*root)
fdiv st(0), st(1) ; st(0)=(((2.0*root) + x/(root*root))/3.0), st(1)=3.0, st(2)=((2.0*root) + x/(root*root)), st(3)=(root*root), st(4)=(2.0*root)
jmp repreatAgain
Observe that after the last FDIV instruction and before the JMP we have 5 items on the stack (st(0) through st(4)). When we entered the loop we only had 1 which was root
in st(0). The best way to resolve this is to use instructions in such a way that values get popped (removed) from the stack as the calculation progresses.
One other less efficient way is to free up the values we no longer want on the stack before repeating the loop. The FFREE instruction can used for this purpose by manually marking the entries unused starting from the bottom of the stack. If you add these lines after the code above, and before the jmp repreatAgain
the code should work:
ffree st(4) ; st(0)=(((2.0*root) + x/(root*root))/3.0), st(1)=3.0, st(2)=((2.0*root) + x/(root*root)), st(3)=(root*root)
ffree st(3) ; st(0)=(((2.0*root) + x/(root*root))/3.0), st(1)=3.0, st(2)=((2.0*root) + x/(root*root))
ffree st(2) ; st(0)=(((2.0*root) + x/(root*root))/3.0), st(1)=3.0
ffree st(1) ; st(0)=(((2.0*root) + x/(root*root))/3.0)
fst root ; Update root variable
jmp repreatAgain
With the use of the FFREE instruction we end the loop with only the new root
in st(0).
I've also added fst root
because of the way you did your calculation. Your calculation includes fld root
which relies on the value in root
being updated when each loop is finished. There is a more efficient way of doing this but I'm providing a fix that works in your current code without much reworking.
If you were to use the inefficient/simple code snippet I provided earlier to do the calculations you'd end up with code likes this:
finit ; initialize FPU
repreatAgain:
fld Two ; st(0)=2.0
fld root ; st(0)=root, st(1)=2.0
fmulp ; st(0)=(2.0 * root)
fld xx ; st(0)=x, st(1)=(2.0 * root)
fld root ; st(0)=root, st(1)=x, st(2)=(2.0 * root)
fld root ; st(0)=root, st(1)=root, st(2)=x, st(3)=(2.0 * root)
fmulp ; st(0)=(root * root), st(1)=x, st(2)=(2.0 * root)
fdivp ; st(0)=(x / (root * root)), st(1)=(2.0 * root)
faddp ; st(0)=(2.0 * root) + (x / (root * root))
fld Three ; st(0)=3.0, st(1)=(2.0 * root) + (x / (root * root))
fdivp ; newroot = st(0)=((2.0 * root) + (x / (root * root))) / 3.0
fstp root ; Store result at top of stack into root and pop value
; at this point the stack is clear again since
; all items pushed have been popped.
jmp repreatAgain
This code doesn't require FFREE because elements are popped off the stack as the calculation progresses. The instruction FADDP, FSUBP, FDIVP, FADDP will additionally pop the value off the top of the stack. This has the side effect of keeping the stack clear of the partial intermediate calculations.
Integrate a Loop
To integrate the loop into the simple/inefficient code I created earlier, you can use a variant of the FCOM (Floating point compare) for comparison. The results of the floating point compare is then transferred/converted to the regular CPU flags (EFLAGS). One can then use the regular comparison operators to perform the conditional checks. The code could look like this:
epsilon REAL4 0.001
.CODE
main PROC
finit ; initialize FPU
repeatAgain:
fld Two ; st(0)=2.0
fld root ; st(0)=root, st(1)=2.0
fmulp ; st(0)=(2.0 * root)
fld xx ; st(0)=x, st(1)=(2.0 * root)
fld root ; st(0)=root, st(1)=x, st(2)=(2.0 * root)
fld root ; st(0)=root, st(1)=root, st(2)=x, st(3)=(2.0 * root)
fmulp ; st(0)=(root * root), st(1)=x, st(2)=(2.0 * root)
fdivp ; st(0)=(x / (root * root)), st(1)=(2.0 * root)
faddp ; st(0)=(2.0 * root) + (x / (root * root))
fld Three ; st(0)=3.0, st(1)=(2.0 * root) + (x / (root * root))
fdivp ; newroot=st(0)=((2.0 * root) + (x / (root * root))) / 3.0
fld root ; st(0)=oldroot, st(1)=newroot
fsub st(0), st(1) ; st(0)=(oldroot-newroot), st(1)=newroot
fabs ; st(0)=(|oldroot-newroot|), st(1)=newroot
fld epsilon ; st(0)=0.001, st(1)=(|oldroot-newroot|), st(2)=newroot
fcompp ; Do compare&set x87 flags pop top two values off stack
; st(0)=newroot
fstsw ax ; Copy x87 Status Word containing the result to AX
fwait ; Insure previous instruction completed
sahf ; Transfer condition codes to the CPU's flags register
fstp root ; Store result (newroot) at top of stack into root
; and pop value. At this point the stack is clear
; again since all items pushed have been popped.
jbe repeatAgain ; If 0.001 <= (|oldroot-newroot|) repeat
mov eax, 0 ; exit
ret
main ENDP
END
Note: The usage of FCOMPP and manual transfer of x87 flags to CPU flags is driven by the fact that you have .586 directive at the top of your code. I'm making the assumption that because you didn't specify .686 or later that instructions like FCOMI are not available. If you were using .686
or later, then the bottom part of the code could have looked like:
fld root ; st(0)=oldroot, st(1)=newroot
fsub st(0), st(1) ; st(0)=(oldroot-newroot), st(1)=newroot
fabs ; st(0)=(|oldroot-newroot|), st(1)=newroot
fld epsilon ; st(0)=0.001, st(1)=(|oldroot-newroot|), st(2)=newroot
fcomip st(0),st(1) ; Do compare & set CPU flags, pop one value off stack
; st(0)=(|oldroot-newroot|), st(1)=newroot
fstp st(0) ; Pop temporary value off top of stack
; st(0)=newroot
fstp root ; Store result (newroot) at top of stack into root
; and pop value. At this point the stack is clear
; again since all items pushed have been popped.
jbe repeatAgain ; If 0.001 <= (|oldroot-newroot|) repeat
Quick method to creating RPN/Postfix from Infix notation
If learning to convert Infix notation to RPN/Postfix seems a bit daunting from the document I linked earlier in my question there is some relief. There are a number of websites that will do this work for you. One such site is MathBlog. Just enter your equation, click convert and it should show you the RPN/Postfix equivalent. It is limited to +-/*, parentheses and exponents with ^.
Optimizations
A big key to optimizing the code is to optimize the formula by separating the parts that remain constant between each loop from the parts that are variable. The constant parts can be computed before the loop begins.
Your original equation is this:

Separating the constants part we can arrive at:

If we replace the constants with identifiers where twothirds
= 2.0/3.0, and xover3
= x/3 then we end up with a simplified equation that looks like this:

If we convert that to POSTFIX/RPN then we get:
twothirds root * xover3 root root * / +
A similar optimization is what Peter is taking advantage of in his answer under the section Better loop body. He places the constants Twothirds
and Xover3
onto the x87 FPU stack outside the loop, and references them as needed inside the loop. This avoids having to reread them unnecessarily from memory each time through the loop.
A more complete example based upon the optimization above:
.586
.MODEL FLAT
.STACK 4096
.DATA
xx REAL4 27.0
root REAL4 1.0
Three REAL4 3.0
epsilon REAL4 0.001
Twothirds REAL4 0.6666666666666666
.CODE
main PROC
finit ; Initialize FPU
fld epsilon ; st(0)=epsilon
fld root ; st(0)=prevroot (Copy of root), st(1)=epsilon
fld TwoThirds ; st(0)=(2/3), st(1)=prevroot, st(2)=epsilon
fld xx ; st(0)=x, st(1)=(2/3), st(2)=prevroot, st(3)=epsilon
fdiv Three ; st(0)=(x/3), st(1)=(2/3), st(2)=prevroot, st(3)=epsilon
fld st(2) ; st(0)=root, st(1)=(x/3), st(2)=(2/3), st(3)=prevroot, st(4)=epsilon
repeatAgain:
; twothirds root * xover3 root root * / +
fld st(0) ; st(0)=root, st(1)=root, st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon
fmul st(0), st(3) ; st(0)=(2/3*root), st(1)=root, st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon
fxch ; st(0)=root, st(1)=(2/3*root), st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon
fmul st(0), st(0) ; st(0)=(root*root), st(1)=(2/3*root), st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon
fdivr st(0), st(2) ; st(0)=((x/3)/(root*root)), st(1)=(2/3*root), st(2)=(x/3), st(3)=(2/3), st(4)=prevroot, st(5)=epsilon
faddp ; st(0)=((2/3*root)+(x/3)/(root*root)), st(1)=(x/3), st(2)=(2/3), st(3)=prevroot, st(4)=epsilon
fxch st(3) ; st(0)=prevroot, st(1)=(x/3), st(2)=(2/3), newroot=st(3)=((2/3*root)+(x/3)/(root*root)), st(4)=epsilon
fsub st(0), st(3) ; st(0)=(prevroot-newroot), st(1)=(x/3), st(2)=(2/3), st(3)=newroot, st(4)=epsilon
fabs ; st(0)=(|prevroot-newroot|), st(1)=(x/3), st(2)=(2/3), st(3)=newroot, st(4)=epsilon
fld st(4) ; st(0)=0.001, st(1)=(|prevroot-newroot|), st(2)=(x/3), st(3)=(2/3), st(4)=newroot, st(5)=epsilon
fcompp ; Do compare&set x87 flags pop top two values off stack
; st(0)=(x/3), st(1)=(2/3), st(2)=newroot, st(3)=epsilon
fstsw ax ; Copy x87 Status Word containing the result to AX
fwait ; Insure previous instruction completed
sahf ; Transfer condition codes to the CPU's flags register
fld st(2) ; st(0)=newroot, st(1)=(x/3), st(2)=(2/3), st(3)=newroot, st(4)=epsilon
jbe repeatAgain ; If 0.001 <= (|oldroot-newroot|) repeat
; Remove temporary values on stack, cubed root in st(0)
ffree st(4) ; st(0)=newroot, st(1)=(x/3), st(2)=(2/3), st(3)=epsilon
ffree st(3) ; st(0)=newroot, st(1)=(x/3), st(2)=(2/3)
ffree st(2) ; st(0)=newroot, st(1)=(x/3)
ffree st(1) ; st(0)=newroot
mov eax, 0 ; exit
ret
main ENDP
END
This code places these values on the stack prior to entering the loop:
- st(4) =
Epsilon
value (0.001)
- st(3) = A copy of
root
before calculations are done (effectively prevroot
)
- st(2) = The constant
Twothirds
(2/3)
- st(1) =
Xover3
(x/3)
- st(0) = Active copy of
root
Before the loop repeats, the stack will have the layout above.
The code at the end before exiting removes all the temporary values and simply leaves the stack with the value of root
on the top in st(0).