Fault in calculation with small numbers in fpu?

Question

Hi, i am trying to do this formula in fpu.

(y = v*t*sin(a) - 0.5*g*t^2)

My code in c++ is:

typedef void(*Ipa_algorithm2)(double t, double alfa, double *return_value);
Ipa_algorithm2 count_y;
count_y = (Ipa_algorithm2)GetProcAddress(hInstLibrary, "ipa_algorithm");
t = t + 0.01; //going from cas = 0
(*count_y)(t,camera.angleY, &y); //t = cas;

and my code in asm is:

section .data
help_var dq 0 
speed dq 40.0 ;v = rychlost
number dq 180.0
grav dq 4.906865 ;grav= 0,5*g

ipa_algorithm2:
push ebp
mov ebp, esp
finit
fld qword [speed]
fld qword [ebp+8]
fmul st1
fstp qword [help_var] ;v pomocny je v*t
fldpi               
fld qword [ebp+16]  ;na st0 je uhel a na st1 3,14
fmul st1 ;na st0 je uhel * 3,14
fld qword [number]
fxch st1 ;na st0 je uhel*3,14 na st1 je 180
fdiv st1 ;na st0 je uhel v radianech
fsin
fld qword [help_var]
fmul st1 ;na st0 je v*t*sin uhlu
fst qword [help_var]
finit
fld qword [ebp+8]
fld qword [ebp+8]
fmul st1
fld qword [grav]
fmul st1
fld qword [help_var]
fxch st1
fsub st1


mov eax,[ebp+24]
fstp qword [eax]    

mov esp, ebp
pop ebp
ret 0

The problem is, that function ipa_algorithm2 is giving me correct numbers from the start (compared with output from program doing the same in C), but after severals steps the results start to be worse and worse. I was checking the code for 3 hours and I didn't find any mistake. Is it possible that the numbers I am counting with are so small that fpu can't count with them?

Which one is actually correct (using extended precision to check the result?) It's not clear from your code what your inputs and outputs are, because your variable names don't match each other, aren't in English, and you don't show values for most of them. — Peter Cordes, Dec 02 '17 at 11:10
Are you planning to actually use that x87 code for something? If so, why? (Why not SSE?) You realize this is pretty inefficiently written, right? (`finit` instead of using `fstp` or `fmulp` to leave the stack balanced, and reloading the same value from the stack multiple times instead of `fld st0`) — Peter Cordes, Dec 02 '17 at 11:12
The return value is y, I am counting track of a shot and variable y is point on a axis y. I don't want to use SSE (I don't think that it's worth to learn something what soon won't exist), but I am gonna use AVX. However i want use avx after this start to work. — Erik, Dec 02 '17 at 11:48
SSE and AVX are nearly identical for this. Learning AVX when you already know SSE is: put a `V` at the start of every mnemonic, and there's a separate destination operand. (e.g. `mulsd xmm0, xmm1` becomes `vmulsd xmm2, xmm0, xmm1` (so you save `movaps` instructions). Even if you don't vectorize, SSE2/AVX is better for scalar math (unless you need 80-bit `long double`). The next question becomes: why are you writing this in asm in the first place? Just to learn? For performance reasons? something else? These are all valid reasons, but I don't know which one applies. — Peter Cordes, Dec 02 '17 at 11:53
I am doing a primitive game and I wanna learn how to optimize it. In C++ it works without any problem, but in asm... And I don't want to start using AVX which si harder for me if the easier think doesn't work and i can't find out why. And the worse for me on this is, that the results are the same till the shot should start going down. — Erik, Dec 02 '17 at 12:01
AVX is easier. You don't have to manage the x87 stack. **Which input values exactly are giving imprecise answers**? Make this a [mcve] with some actual numbers, because there's not enough here for me to run this with your inputs and test it or single-step it even if I wanted to. (I'd recommend checking the input and result of `fsin` for inputs where your function gives bad answers. That will let you know if you're hitting the problem cases where it's inaccurate; see the link in my answer. If so, it's unusable for you and you'll have to use a library function or write your own.) — Peter Cordes, Dec 02 '17 at 12:04
Every input values goes wrong after some time, but I have tested this the most: t = t + 0.01; //going from t = 0;; (*vypocitej_y)(t, 30.0, &y); and in asm: rychlost dq 40.0 ;v = rychlost;; cislo dq 180.0;; grav_zrychleni dq 4.906865 ;grav_zrychleni = 0,5*g — Erik, Dec 02 '17 at 12:32
Going from `t=0` to what? Are you getting close to `t=Pi/2`? And are the values just a bit wrong, or totally wrong? Read [mcve] again, and [edit] your question. And why are you converting from degrees? Also, can you translate your variable names into English? I can guess what some of them are, but debugging uncommented x87 code with non-English variable names is not fun enough for me to want to do it in my head. — Peter Cordes, Dec 02 '17 at 12:40
I've edited the code (I hope I didn't forget something). t = is now running to inf, because the result y never starts getting smaller and the shot never hit the terrain. If I should describe the problem in exampe: results y should be like: — Erik, Dec 02 '17 at 13:19
Oh, so you probably have a bug in your x87 code, not just numerical inaccuracy. That wasn't clear from the original question. Since you eventually want an AVX vectorized version, rewrite it with *scalar* AVX instructions like `vmulss`. It's easier to debug because you don't have the x87 register stack. — Peter Cordes, Dec 02 '17 at 13:27
If I should describe the problem in exampe: results y should be like: 1 2 3 4 5 4 3 2 1 and I am getting 1 2 3 4 5 6 7 8 9... — Erik, Dec 02 '17 at 13:34

Peter Cordes · Accepted Answer · 2017-12-03T05:37:35.353

Update: according to a comment, you're getting wrong numbers for a whole range of inputs, so presumably you just have a regular bug in implementing the formula, not an FP-specific rounding-error or numerical accuracy/stability type of problem. Single-step your function in a debugger for an input that gives a wrong answer, and look at the register values.

Or better, rewrite this with scalar AVX instructions, because scalar AVX is easier than x87, and you eventually want a vectorized AVX implementation anyway, so a working scalar implementation is a better starting point. For sin(), call a vectorized sin() implementation, or let gcc auto-vectorize your function with -O3 -ffast-math. (See https://sourceware.org/glibc/wiki/libmvec: glibc has vectorized math library functions.)

Starting with a scalar x87 implementation using the slow fsin instruction is probably the least useful starting point if you eventually want something that runs fast. Good clean C would be better than a sloppy asm implementation for an instruction set you're not even going to use. (And for the final optimized version, C with intrinsics would make more sense than hand-written asm in most cases). See http://agner.org/optimize/, and other links in the x86 tag wiki.

Storing angles in games

Store directions as [x,y] vectors, not angles in radian. (Or degrees). With a normalized xy vector, adding two angles becomes a 2x2 matrix multiply (by a rotation matrix). But sin becomes trivial: if you keep your vector normalized (x^2 + y^2 = 1.0) then sin(angle) = angle.y.

Avoid using actual angles whenever possible and use normalized vectors instead. You sometimes need atan2, but usually rarely enough that you can just use the plain library version.

If you store your xy pairs in a struct-of-arrays format, it will be SIMD friendly, and you can easily do stuff with 8 float x values and 8 float matching y values. Doing stuff with a direction vector packed into a single SIMD vector is usually NOT optimal; don't be fooled by the word "vector".

See also https://stackoverflow.com/tags/sse/info and especially SIMD at Insomniac Games (GDC 2015). This will help you understand how to design your program so you can later optimize it with SIMD in the places where that's worthwhile. (You don't have to vectorize everything at the start, but changing your data layout is often a lot of work so consider making your data SIMD friendly in the first place.)

Possible sources of numerical error (turns out not to be the real problem here)

One possible reason: The worst-case error for the fsin instruction for small inputs is actually about 1.37 quintillion units in the last place, leaving fewer than four bits correct.. Most modern math libraries don't use the fsin instruction to compute the sin function, because it's not fast and has poor accuracy for some inputs.

Also, depending how you built your code, something (like the MSVCRT startup, if you're on Windows and using an old version of it) may have set the x87 FPU to less than 80-bit precision (64-bit mantissa).

Why are you writing this in asm? Do you want advice on how to make it more efficient? You should return a float in st0 as the return value, instead of storing through a pointer arg. Also, don't use finit. I think you're only doing that because you don't balance the x87 stack with pops after you load stuff, so after repeated calls you'd otherwise get NaNs from x87 stack overflow. You're still returning with the x87 stack non-empty in a function that returns void, so you're still doing it wrong and could break the caller.

USe fstp or fmulp to leave the stack balanced. Use fld st0 instead of another load. Use fmul qword [grav_zrychleni] instead of a separate fld.

Or better, use SSE2 or AVX for scalar double precision math. Unless you really want 80-bit long double.

I am using that `finit` ecaxtly for the reason you said, NaNs. For example that `fmulp st(i)` does the same as `fmul` + pops the st(i)? About that return value, I don't think i can use that. My goal is to write it in fpu, understand it and after that rewrite what i can to AVX I have no idea how I would do that through return value. — Erik, Dec 02 '17 at 11:40
@Erik: `finit` is a bad band-aid solution, and a poor one. Yes, `fmulp` effectively pops `st0` and replaces the new `st0` with the product. See http://www.website.masmforum.com/tutorials/fptute/fpuchap8.htm#fmulp (and the rest of that x87 FPU tutorial (especially [the first chapter where it describes the register stack](http://www.website.masmforum.com/tutorials/fptute/fpuchap1.htm)), if you really want to use x87 for some reason). I could give you better advice if you explain *why* you want to do this with mostly-obsolete x87. — Peter Cordes, Dec 02 '17 at 11:47
I want use x87 because only this can count me sin and cos (or it was told to me). So i decided to do the whole count in fpu and if it start to work i wanna opimize it by using AVX (but still i will count that sin and cos in fpu) — Erik, Dec 02 '17 at 11:51
@Erik: the first paragraph of this answer explains that x87 `fsin` is usually not the best way to calculate the mathematical `sin` function, and might be exactly what's wrong. High quality math libraries like glibc's libm don't use it; I suggest calling the math library `sin` function. (And yes, SSE/AVX only provide the IEEE primitive operations that are required to give correctly rounded results (to the last ULP): `* / + -` and `sqrt`. There's no `sin` instruction in SSE/AVX, so you (or a math library) should implement it yourself. — Peter Cordes, Dec 02 '17 at 11:57
@Erik: do you want your AVX version to do 4 `double` values in parallel? You might get a tiny speedup that way, but the bottleneck will be running each value through `fsin` one at a time. It has throughput of one per 47-106 cycles on Haswell (according to http://agner.org/optimize/), but an AVX parallel `sin()` implementation could probably get at least some of the work done in parallel, giving you a *big* speedup. — Peter Cordes, Dec 02 '17 at 12:01
My plan was to count 8 positions y in 1 function call (I was told that AVX takes 8 values) and than use the results to draw a shot (use 1 by 1 after some delay). I know that counting sin in fpu will put the speed of program down, but it's only one calculation and its result i can use in AVX. — Erik, Dec 02 '17 at 12:17
@Erik: AVX 256b vectors hold 8 floats or 4 doubles. You're using `qword` so I assumed you wanted `double`, and apparently precision is already a problem. — Peter Cordes, Dec 02 '17 at 12:26
@Erik: It's not "only one calculation". It takes 50 to 100 times as long as multiply or add, because the internal implementation is micro-coded. (i.e. it's a library function built-in to the CPU). You can't just count instructions to determine performance, some instructions (like `fsin`) are extremely slow; some are kinda slow (like `fdiv` or `vdivpd`), and other are fast (like 2 `vmulps` can run every clock cycle, doing 8 `float` multiplies each. But each vector of results takes 5 cycles to be ready). See http://agner.org/optimize/. Also https://stackoverflow.com/q/45113527/224132. — Peter Cordes, Dec 02 '17 at 12:27
@Erik: BTW, I forgot to mention that you probably could/should avoid needing a vectorized `sin(theta)` in the first place by storing angles as (normalized) x,y direction vectors. This is how games (and other simulations) normally do it. See updated answer. — Peter Cordes, Dec 03 '17 at 05:39

Fault in calculation with small numbers in fpu?

1 Answers1

Storing angles in games