9

At my university we were just introduced to the IA-32 x87 FPU, but we weren't told how to clear the FPU stack of elements that are no longer needed.

Imagine we're performing a simple calculation like (5.6 * 2.4) + (3.8 * 10.3).

.data
        value1: .float 5.6
        value2: .float 2.4
        value3: .float 3.8
        value4: .float 10.3

        output: .string "The result is: %f\n"

.text
.global main

main:
        fld     value1          # Load / Push 5.6 into FPU
        fmul    value2          # Multiply FPU's top (5.6) with 2.4
        fld     value3          # Load / Push 3.8 into FPU
        fmul    value4          # Multiply the top element of the FPU stack by 10.3
        fadd    %st(1)          # Add the value under the top element to the top elements value

.output:
        # Reserve 8 bytes for a double (printf's %f expects a 64-bit double)
        subl $8, %esp
        # Pop the FPU's top element to the program's Stack
        fstpl (%esp)
        # Push the string to the stack
        pushl $output
        # Call printf with the two arguments above
        call printf
        # Remove printf's arguments from the program's stack
        addl $12, %esp

.exit:
        movl $1, %eax
        int $0x80

The problem: after popping the FPU's top element, which holds the final result, how do I free the FPU stack of the remaining element, now on top, which holds the result of (5.6 * 2.4)?

The only way I can think of is to reserve more space on the program's stack and pop elements from the FPU stack until everything that is no longer needed has been removed.

Is there a way to directly manipulate the top pointer?

tmuecksch
  • A C compiler is usually good at generating code like this. Mine uses FMULP instead of FMUL, problem solved. – Hans Passant Nov 10 '13 at 16:57
  • Note that you *can* manually move top around with `fincstp` and `fdecstp` and mark regs free with `ffree`, but it is better to avoid that. – gsg Nov 10 '13 at 17:02
  • If you know how many elements you have on the FPU stack you can execute `ffree st(0)` and `fincstp` in a loop. – Michael Nov 10 '13 at 17:02
  • @gsg Why is it better to avoid that, given that I know how many elements I've put on the FPU stack? – tmuecksch Nov 10 '13 at 17:10
  • @tmuecksch Using the `fp` instructions intelligently results in fewer instructions (and possibly more readable code). – gsg Nov 10 '13 at 17:18
  • (edit, didn't notice this was an old question). I guess x87 is a practical example for teaching a stack-based register set. If you're actually writing new FP code, use SSE2 though, unless backwards compat with 10-year-old AthlonXP CPUs is important. http://stackoverflow.com/tags/x86/info has links to lots of good stuff, including a very good x87 tutorial / guide: http://www.ray.masmcode.com/tutorial/index.html. That covers not just the individual instructions, but how to use them together. It also spends time on explaining how the FP stack registers work. – Peter Cordes Nov 06 '15 at 21:35

4 Answers

7

To avoid leaving garbage on the stack, use FADDP, FMULP, and similar instructions, which pop the stack after the operation.
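As a rough sketch of that idea applied to the question's calculation (GNU assembler, AT&T syntax, 32-bit; the data labels are the question's own, while the explicit `s` size suffixes, the `ret` from `main`, and a build along the lines of `gcc -m32 -no-pie` are assumptions made here):

```asm
.data
value1: .float 5.6
value2: .float 2.4
value3: .float 3.8
value4: .float 10.3
output: .string "The result is: %f\n"

.text
.global main
main:
        flds    value1          # push 5.6:  st(0) = 5.6
        fmuls   value2          # st(0) = 5.6 * 2.4
        flds    value3          # push 3.8:  st(0) = 3.8, st(1) = 5.6 * 2.4
        fmuls   value4          # st(0) = 3.8 * 10.3
        faddp                   # st(1) += st(0), then pop: only the sum remains

        subl    $8, %esp        # room for one 64-bit double
        fstpl   (%esp)          # store the sum and pop: the x87 stack is empty
        pushl   $output
        call    printf
        addl    $12, %esp       # drop printf's two arguments

        xorl    %eax, %eax      # return 0 from main
        ret
```

The only change to the arithmetic is `faddp` in place of `fadd %st(1)`: the first product is consumed by the addition instead of being left behind under the result.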

Quonux
  • So, you suggest to hold the provisional result in the program's stack? (As the mentioned operations pop the result directly after evaluation). – tmuecksch Nov 10 '13 at 17:08
  • Indeed, use the stack to your advantage. For example, to calculate A * B + C * D you do: push A; mulp B; push C; mulp D; addp – Quonux Nov 10 '13 at 17:19
7

In case someone like me comes here searching for the best way to clear the stack, I found this simple solution to be the best:

fstp ST(0) ; just pops top of the stack
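
In the AT&T syntax the question uses, and assuming the question's program up to its `fstpl (%esp)`, the same idea might look like the fragment below (a sketch, not a complete program):

```asm
        fstpl   (%esp)          # store the final sum for printf and pop it
        fstp    %st(0)          # store st(0) onto itself, then pop: discards 5.6 * 2.4
```

After the second instruction both values the question pushed have been popped again.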
Dan M.
  • Or `FNINIT` to clear *all* the FP registers, regardless of how many were in use before. But yes, `fstp st(0)` is the most efficient way to *just* pop the top of the stack, discarding the result. – Peter Cordes Nov 06 '15 at 21:40
  • @PeterCordes I believe even eight FSTP instructions would be faster than a FNINIT instruction given the latter is microcoded. Plus with FSTP you don't have to set the FP control word back to what it should be. – Ross Ridge Jun 24 '17 at 17:31
  • @RossRidge: good point. On Skylake, `FNINIT` is 18 uops with a throughput of one per 78 cycles. `FFREE st(0)` has a throughput of one per 0.5 cycles. Being micro-coded doesn't automatically mean unusably slow (e.g. `VGATHERQPD` is 5 uops with one per 2 cycle throughput on Skylake, and anything more than 4 uops means it has to come from the microcode ROM and is stored in the uop cache as a MS-ROM pointer instead of the uops directly), but it turns out that `FNINIT` specifically is a bad suggestion on Intel and AMD, except for code-size. – Peter Cordes Jun 26 '17 at 19:52
  • On AMD Bulldozer and Ryzen, `FFREE` has 0.25c throughput (and it's the same perf as `FSTP st(0)` on Intel), so if you need to empty all 8 regs from an unknown state, that's probably your best bet. Or as [CP Taylor's answer points out](https://stackoverflow.com/questions/19892215/free-the-x87-fpu-stack-ia32/44738387#44738387), `EMMS` sets the tags for all x87 regs to unused. On AMD Bulldozer and Ryzen it's only 1 uop, but on Intel it's 10. So 8x FFREE or FSTP is faster on Intel, but EMMS is not bad (and a decent tradeoff for code-size vs. performance) – Peter Cordes Jun 26 '17 at 19:59
  • I'm working on a 256B intro right now, so **coding for size** I came to these conclusions: I'm using either `ffreep` to pop `st0`, or `fcompp` to free two slots with one instruction (`fcom` will affect FP flags, which I simply ignore). `fninit` is worth it when the full stack needs to be released, even if you need to adjust the CW afterwards. Of course this advice is completely wrong when coding for performance, though I can't see anything wrong with `ffreep` in that case (except when you can adjust the algorithm to avoid the junk pop completely by popping intermediate values during calculation). – Ped7g Jan 17 '18 at 10:27
  • Thinking about it, and the `ffreep` being sort of "undocumented, later accepted", isn't the `fstp st0` actually the same opcode as `ffreep`? I can answer that in 20s... No, it is not, it's `DDD8` vs `DFC1` ... @PeterCordes any comments about performance of those two? Just by the instruction name the `ffreep` sounds more accurate to the original intent of programmer, i.e. I would prefer it over puzzling `fstp st0`. – Ped7g Jan 17 '18 at 10:31
  • @Ped7g: Yes, if Intel documented `ffreep`, it would be the canonical way to pop the x87 stack without doing anything else. It might avoid using an execution unit for the register copy, and just update the register-renaming state. But it turns out that `ffreep`'s internal implementation isn't optimized, I guess because it's slightly rare. Even on Pentium III, `ffreep` was 2 uops, while `fstp` and `ffree` were only 1 each. On Sandybridge-family, both fstp and ffreep are single-uop, same performance. Agner doesn't list timings for `ffreep`, only `ffree` being faster than `fstp`. – Peter Cordes Jan 17 '18 at 15:49
  • @Ped7g: ref.x86asm.net's geek64 table [notes that `ffreep` is mentioned in Intel and AMD optimization manuals](http://ref.x86asm.net/geek64.html#gen_note_FFREEP_DF_1), with section references. They don't say whether it's recommended or discouraged, though. On modern AMD, `ffreep` is faster than `fstp` and takes no execution unit if it performs the same as `ffree`. Maybe Instlatx64 has data on it, but I didn't check. – Peter Cordes Jan 17 '18 at 15:53
3

emms can also be used to mark every member of the f.p. stack as free. This has the advantage over finit that it doesn't change any flags in the f.p. control or status words (exception masks, etc.).

CP Taylor
  • According to Agner Fog's tables, EMMS on AMD Bulldozer/Ryzen is as fast as 1 FFREE. But it can be pretty slow on Intel: 31 uops, one per 18cycle throughput on Sandybridge. (Better on Skylake: 10 uops, one per 6c throughput, so it's only somewhat worse performance than 8x `FFREE st(i)` instructions.) – Peter Cordes Jun 26 '17 at 20:16
1

There are several instructions that can do what you are looking for. FINCSTP increments the stack-top pointer (and FDECSTP decrements it) without doing anything else; FFREE marks a register as empty without touching the stack-top pointer. The solution mentioned above, using FADDP or FMULP, is often nicer, though.
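
A minimal sketch of the manual route, in the question's AT&T syntax (a fragment, not a complete program). Because a pop moves the stack-top pointer upward, discarding the top element pairs FFREE with FINCSTP, as one of the comments on the question suggests:

```asm
        ffree   %st(0)          # tag the top register as empty
        fincstp                 # TOP = TOP + 1: the old st(1) becomes st(0)
```

Repeating the pair once per leftover element empties the stack when the number of elements is known.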

You should consider downloading the Intel Architecture Manuals. They contain the full instruction set of the Intel CPU family.

PMF