I like dabbling with optimization, primarily for space, by thinking about algorithmic flow. After experimenting with a few different scenarios, this is the best one I could dream up. Assume that preserving DF is a must, because when this procedure is called, DF will sometimes be set and other times not.

OBJECTIVE: Return DF to caller unaltered.

00  55       push   bp
01  89E5     mov    bp, sp
03  9C       pushfw

04  FD       std

    ... Function/Subroutine body

05  58       pop    ax          ; Retrieve original flags
06  F6C404   test   ah, DF      ; Was DF already set
09  7506     jnz    Exit        ; If so, nothing to do

; Reset DF without altering state of any other flags

0B  9C       pushfw
0C  8066FFFB and byte [bp-1], 0FBH  ; Strip DF from MSB of EFLAGS
10  9D       popfw

       Exit:
11  C9       leave
12  C3       ret

Excluding STD which will actually be part of the main body, this prologue/epilogue to address objective comes in at 18 bytes.


Has anyone implemented a method that comes in tighter than this?

Shift_Left
  • What's wrong with just `popfw` to preserve all flags? Side note, most calling conventions mandate `DF` be cleared so usually just doing that is enough. Actually even in your code you can simply do `cld` instead of the `push/and/pop`. – Jester Feb 24 '18 at 02:24
  • I highly recommend either requiring DF to be cleared on function boundaries, or treat it as call-clobbered (i.e. you always need `cld` before a `rep`-string instruction). `popf` is *slow* (like 9 uops / one per 18 cycles throughput on Haswell), so saving/restoring an unknown DF state sucks. IIRC, `rep movs` / `rep stos` are only fast (on recent Intel) when DF is cleared, or at least they're fast*er*, so you'd only ever want `std` if it's a big win for code-size. – Peter Cordes Feb 24 '18 at 02:30
  • If the requirements are as stated then you just have to push the flags at the beginning with `pushfw` and then at the end restore it with `popfw` . You can then change the direction flag to whatever you want in the body of the function. I agree with the others about coming up with a calling convention that assumes the same DF state upon entry. If one was writing an interrupt handler then where DF may be set or cleared then the flags are already saved and restored automatically. – Michael Petch Feb 24 '18 at 03:14
  • The other possibility here is that you have not asked the question very clearly. Is it your intention for the function to change the flags so they can be checked by the caller and that is why you are trying to restore just the DF flag at the end? – Michael Petch Feb 24 '18 at 04:25
  • @MichaelPetch: IIRC, the OP asked a question about returning error codes in CF recently, so I think that is the explanation for this very clunky code. I was just thinking that `lahf` / `popf` / `sahf` at the end of a function might be good (for code-size) if you want to restore the caller's FLAGS other than condition codes. OF is outside of the low 8 of FLAGS, though. Of course even better is to restore DF after the last use of DF, but before whatever sets CF and other flags before function exit. – Peter Cordes Feb 24 '18 at 04:32
  • @PeterCordes considering DF call-clobbered is the way this started out, but I wanted to avoid the relatively redundant code from the 19 other places this procedure will be called, which would essentially equate to 160 bytes extra code versus 18 in the procedure. I had dismissed `LAHF` and `SAHF` as they were limited to low order bits of status register, but your last comment has given rise to an idea that I'll post if it works out. – Shift_Left Feb 24 '18 at 05:29
  • @MichaelPetch I had hoped that would have been assumed by context of the code, and in an attempt to not be too wordy, I can see how I was a little too succinct. That is the intent though: to only return DF to caller as it was passed. On that note, **my** ABI only depends on `ZF` & `CF` and maybe `SF` at some point, but that hasn't been a requirement to date, so the state of those flags needs to be returned to caller. – Shift_Left Feb 24 '18 at 05:37
  • @PeterCordes your points about speed will be very useful in protected and long modes, which doesn't mean to imply they aren't significant in real, but even though my boot and initialization process is more elaborate than it needs to be, it is in the grand scheme of things going to be short lived. But these snippets will eventually become the building blocks for their 32 & 64 counterparts. – Shift_Left Feb 24 '18 at 05:52
  • Your objective is very lacking. It only states `OBJECTIVE: Return DF to caller unaltered.` To that end, saving all the flags at the beginning and restoring all the flags at the end meets that objective. Your objective is far more than stated, and that makes your question confusing. I saw your objective and didn't even bother looking at the code at first. I didn't need your current code to meet that requirement. The objective itself as worded is very simple. When I read your code after the first comment I realized the code had the objective, and the objective didn't. – Michael Petch Feb 24 '18 at 05:53
  • Your objective really is to return all the flags the function sets and restore just the DF flag to its original setting upon exit. – Michael Petch Feb 24 '18 at 05:56
  • @Shift_Left: If you care only about code-size, then choose between having all flags (including DF) call-clobbered (and used as part of the return value if you want) vs. requiring DF clear on function entry / exit (except for private helper functions, which can use custom calling conventions). Functions that want to use `std` can then simply use `cld` before any `call` or `ret` they make. This is the choice that mainstream 32 / 64-bit calling conventions make. In the rare case that you need to make a function call inside a loop that traverses in descending order, maybe don't use `lods`... – Peter Cordes Feb 24 '18 at 06:15
  • Is the function body long enough (or performance requirements low enough)? Then you may instead test DF at beginning and self-modify epilogue to `cld/std` by that. Like: `pushfw` `pop ax` `shr ah,2` `add ah,0FCh` `mov [epilogue_restore_df_instruction],ah` = 13B (if you are free to play a bit with `si` or `di`, you can test DF even without `pushfw`, but that will have bigger code size, but better performance) EDIT: wrong code, will not produce strictly FC/FD opcode, wait a sec for fix... – Ped7g Feb 24 '18 at 06:28
  • @Ped7g: interesting, but don't you need to mask the top byte of FLAGS? You seem to be assuming known values for the bits other than DF. (And BTW, the self-modifying-code machine-clear threshold is something like a store within 2k or 4k of IP, so it would have to be a *very* large function to not have massive slowdowns). – Peter Cordes Feb 24 '18 at 06:31
  • @PeterCordes yes, it was first version and done just as by-thought, so obviously wrong one... anyway, fix is again 13B: `pushfw / pop ax / xor al,al / shr ah,3 / adc al,0FCh / mov [epilogue_restore_df_instruction],al` (12B in prologue, +1B the cld/std placed in epilogue) ... and yes, it will be very slow. – Ped7g Feb 24 '18 at 07:23
  • And about practical solution for OP ... basically use always DF=0 everywhere, there are very few reasons to use DF=1, and such code should be very small, and do `std+cld` before/after. If such code itself needs to call subroutine, do `cld` `call` `std` if you can't avoid that. That's just +2B for every block needing DF=1, and those should be extra rare, as normally you don't need that at all except things like `memmove` for overlapping regions. – Ped7g Feb 24 '18 at 12:01

1 Answer

You have one good option:

  • require DF clear on function entry / exit (except for private helper functions, which can use custom calling conventions). Functions that want to use std can then simply use cld before any call or ret they make.

    Using other flags as part of the return value is trivial: cld and std don't affect them.

    In the rare case that you need to make a function call inside a loop that traverses in descending order, maybe don't use lods or other string instructions at all. They're not magic, and the occasional dec si / mov al, [si] or whatever is not a disaster for code-size. Or it means a cld and std inside the loop, only 1 byte each.

    Most of the time you want DF clear so you're looping upwards, in which case you can have function calls inside loops without any issue. (Not all of the time, but this design is the best for the common case, and the uncommon cases can be handled without too much pain).
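As a rough illustration of this convention, here is a minimal sketch in NASM syntax. The routine name `copy_down`, its register interface, and its error policy are all hypothetical, chosen only to show `std` / `cld` bracketing the string instruction while CF still carries a return status.

```nasm
bits 16

; Hypothetical routine: copy CX bytes in descending order, DS:SI -> ES:DI.
; Assumed convention: DF=0 on entry and on exit. Returns CF=1 on error.
copy_down:
        jcxz    .fail           ; nothing to copy: treated as an error here (assumed policy)
        std                     ; string ops now decrement SI and DI
        rep movsb               ; descending copy; does not modify any flags
        cld                     ; put DF back to 0, as the convention promises
        clc                     ; CF=0: success (cld only touches DF)
        ret
.fail:
        stc                     ; CF=1: failure; DF was never changed on this path
        ret
```

The extra cost of the convention in this sketch is the single one-byte `cld`; no `pushf` / `popf` pair is needed.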

One fairly good option:

  • Have all flags (including DF) call-clobbered (and used as part of the return value if you want). Every string op needs a cld or std inside the same function. So this isn't really a very good option.

And one mediocre option:

  • Your current calling convention where DF is call-preserved and has unknown value on function entry. Every function that uses a string instruction needs a cld or std because it can't assume anything, and a save/restore of FLAGS. (Unless you optimize based on known callers and the state they put DF into).

The only time this has an advantage is in a loop containing a function call that also wants DF set.

When you want to return a status in the condition codes of FLAGS as well as save/restore DF, simply put the instruction that sets condition-codes in FLAGS after the popf that restores the caller's DF.
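A minimal sketch of that ordering, assuming a hypothetical routine `fill_down` whose status is an unsigned comparison against a limit in BX (both the name and the meaning of the status are illustrative, not taken from the question):

```nasm
bits 16

; Hypothetical routine: store AL into CX bytes, descending from ES:DI.
; DF is call-preserved; CF is the return status.
fill_down:
        pushf                   ; save the caller's FLAGS, including DF
        std                     ; descending stores
        rep stosb               ; leaves CX = 0; does not modify any flags
        popf                    ; caller's DF (and every other flag) is back
        cmp     di, bx          ; status set AFTER popf: CF=1 if DI < BX (unsigned)
        ret                     ; cmp does not touch DF, so the caller's DF survives
```

Because `cmp` writes only the arithmetic flags, nothing needs to be merged: the restore and the status simply happen in the right order.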

In functions that don't have interesting status in condition codes, simply use popf to restore all the caller's FLAGS. You don't need to merge the caller's DF into the current function's FLAGS in functions that don't return anything interesting in FLAGS.

In the rare case where you can't easily move the last string instruction before the last flag-setting instruction, it might be smaller to emulate the string instruction. Instead of inc or dec, leave flags untouched with lea si, [si +- 1]. (Both SI and DI are valid in 16-bit addressing modes, so lea is usable on them.)

None of lods / stos / movs are magical, not even their rep versions. If you don't care about performance (only code-size) you could emulate even rep movs without touching flags, using the slow loop instruction and a spare register (save/restore one with push/pop if needed)
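The two emulations described above might look like the following sketch. The label names are hypothetical, and AL is assumed to be the spare register; every instruction used (`mov`, `lea`, `loop`, `jcxz`) leaves FLAGS untouched.

```nasm
bits 16

; Emulates a descending lodsb (DF=1 behaviour) without writing FLAGS.
lods_down_noflags:
        mov     al, [si]        ; load the byte at DS:SI
        lea     si, [si-1]      ; step SI down; lea does not modify flags
        ret

; Emulates a descending rep movsb (DS:SI -> ES:DI, CX bytes) without
; writing FLAGS. Clobbers AL, which is assumed to be free here.
movs_down_noflags:
        jcxz    .done           ; rep with CX=0 copies nothing; loop would run 65536 times
.next:
        mov     al, [si]        ; read source byte
        mov     [es:di], al     ; write destination byte
        lea     si, [si-1]      ; descending order, flags preserved
        lea     di, [di-1]
        loop    .next           ; decrements CX and branches without writing flags
.done:
        ret
```

This is slow compared to the real string instructions, so it only makes sense when code size is the sole concern.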


 0B  9C       pushfw
 0C  8066FFFB and byte [bp-1], 0FBH  ; Strip DF from MSB of EFLAGS
 10  9D       popfw

Your code-sample doesn't restore the caller's DF; it unconditionally clears it. It's equivalent to cld.

To restore the caller's DF, you'd need to extract the DF bit from the first pushf, merge it into the FLAGS value from the pushf at the end of the function, and then popf. This is obviously possible, but significantly less efficient (and larger in code-size) than what you show.
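For reference, one way such a merge could be written is sketched below. It assumes the same `bp` frame as the question, a stack-balanced function body, and that AX is free to clobber at exit; the label name is hypothetical.

```nasm
bits 16

merge_df_example:
        push    bp
        mov     bp, sp
        pushf                       ; [bp-2] = caller's FLAGS
        std
        ; (function body goes here, ending with whatever sets ZF/CF for the caller)
        pushf                       ; capture the FLAGS the function wants to return
        pop     ax                  ; AX = FLAGS to return; SP is back at bp-2
        and     ah, 0FBh            ; drop the function's own DF (bit 10 = bit 2 of AH)
        and     byte [bp-1], 04h    ; keep only the caller's DF in the saved high byte
        or      ah, [bp-1]          ; merge the caller's DF into the return value
        mov     [bp-2], ax          ; overwrite the saved FLAGS with the merged word
        popf                        ; load merged FLAGS: function's status + caller's DF
        pop     bp
        ret
```

The `and` / `or` instructions do write flags along the way, but that is harmless because the final `popf` replaces FLAGS with the merged value.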


Also note that popf is slow: on Haswell it's 9 uops, and has one per 18 cycle throughput. If you only care about code-size, not performance, designs that require pushf/popf aren't necessarily bad, but it seems to me that requiring DF clear on entry/exit will win for code size most of the time, as well as performance.

This is what 32 and 64-bit calling conventions choose for handling DF, and I don't see why it wouldn't work well in 16-bit code, too.

Peter Cordes
  • The conditional `test ah,4` just before code between 0AH and 11H takes care of unnecessarily resetting DF as bit 2 in AH is DF. To kill two birds with one stone and because SI will always increment and DI always decrement in the body, `LODS` or `STOS` can only work on one or the other anyway, so using `INC` & `DEC` respectively addresses speed and probably even code size simultaneously. – Shift_Left Feb 24 '18 at 15:58
  • @Shift_Left: Ah I see. But `test` writes all flags, so you leave the function with flags set according to that `test`, not anything interesting from the function logic. So you might as well just `popf` and restore the caller's flags. And it's still insane: if you're going to use a conditional branch, jump over a `cld` instead of jumping over that `pushf/and/popf`. (Possibly you could hack something with `jcxz` to conditionally branch without destroying condition codes? But x86 can't AND and shift without writing flags, unless you use MMX/SSE2.) – Peter Cordes Feb 24 '18 at 16:09