
How is the Auxiliary Flag calculated in x86 Assembly?

The majority of the resources I can find explain that the Auxiliary Flag is set to '1' if there is a carry from bit 3 to bit 4.

Wiki:

It indicates when a carry or borrow has been generated out of the least significant four bits of the accumulator register following the execution of an arithmetic instruction.

Example:

mov al, -14    ; stored binary pattern: 1111 0010
mov bl, -130   ; stored binary pattern: 0111 1110
sub al, bl     ; 1111 0010 - 0111 1110

Result: 1111 0010 - 0111 1110 will be calculated as 1111 0010 + 1000 0010 using two's complement, giving the result 0111 0100 with OF set.

In the example given, AF is set (=1). I do not understand why, as I cannot see that there has been a carry from bit 3 to bit 4. The addition of the least significant nibbles, 0010 + 0010, equals 0100: no carry. The least significant four bits of the accumulator register have changed from 0010 to 0100 ('2' to '4'), so surely there has been no carry from the lower nibble to the higher nibble?

Please could someone kindly explain where my thinking has gone awry?

I have a suspicion that the abundance of 'negatives' is throwing me off at some point, as I have tried several different examples in the debugger and they all act in accordance with my expectations, bar this one example.

Andrew Hardiman
  • `0 - 1` produces a borrow. AF is set by `sub` the same way CF is for the top bit. – Peter Cordes Jul 13 '18 at 13:38
  • @PeterCordes Ah! Yes! Thank you. However, this leaves me even more confused. My understanding was that the CPU performed subtraction by taking the two's complement of the subtrahend and performing an addition, discarding the 'overflow' from the result. Indeed, I was under the assumption that the `sub` mnemonic was there for the benefit of the programmer. In our example, the instruction `(-14) - (-130)` becomes `242 + 130 = 372`, the lower byte containing 116 (the correct answer) and OF set to 1? There is no actual subtraction, and therefore no borrow has taken place, and consequently AF should not be set. – Andrew Hardiman Jul 13 '18 at 14:36
  • BTW: `mov bl,-130`: -130 doesn't fit in 8 bits, so you end up with `126`, i.e. `al = -14 - +126`, or, treating it as 8-bit unsigned, `al = 242 - 126`. – Ped7g Jul 13 '18 at 14:41
  • @Ped7g Where I'm confused: regardless of whether I enter the value `-130` or `126`, the actual stored binary pattern is `0111 1110`. Likewise for `-14` or `242`, stored as `1111 0010`. Subtract one binary pattern from the other: `1111 0010 - 0111 1110`. This equates to `0111 0100`. However, my understanding was that in order to perform a subtraction, the CPU simply took the two's complement of the subtrahend and performed an addition, like so: `1111 0010 + 1000 0010`, setting OF if the result exceeds 8 bits. If this is indeed the case, I'm confused as to why AF has been set, representing a borrow. – Andrew Hardiman Jul 13 '18 at 15:54
  • @Ped7g I was reading the answers given here (https://stackoverflow.com/questions/5793740/how-does-cpu-do-subtraction#5793952), which state that every negative number is converted to two's complement. Therefore, to find `A-B`, we find the two's complement of B and add. So `242 - 126` in my above example becomes `242 + 130`. Where does the CPU 'see' a subtraction, and a borrow, and consequently set AF? – Andrew Hardiman Jul 13 '18 at 16:20
  • The answer you linked does not say that: it says "we ***can*** just negate B and add". That does not mean the processor does that. The processor has no idea whether you are working with signed or unsigned values. – Weather Vane Jul 13 '18 at 18:53
  • I do believe the x86 has separate subtraction implemented in HW, which may or may not resemble addition with a negated value, but in the end that's just implementation, which has to follow the definition. `1111_0010 - 0111_1110` goes fine for b0 and b1 (producing 00); then for b2, 0-1 = 1 + borrow; for b3, 0-2 = 0 + borrow, and this borrow goes to AF. Then b4, 1-2 = 1 + borrow; b5 and b6 are the same; and finally b7, 1-1 = 0 + no_borrow (this one goes to CF). In total, 0111_0100 is the result, with CF=0, OF=1, AF=1, ZF=0, PF=1, SF=0. BTW, OF is related to signed math, and `-14 - 126 = -140`, which overflows: OF=1. – Ped7g Jul 13 '18 at 19:00
  • @Weather Vane Danish94’s answer: “every negative number is converted to 2's complement”, not the accepted answer. – Andrew Hardiman Jul 13 '18 at 19:00
  • That part about "every negative number is converted" happens at compile time: it is how the -14 gets encoded as 1111_0010 in 8 bits. It's not a conversion done by the CPU at runtime; the CPU is already provided with that bit pattern, which is enough for the `add`/`sub` instructions to do their work per bit, without really bothering about whether the value was signed or unsigned, whether it was encoded as two's complement, or whether somebody is instead adding 8-bit flags for whatever reason instead of or-ing them, etc. The `sub` needs just the bit values, no context. – Ped7g Jul 13 '18 at 19:04
  • Note that the Auxiliary Flag is considered when working with BCD values, by instructions such as `AAA` and `DAA` (see the BCD sketch after this comment thread). Generally the flags are set regardless of whether you will need them: the processor does not know your intention when adding, whether the values are signed or unsigned, decimal, etc. They are tested afterwards, as the programmer or compiler decides in context. For example, different flags are tested when comparing signed and unsigned operands, but the actual addition or subtraction is the same. – Weather Vane Jul 13 '18 at 19:24
  • @Ped7g Thank you. This all makes sense if the actual arithmetic under the hood is indeed a subtraction. My assumption in the original question was that a subtraction was in actuality performed as an addition, using the two's complement of the subtrahend (I'm sure I have read this somewhere, although I could well be mistaken!). In the case of an underlying binary addition, I could not make sense of the borrow in the lower nibble, and hence the setting of AF. I'm likely going too deep with it; I should probably walk before I can run! Thanks again. – Andrew Hardiman Jul 13 '18 at 20:25
  • I don't know if the math under the hood is subtraction; it may still be addition with some extra mile to set the flags "as subtraction". In programming you have to understand the difference between definition/contract/API and implementation. If you can fulfil the contract while implementing it as addition with extra flag handling, then you are free to do that as a HW designer. But the `sub` contract says that AF is set when a "borrow" is used when going from b3 to b4. If you are just learning assembly, don't bother about the implementation in the chip too much, because a modern x86 chip is "insane". – Ped7g Jul 13 '18 at 21:23
  • I mean, it's definitely good to have some basic idea, but actually only learning the basics from the beginning of the microprocessor era (like the 8080 and 8086) is essential from a programming point of view; every detail after that (caching, FPU history, microarchitecture, multi-core, etc.) is "diminishing returns", i.e. you will have to study a lot more and harder, and the direct effect on the code you write as a programmer will be smaller and smaller. If you are an expert on performance tuning, then you certainly have to understand even modern x86, but that usually takes years... – Ped7g Jul 13 '18 at 21:29
  • @Ped7g fantastic, thanks. I guess this was what I was looking for really, as the basic arithmetic from a source code point of view is pretty simple; the implementation is not crystal clear to me though, hence the confusion regarding the flags. Like I said, probably learning to run before I can walk. I suppose the reason I wanted to learn asm in the first place was because I had been using Python for a while and I was constantly asking myself questions about what things actually mean or how they work. Anyhoo, thanks again. If you want to transpose your comments into an answer, I can accept. – Andrew Hardiman Jul 13 '18 at 22:31
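
To make the BCD remark in the comments concrete: AF is exactly the "carry out of bit 3" that a decimal-adjust instruction needs in order to know the low BCD digit overflowed. Below is a rough Python sketch of what `add` followed by `daa` computes for packed BCD bytes; it is a simplified model of the documented DAA rule, not chip-exact pseudocode:

    def bcd_add8(a, b):
        # Packed-BCD addition of two bytes, roughly `add` + `daa`.
        r = (a + b) & 0xFF
        af = 1 if (a & 0x0F) + (b & 0x0F) > 0x0F else 0   # carry out of bit 3
        cf = 1 if a + b > 0xFF else 0                     # carry out of bit 7
        if (r & 0x0F) > 9 or af:        # low BCD digit overflowed: add 6
            r = (r + 0x06) & 0xFF
        if (r & 0xF0) > 0x90 or cf:     # high BCD digit overflowed: add 0x60
            r = (r + 0x60) & 0xFF
            cf = 1
        return r, cf

    print(hex(bcd_add8(0x38, 0x49)[0]))   # 0x87, i.e. BCD 38 + 49 = 87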

1 Answer


The `sub` instruction on x86 CPUs is a "real" instruction, present since the very first chip, the 8086; i.e. it's not some kind of assembler convenience which gets translated into negation + `add`. It has its own binary opcode, and the CPU itself is aware it should produce the result of a subtraction.

That instruction has a definition from Intel of how it affects the flags, and the flags in this case are modified "as if" a real subtraction were calculated. That's all you need to know when you are focusing on programming an algorithm or reviewing the correctness of some code. Whether the chip itself implements it as addition, with some extra transistors converting the flags to the "subtraction" variant, is an "implementation detail"; as long as you only want the result, it's not important.

The implementation details become important when you are tuning a particular piece of code for performance; then considering the inner architecture of the chip and the implementation of particular opcodes may give you ideas on how to rewrite the code in a somewhat more unintuitive, non-human way, often even with more instructions than the "naive" version, but with better performance, due to better exploitation of the inner implementation of the chip.

But the result is well defined and can't be changed by some implementation detail; that would be a "bug in the CPU", like the first Pentium chips, which calculated wrong results for certain divisions (the famous FDIV bug).

That said, the definitions of assembly instructions leak implementation details like no other language, because assembly instructions were designed half-way along the path between "what is simple to create from HW transistors" and "what makes some programming sense", while higher-level programming languages are a lot more biased toward "what makes sense", only reluctantly imposing some cumbersome limits from the HW implementation, like, for example, the value ranges of particular bit-sized variable types.

So being curious about the implementation, and about why certain things are defined as they are (like, for example, why `dec xxx` does NOT update the CF flag, while otherwise it is just `sub xxx,1`), will often give you new insights into how certain tasks can be written more effectively in assembly, how chips evolved, and which tasks are easier to compute than others.

But basics first. The `sub` instruction updates the flags as if a subtraction were calculated, and it is not aware of any context of the values it is processing; all it gets is the binary patterns of the operands, in your case 1111_0010 - 0111_1110. Interpreted as signed 8-bit math, that is "-14 - +126" (-130 doesn't fit into 8 bits, so it got truncated to +126; a good assembler will emit a warning/error there); interpreted as unsigned 8-bit math, it is "242 - 126". In the signed case the result should be -140, which gets truncated (overflow happens, OF=1) to the 8-bit value +116; in the unsigned case the result is +116 without unsigned overflow (carry/borrow CF=0).
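
If you want to check those numbers yourself, here's a quick sketch in plain Python (Python integers are unbounded, so the `& 0xFF` masks emulate the 8-bit truncation that the register does for free):

    a = -14 & 0xFF        # 0b1111_0010: 242 unsigned, -14 signed
    b = -130 & 0xFF       # -130 truncates to 0b0111_1110, i.e. +126
    r = (a - b) & 0xFF    # the 8-bit pattern that ends up in AL

    print(f"{a:08b} - {b:08b} = {r:08b} ({r})")  # 11110010 - 01111110 = 01110100 (116)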

The subtraction itself is well defined per bit:

         1111_0010
       – 0111_1110
       ___________
result:  0111_0100
borrow:  0111_1100
              ^ this borrow goes to AF
         ^ the last borrow goes to CF
         ^ the last result bit goes to SF
  All zero result bits sets ZF=1
  PF is calculated from only low 8 bits of result (even with 32b registers!)
  where PF=1 means there was even number of set bits, like here 4.

You can go from right to left and do per-bit subtractions, i.e. 0-0=0, 1-1=0, 0-1=1+b, 0-2=0+b, etc. (where +b signals the need for a "borrow", i.e. the first operand gets lent +2 (+1 in the next bit) to make the result a valid bit value of +0 or +1).
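
Here's a minimal Python sketch of exactly that right-to-left walk (`sub8` is just my name for it; the loop implements the definition above, not how any real chip is wired):

    def sub8(a, b):
        # Bit-by-bit 8-bit subtraction with borrow propagation.
        result = borrow = af = 0
        for i in range(8):
            bit = (a >> i & 1) - (b >> i & 1) - borrow
            borrow = 1 if bit < 0 else 0    # did this bit position need a borrow?
            result |= (bit & 1) << i
            if i == 3:
                af = borrow                 # borrow out of bit 3 -> AF
        return result, af, borrow           # final borrow out of bit 7 -> CF

    r, af, cf = sub8(0b1111_0010, 0b0111_1110)
    print(f"{r:08b} AF={af} CF={cf}")       # 01110100 AF=1 CF=0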

BTW, exactly how OF is set at the bit level is a bit more tricky; there are some nice Q+As here on SO you can search for. But from a math point of view, if the result gets "truncated" in the signed interpretation (as in this example), then OF is set. That's how it is defined (and implementations conform to that).
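
One common bit-level formulation of that rule (my wording, not Intel's pseudocode): for `a - b`, signed overflow occurs when the operands have different sign bits and the result's sign bit differs from the minuend's. In Python:

    def of_sub8(a, b):
        r = (a - b) & 0xFF
        # operand signs differ AND result sign != first operand's sign
        return ((a ^ b) & (a ^ r)) >> 7 & 1

    print(of_sub8(0b1111_0010, 0b0111_1110))  # 1, matching -14 - 126 = -140 (overflows)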

As you can see, all flags are set as defined. The `sub` doesn't even know whether the first argument is -14 or +242, as that doesn't change anything at the bit level; the instruction will just subtract one bit pattern from the other and set up all the flags as defined, done. What the bit patterns represented, and how the flag results will be interpreted, is up to the following instructions (the logic of the code), but is not a concern of the `sub` itself.
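
To illustrate that last point: the very same subtraction answers both the unsigned question (`jb` tests CF) and the signed question (`jl` tests SF != OF). Reusing the hypothetical `sub8` and `of_sub8` helpers from the sketches above:

    a, b = 0b1111_0010, 0b0111_1110
    r, af, cf = sub8(a, b)
    sf = r >> 7                       # sign bit of the result
    of = of_sub8(a, b)

    print("unsigned 242 < 126 ?", bool(cf))        # False: jb would not be taken
    print("signed   -14 < 126 ?", bool(sf != of))  # True:  jl would be taken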

It's still possible that the subtraction is implemented by addition inside the CPU (although that's very unlikely; it's not difficult to implement subtraction), with some extra flag handling to fix up the flags, but that depends on the particular chip implementation.

Mind you, the modern x86 is quite a complex beast: it translates classic x86 instructions into micro-code operations first, reorders them to avoid stalls (like waiting for a value from a memory chip) when possible, sometimes executes several micro-operations in parallel (up to 3 operations at one time IIRC), and uses a hundred-plus physical registers which are dynamically renamed/mapped to the architectural ones (like al and bl in your code). I.e. if you were to copy those 3 lines of asm twice, one under the other, the modern x86 CPU would probably execute them largely in parallel with two different physical "al" registers, and the next code asking for the result in "al" would get that value from the later one; the first one is obviously discarded by the second `sub`.

But all of this is defined and created to make the observable result "as if the classic 8086 sequentially ran each instruction separately over a single real physical AL register", at least in the single-core sense (in a multi-core/thread setup there are additional instructions allowing the programmer to serialize/finalize the results at a certain point of the code, so that another core/thread may check them and see them in a consistent way).

So as long as you are just learning x86 assembly basics, you don't really need to know that there's a microarchitecture inside the modern x86 CPU translating your machine code into a different one (which is not directly available to programmers; there's no "modern x86 micro-assembly" where you could write those micro-ops directly, you can only produce regular x86 machine code and let the CPU handle that internal implementation itself).

Ped7g
  • None of the OoO exec stuff matters for add/sub: it's a single-uop instruction on all CPUs. My understanding is that each integer execution unit has a single add/sub/and/xor/or unit that can do any of those operations depending on some control lines that e.g. disable carry propagation (add -> xor). – Peter Cordes Jul 15 '18 at 00:53
  • Re: max uops that can execute in parallel: sustained throughput of 7 on Skylake (http://agner.org/optimize/blog/read.php?i=415#857), higher in bursts when an input becomes ready that uops for every port were waiting for (4x ALU ports, 3x load/store-address, 1x store-data). Fused-domain front-end throughput of 4, but micro-fusion of stores and loads allows getting work for more execution units through that bottleneck. Ryzen has a wider front-end (5 or 6 uops), but a similar back-end. Maybe more SIMD ports, and they don't compete with integer ALUs for ports. Burst: 8x ALU + load + store. – Peter Cordes Jul 15 '18 at 00:58
  • @PeterCordes the OoO "matters" if you have multiple add/sub instructions in the code, like when you copy the source code in question 3-4 times, one under the other: the CPU will probably "skip" through the pointless copies, discarding results as fast as possible, and put the last one in the reordering queue as near the front as possible once it detects that the following code depends on the result (postponing the earlier ones), I guess... anyway, this is far beyond my tiny knowledge of modern x86 CPUs, so I'd better not get too deep into this; the paragraph in the answer was more like a *"hic sunt dracones"* warning. – Ped7g Jul 15 '18 at 08:41
  • I meant for how the execution unit is designed. Obviously the OoO machinery benefits the code running on the CPU! – Peter Cordes Jul 15 '18 at 08:42
  • According to Intel, Sandybridge-family has some support for discarding uops when their outputs won't be read, at least for the flag-setting uops that are part of a variable-count shift ([INC instruction vs ADD 1: Does it matter?](https://stackoverflow.com/q/36510095)), and [also for `rdrand`](https://stackoverflow.com/questions/10484164/what-is-the-latency-and-throughput-of-the-rdrand-instruction-on-ivy-bridge/11042778#11042778). In this case, the front-end will be the bottleneck: dependencies aren't known until after register-renaming so probably no cancelling would happen. – Peter Cordes Jul 15 '18 at 08:50
  • @Ped7g Thank you for this; I’m learning in isolation, so answers like this are really helpful. I’ve found a couple of Q+A that tie in nicely to the main points discussed: [carry/overflow & subtraction in x86](https://stackoverflow.com/questions/8965923/carry-overflow-subtraction-in-x86?rq=1) and [Overflow and Carry flags on Z80](https://stackoverflow.com/questions/8034566/overflow-and-carry-flags-on-z80/8037485#8037485). They explain the math behind the overflow etc. Hopefully they will come in useful. – Andrew Hardiman Jul 15 '18 at 16:29