I'm computing the average of 3 marks:

g0  dw  70
g1  dw  100
g2  dw  65  

with

xor rax, rax
xor rcx, rcx

mov ax, [g0]
inc rcx

add ax, [g1]
inc rcx 

add ax, [g2]
inc rcx

xor rdx, rdx

idiv rcx

The grades don't need to be words: a byte is enough for each grade and for the average, but the sum must be a word to avoid overflow (with this algorithm, that is).

How can I convert the grades to bytes? Using db isn't enough, because then I would have to change ax to al, but it would cause an overflow at the end. I cannot instruct the mov/add to only take a byte from [g*], as it would cause a mismatch in the operand sizes.

I'm using yasm.

J. Poe
  • if you know the sum is word only, you can do the `dx:ax / cx` division only by `idiv cx` ... in most of the cases in x86-64 using word registers doesn't bring performance advantage, or even may incur extra penalties due to more complex register management of modern CPU, but in case of DIV/IDIV instruction the 64 bit variants may be still considerably slower than 32/16/8 bit ones. So it's a thing to consider (as your code can be written as 16b only, although in x86-64 mode I would probably use 32b directly for performance reasons (shorter instruction opcodes)). – Ped7g Sep 27 '18 at 18:37
  • and if you insist on `idiv` (signed division), the other parts should be signed too: use `movsx` instead of the suggested `movzx`, and `cqo` to sign-extend `rax` into `rdx:rax` instead of `xor rdx,rdx`. Otherwise use `div` for unsigned division with your code and zx485's. – Ped7g Sep 27 '18 at 18:42
  • @Ped7g I asked this as I thought I could benefit from using less bytes. Shall I keep this in mind only if on short memory constraints? – J. Poe Sep 27 '18 at 19:18
  • Particular situation may require particular solution, so keep practicing anything you can think of. The performance questions on x86-64 are never simple, the machine is very complex. Your way of changing marks to 8 bit type will conserve memory, so in case you would have billions of marks on input side, it definitely makes sense (but then you would need more than 64 bit for the total sum, pair of two 64b registers would provide 128b sum total, which would be enough in this particular case of couple of billions of marks). Shorter code does make it easier for L1 instruction cache, etc... ... – Ped7g Sep 27 '18 at 19:39
  • and then there are a lot more complex (minor!) penalties for some non-obvious things, like using parts of registers in an unpredictable way, forcing the CPU to merge the results of two independent code-lines back into one value at some instruction and limiting how much the out-of-order execution can reorder. Anyway, focus first on writing 100% correct code (i.e. signed div vs unsigned values is a major issue compared to subtle optimizations) which is easy to maintain. Once you have working code, you can measure its performance, see where the bottleneck is, and address that part. – Ped7g Sep 27 '18 at 19:42

1 Answer

You can change the variables to bytes if you use another register for the addition. So the following is possible:

g0  db  70
g1  db  100
g2  db  65  

Use the MOVZX instruction and indicate the memory reference size BYTE:

xor ecx, ecx              ; clear counter register and break dependencies

movzx eax, BYTE [g0]      ; movzx loads g0 and fills the upper bytes with zeroes
inc ecx

movzx edx, BYTE [g1]      ; load the byte at g1 into EDX, zero-extended
add eax, edx              ; add the widened integers
inc ecx 

movzx edx, BYTE [g2]      ; writing EDX also zeroes the upper half of RDX, but 32-bit is all we need
add eax, edx
inc ecx

xor edx, edx
div ecx                   ; unsigned division of EDX:EAX by ECX (= 3)
                          ; quotient in EAX, remainder in EDX
;mov [average], al        ; or do whatever you want with it.

There's also no need to use 64-bit operand size. 32-bit is the "standard" operand-size for x86-64 for most instructions.

Of course, you can change the eax and edx register references to rax and rdx, respectively, because the values have been zero-extended to the full register width. If you had more than 2^32 / 100 grades to add, you could use that to avoid overflow.
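
As a rough illustration of that 64-bit variant, here is a minimal sketch. It assumes Linux x86-64 and a static, non-PIE link (for example `yasm -f elf64` followed by `ld`); the `_start` entry point and the use of the exit status to report the result are assumptions added for the example, not part of the original answer.

```nasm
; Sketch only: assumes Linux x86-64, static non-PIE executable.
section .data
g0  db  70
g1  db  100
g2  db  65

section .text
global _start
_start:
    movzx eax, BYTE [g0]      ; writing EAX also zeroes bits 63:32 of RAX
    movzx edx, BYTE [g1]      ; likewise, RDX now holds the zero-extended byte
    add   rax, rdx            ; 64-bit add on the already-widened values

    movzx edx, BYTE [g2]
    add   rax, rdx

    mov   ecx, 3              ; divisor: number of grades
    xor   edx, edx            ; RDX:RAX = zero-extended sum for unsigned DIV
    div   rcx                 ; 64-bit unsigned division: quotient in RAX

    mov   edi, eax            ; exit status = average (78 for these values)
    mov   eax, 60             ; Linux sys_exit
    syscall
```

The loads stay 32-bit `movzx` because the implicit zero-extension makes a `movzx rax, ...` form unnecessary; only the `add` and `div` need 64-bit operand size, and the 64-bit `div` is the part that costs extra.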

If you're repeating this a fixed number of times, use `mov ecx, count` instead of that many `inc` instructions. `inc` would make sense if this were a loop body incrementing a pointer into an array of grades, but part of the benefit of fully unrolling is not having to `inc` anything to count iterations.
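
For comparison, a minimal sketch of the looped form, where the count is an assemble-time constant and the increment serves as the array index. It assumes Linux x86-64 and a static, non-PIE link (for example `yasm -f elf64` followed by `ld`). The names `grades`, `COUNT`, and `.next` are hypothetical, and returning the average as the exit status is only a convenient way to observe it.

```nasm
; Sketch only: assumes Linux x86-64, static non-PIE executable.
section .data
grades: db 70, 100, 65              ; one byte per grade
COUNT   equ $ - grades              ; number of grades, computed at assemble time

section .text
global _start
_start:
    xor   eax, eax                  ; sum = 0 (also clears the upper half of RAX)
    xor   esi, esi                  ; index = 0 (also clears the upper half of RSI)
.next:
    movzx edx, BYTE [grades + rsi]  ; zero-extend one byte-sized grade into EDX
    add   eax, edx                  ; accumulate in 32 bits
    inc   esi                       ; advance the index
    cmp   esi, COUNT
    jb    .next                     ; loop while index < COUNT

    xor   edx, edx                  ; EDX:EAX = zero-extended sum for unsigned DIV
    mov   ecx, COUNT                ; divisor loaded directly, no counting needed
    div   ecx                       ; quotient (average) in EAX, remainder in EDX

    mov   edi, eax                  ; exit status = average (78 for these values)
    mov   eax, 60                   ; Linux sys_exit
    syscall
```

After running the program, `echo $?` in the shell prints the exit status, which is the integer average.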

Peter Cordes
zx485
  • does yasm use `byte` size by default without specifying it explicitly (in the `movzx`)? Or does it track it from the `db` like MASM? (I'm curious) – Ped7g Sep 27 '18 at 18:29
  • and in this particular case you can then remove `xor rax,rax`, as the first `movzx` will set it fully. Also, to teach the OP something new, you can show that `movzx eax,..` is enough, as a 32b register write clears the upper 32 bits automatically (the same goes for the OP's original `xor rax,rax`, where `xor eax,eax` is enough to get the same result but has a 1-byte-shorter encoding). – Ped7g Sep 27 '18 at 18:32
  • Thanks for your additions. I added the missing `BYTE` directives for NASM/YASM. – zx485 Sep 27 '18 at 18:35
  • @Ped7g In zx485's code I should use `xor ecx, ecx` and `xor edx, edx`; should I always prefer them over their `r64` counterparts? – J. Poe Sep 27 '18 at 19:20
  • @J.Poe basically yes, they are 1 byte shorter encoded, and do the exactly same operation. But don't stress yourself over it too much, if you prefer for the source clarity at this moment `rdx`, use that. But you will run into that clear-upper-32 anyway once you will use 32 bit parts of registers in 64b mode, so you should understand that feature is there by design (by AMD company). (as the adage goes, "premature optimization is root of all evil" :) ... It's not true, bloated careless programmers are the root, but premature optimizations are in top 5 at least :) ) – Ped7g Sep 27 '18 at 19:52
  • @J.Poe: re: efficiency: 64-bit `idiv rcx` is about 3x slower than `idiv ecx`. ([C++ code for testing the Collatz conjecture faster than hand-written assembly - why?](https://stackoverflow.com/a/40355466)). The sum can't be > 2^32. Also, `div ecx` would be more correct, because you're zeroing RDX instead of sign-extending RAX into RDX:RAX with `cqo` or EAX into EDX:EAX with `cdq`. I don't see the point of using `inc` 3 times instead of `mov ecx, 3`, if the number of elements is an assemble-time constant. – Peter Cordes Sep 28 '18 at 02:25
  • I wanted to use this as a duplicate for another question that needs movzx loads, but using AX instead of EAX is really silly and a poor example. I'm going to edit it to make it saner, if that's ok. – Peter Cordes Sep 29 '20 at 03:15