How can I multiply by a float in Assembly? I have a value in ebx and want to multiply it by 0.65:
mov eax, ebx
mov ecx, 0x0.65 ;; how do I do this?
mul ecx
mov ebx, eax
Do you want to treat your input as signed or unsigned? If it's signed, converting to double and back is actually easy and fast (with SSE2): look up CVTSI2SD / MULSD / CVTSD2SI in the insn set ref (links in the x86 tag wiki). It's probably faster than using integer IDIV on modern CPUs, but perhaps slower than compiler tricks for dividing by a compile-time constant 100. However, since you're using MUL, your input is probably unsigned, and that's actually pretty inconvenient to convert to double, at least on a 32-bit machine. On x86-64, you can just zero-extend a 32-bit integer to 64-bit and then treat that as a 64-bit signed integer.
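For the signed case, here's a minimal sketch of that CVTSI2SD / MULSD / CVTSD2SI sequence (my sketch, assuming SSE2 and a signed 32-bit input in eax; works the same in 32-bit or 64-bit mode):
# signed input in eax
cvtsi2sd xmm0, eax            # signed int32 -> double
mulsd    xmm0, [scale_factor] # * 0.65
cvtsd2si eax, xmm0            # double -> int32, rounded per MXCSR (default: nearest)

section .rodata
scale_factor: dq 0.65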
Put a C function that does what you want on http://gcc.godbolt.org/ and look at the asm output (with -O3).
Re: which is faster:

The naive integer way, for 32-bit mode, handling the full range of possible unsigned 32-bit inputs. (Not sure if you can go faster with 64-bit ops, since div r/m64 is much slower than div r/m32 on recent Intel CPUs: see Agner Fog's insn tables.)
# input in eax
mov ecx, 65
mul ecx # edx:eax = prod = eax*65
add eax, 50
adc edx, 0 # prod += 50
mov ecx, 100
div ecx # can't #DE because (65*0xFFFFFFFF+50)/100 is less than 2^32
# result in eax = 0.65 * input, rounded to nearest (the +50 makes halves round up, unlike FP's default ties-to-even)
* latency: mul r32 (4) + add (1) + adc (1) + div r32 (26) = 32 cycles. Note that the dependency chain from input to output includes EFLAGS carrying a data dependency from ADD to ADC. There's no parallelism except for the mov-immediate instructions. (And those could be replaced with memory operands to reduce the fused-domain uop count, but that's probably not a win.)
* total uops on SKL: mov (1) + mul r32 (3) + add (1) + adc (1) + mov (1) + div r32 (10: microcoded!) = 17 uops
* throughput: probably bottlenecked on DIV throughput, which is one per 6c on SKL (down from one per 9c on HSW, and one per 11-18c on SnB).
The FP way, for x86-64. (Or with minor changes, for signed integers on x86-32.) double can exactly represent every possible 32-bit integer, so we can get identical results.
# input in eax
mov edx, eax # zero-extend eax into rdx (skip this if you know the upper 32 bits of rax are already zero)
cvtsi2sd xmm0, rdx
mulsd xmm0, [scale_factor]
cvtsd2si rax, xmm0
# result in eax = input * 0.65, rounded with the current SSE rounding mode (default = nearest)
section .rodata
scale_factor: dq 0.65
C compiler output:
See source + asm output on the Godbolt Compiler explorer.
Simple way, not handling overflow, and using tricks instead of DIV:
// 65*a can overflow
unsigned scale_int_32bit(unsigned a) {
return (a * 65U + 50) / 100;
}
# clang3.9 -m32 -O3 output
mov eax, dword ptr [esp + 4]
# input in eax
mov edx, 1374389535 # magic constant: fixed-point reciprocal of 100 (ceil(2^37 / 100))
mov ecx, eax
shl ecx, 6 # 65 is 64 + 1
lea eax, [eax + ecx + 50] # eax = (a*65 + 50)
mul edx
shr edx, 5 # do the /100 with the multiplicative inverse: total shift = 32 (high half of mul) + 5 = 37
# result in edx
mov eax, edx
ret
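A quick sanity check on that magic constant (my arithmetic, not part of the compiler output): 1374389535 = ceil(2^37 / 100), and 100 * 1374389535 = 137438953500 = 2^37 + 28. That error of 28 (less than 2^5) is small enough that (x * 1374389535) >> 37 == x / 100 (truncating) for every 32-bit x. In the asm, mul edx leaves the high half of the 64-bit product in edx (an implicit >> 32), and shr edx, 5 supplies the remaining 5 bits of the >> 37.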
This works for 32-bit, while the FP way didn't.

* latency: shl (1) + lea (3: three-component LEA) + mul r32 (4) + shr (1) = 9 cycles (the mov ecx, eax can be eliminated at register rename). More uops than the FP way, but lower latency. Throughput might be similar.
See this answer for info about lrint(x) vs. (long)nearbyint(x): some compilers do better with one, some inline the other better.
unsigned scale_fp(unsigned a) {
return (a * 0.65);
// nearbyint or lrint to get round-to-nearest,
// but in the asm it's mostly just cvt instead of cvtt.
// return lrint(a * 0.65);
}
# clang3.9 -O3 -m32 -msse2
.LCPI0_0:
.quad 4841369599423283200 # double 4503599627370496
.LCPI0_1:
.quad 4604029899060858061 # double 0.65000000000000002
.LCPI0_2:
.quad 4746794007248502784 # double 2147483648
scale_fp: # @scale_fp
movsd xmm0, qword ptr [.LCPI0_0] # xmm0 = 2^52 (the u32 -> double trick constant)
movd xmm1, dword ptr [esp + 4] # xmm1 = input a, zero-extended into the low qword
orpd xmm1, xmm0 # bit pattern now encodes the double 2^52 + a
subsd xmm1, xmm0 # subtract 2^52: xmm1 = (double)a, with no signed convert needed
movsd xmm0, qword ptr [.LCPI0_2] # xmm0 = 2^31 as a double
mulsd xmm1, qword ptr [.LCPI0_1] # xmm1 = a * 0.65
movapd xmm2, xmm1
cvttsd2si ecx, xmm1 # signed conversion: correct if the product < 2^31
subsd xmm2, xmm0 # product - 2^31
cvttsd2si eax, xmm2 # convert the shifted value...
xor eax, -2147483648 # ...and flip the sign bit back: correct if product >= 2^31
ucomisd xmm1, xmm0 # was the product < 2^31?
cmovb eax, ecx # select whichever conversion was in range
ret
As you can see, converting double to/from the widest available unsigned integer sucks a lot. ICC and gcc use slightly different strategies. I picked clang's output because it looks shorter, and -fverbose-asm puts nice comments to tell you the double value of the FP constants.
This might still be faster than with DIV in 64-bit mode for uint64_t (because div r64 is much slower), but probably not for uint32_t in 32-bit mode. (Although note that double can't exactly represent every uint64_t. x87 fild / fistp do handle 64-bit integers even in 32-bit mode, and x87's 80-bit internal FP representation has a 64-bit mantissa, so it can exactly represent every int64_t; not sure about uint64_t, though.)
You can see x87 versions of this code by compiling with -m32 without -msse2. Clang seems to enable SSE2 by default even in 32-bit mode, so you could use -mno-sse2. (And then the rounding-mode changes add a lot of noise if you don't use lrint or nearbyint.)
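For reference, a minimal x87 sketch of the same scale-by-0.65 operation (my sketch, not compiler output; signed 32-bit input, and input / result / scale_factor are hypothetical memory operands):
# x87: no SSE2 needed, works in 32-bit mode
fild  dword [input]        # load signed int32 onto the x87 stack
fmul  qword [scale_factor] # st0 *= 0.65
fistp dword [result]       # store as int32, rounded per the x87 control word (default: nearest)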
Compiler output for a version that converts the inputs to 64-bit unfortunately sucks.
unsigned scale_int_64bit(unsigned a) {
// gcc, clang and icc don't realize they can do this with one DIV,
// without risk of #DE, so they call __udivdi3
return (a * 65ULL + 50) / 100;
}
Instead of using DIV, compilers call the libgcc function for 64b / 64b division. I'm pretty sure my logic is sound, and that a 64b / 32b => 32b DIV can't fault here, because of how we generated the input relative to the divisor: the quotient is guaranteed to fit in 32 bits. Probably compilers just didn't manage to prove that, or don't have a pattern that looks for opportunities to use a single DIV. __udivdi3 does a bunch of checks for the upper half being zero, so it probably would only end up doing a single DIV anyway (but only after significant branching).