
How can I multiply by a float in assembly? I have a value in ebx and want to multiply it by 0.65.

mov eax, ebx
mov ecx, 0x0.65 ;; how to do this?
mul ecx
mov ebx, eax
Alexis
  • If you don't want to use floating point registers and instructions, you could multiply by 65, then divide by 100, rounding up if the remainder > 50, or if the remainder == 50 and the quotient is an odd number (see the C sketch just after these comments). – rcgldr Oct 22 '16 at 06:50
  • I will try this! If it works I'll post the code here. – Alexis Oct 22 '16 at 06:55
  • You can simplify the round for the remainder > 50 or == 50 case somewhat. After the divide, eax = quotient, edx = remainder: mov ecx,1, and ecx,eax, add edx,ecx, cmp edx,50, cmc, adc eax,0. – rcgldr Oct 22 '16 at 07:03
  • Google for "x86 floating point multiply" turns up a bunch of results that would point you to the instructions you need. See the [x86 tag wiki](http://stackoverflow.com/tags/x86/info) for insn set reference manual links. – Peter Cordes Oct 22 '16 at 07:09
  • Or just ask a compiler how, by writing a C function that does what you want. Then look at the resulting asm. – Peter Cordes Oct 22 '16 at 07:09
  • @rcgldr you can just add 50 to the number ahead of div 100 to have the rounding done without any tests. `result = ((eax * 65) + 50) / 100` – Ped7g Oct 22 '16 at 15:07
  • @Ped7g: Only if you're sure that `eax*65` won't overflow, or you do it using a full-multiply to produce a 64-bit result and a 64-bit dividend. As a C expression, that isn't what that means. – Peter Cordes Oct 22 '16 at 19:10
  • @Ped7g: also, CVTSI2SD / MULSD / CVTSD2SI is probably actually faster than using integer DIV. – Peter Cordes Oct 22 '16 at 19:35
  • @Ped7g - depends on what rounding rules you're using. What I mentioned was how to handle a remainder of 50, and one of the rules for this is to round up if the quotient is odd (which is effectively round to nearest even number). Assuming a uniform distribution of quotients, then you round up half of the time if the remainder is 50. – rcgldr Oct 22 '16 at 21:05
  • @rcgldr: I use standard math rules, where 0.5 simply rounds to 1. Your up/down treatment of 50 is interesting, but I never needed/saw that in my decades of programming, so its usefulness must be quite specialized. Thanks for the new insight; I overlooked it the first time since the additional logic looks heavy and I was too lazy to read it fully. – Ped7g Oct 23 '16 at 04:14
  • @Ped7g - round to even is a rule used in hardware, banking, and some programming languages. Do a web search for round to nearest even. A couple of links: [wiki rounding](http://en.wikipedia.org/wiki/Rounding#Round_half_to_even) [wiki nearest integer](http://en.wikipedia.org/wiki/Nearest_integer_function) – rcgldr Oct 23 '16 at 10:55
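
A minimal C sketch of the integer approach from these comments (function name is mine; it rounds halves to the nearest even quotient, per rcgldr's rule):

#include <stdint.h>

// scale by 65/100 in pure integer math, rounding ties to even
uint32_t scale_65_100(uint32_t a) {
    uint64_t prod = (uint64_t)a * 65;       // widen so 65*a can't overflow
    uint64_t q = prod / 100;
    uint64_t r = prod % 100;
    if (r > 50 || (r == 50 && (q & 1)))     // round up, ties to even
        q++;
    return (uint32_t)q;
}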

1 Answer


Do you want to treat your input as signed or unsigned? If it's signed, converting to double and back is actually easy and fast (with SSE2): look up CVTSI2SD / MULSD / CVTSD2SI in the insn set ref (links in the tag wiki).

It's probably actually faster than using integer IDIV on modern CPUs, but perhaps slower than compiler tricks for dividing by a compile-time constant 100.

However, since you're using MUL, your input is probably unsigned, so converting to double is actually pretty inconvenient, at least on a 32-bit machine. On x86-64, you can just zero-extend the 32-bit integer to 64 bits and then treat that as a 64-bit signed integer.
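
In C, that whole dance is just a pair of casts (a one-line sketch; the function name is my own):

#include <stdint.h>

// zero-extending to 64 bits makes any uint32_t value non-negative as a
// signed 64-bit integer, so the signed conversion (CVTSI2SD) is exact
double u32_to_double(uint32_t a) {
    return (double)(int64_t)(uint64_t)a;    // one cvtsi2sd on x86-64
}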

Put a C function that does what you want on http://gcc.godbolt.org/ and look at the asm output (compiled with -O3).


re: which is faster:

The naive integer way, for 32-bit mode, handling the full range of possible unsigned 32-bit inputs. (Not sure if you can go faster with 64-bit ops, since div r/m64 is much slower than div r/m32 on recent Intel CPUs: see Agner Fog's insn tables.)

# input in eax
mov   ecx, 65
mul   ecx          # edx:eax = prod = eax*65

add   eax, 50
adc   edx, 0       # prod += 50

mov   ecx, 100
div   ecx          # can't #DE because (65*0xFFFFFFFF+50)/100 is less than 2^32
# result in eax = 0.65 * input,  sort of rounded to nearest but not quite.
  • latency on Skylake: mul r32(4) + add(1) + adc(1) + div r32(26) = 32 cycles.

Note that the dependency chain from input to output includes EFLAGS carrying a data dependency from ADD to ADC. There's no parallelism except for the mov-immediate instructions. (Those could be replaced with memory operands to reduce the fused-domain uop count, but that's probably not a win.)
  • total uops on SKL: mov(1) + mul r32(3) + add(1) + adc(1) + mov(1) + div r32(10: microcoded!) = 17 uops
  • throughput: probably bottlenecked on DIV throughput, which is one per 6c on SKL (down from one per 9c on HSW, and one per 11-18c on SnB).
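
To sanity-check the "can't #DE" comment above, here's a tiny standalone check of the arithmetic in C (names are mine):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // worst case: input = 0xFFFFFFFF
    uint64_t max_dividend = 65 * 0xFFFFFFFFull + 50;  // 279172874225
    uint64_t quotient = max_dividend / 100;           // 2791728742
    assert(quotient <= 0xFFFFFFFFull);  // fits in 32 bits: div r32 is safe
    printf("max quotient = %llu\n", (unsigned long long)quotient);
    return 0;
}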


The FP way, for x86-64. (Or, with minor changes, for signed integers on x86-32.) double can exactly represent every possible 32-bit integer, so we can get identical results.

# input in eax
mov      edx, eax       # zero-extend if you're not sure that the upper bits of rax were zero
cvtsi2sd xmm0, rdx
mulsd    xmm0, [scale_factor]
cvtsd2si rax, xmm0
# result in eax = input * 0.65, rounded with the current SSE rounding mode (default = nearest)

section .rodata
scale_factor: dq 0.65
  • latency on Skylake: mov(0) + cvt(6) + mulsd(4) + cvt(6) = 16 cycles.
  • total uops on SKL: mov(1) + cvt(2) + mulsd(1) + cvt(2) = 6 uops.
  • throughput: probably 1 per 3 or maybe 2 cycles, IDK why CVTSI2SD has a throughput of one per 2c if it's really just one uop for p01 and one uop for p5. Maybe it uses an execution unit that isn't fully pipelined? Haswell is listed with even worse throughputs for CVTSI2SD.

C compiler output:

See the source + asm output on the Godbolt Compiler Explorer.

Simple way, not handling overflow, and using tricks instead of DIV:

// 65*a can overflow
unsigned scale_int_32bit(unsigned a) {
  return (a * 65U + 50) / 100;
}

# clang3.9 -m32 -O3 output
    mov     eax, dword ptr [esp + 4]

    # input in eax
    mov     edx, 1374389535         # magic constant: ceil(2^37 / 100), a fixed-point multiplicative inverse
    mov     ecx, eax
    shl     ecx, 6                  # 65 is 64 + 1
    lea     eax, [eax + ecx + 50]   # eax = (a*65 + 50)

    mul     edx
    shr     edx, 5                  # Do eax/100 with a multiplicative inverse
    # result in edx

    mov     eax, edx
    ret

This works for 32-bit, while the FP way didn't.

  • SKL latency: mov(0) + shl(1) + lea-with-3-components(3) + mul r32(4) + shr(1) = 9 cycles

More uops than the FP way, but lower latency. Throughput might be similar.
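
That magic constant is ceil(2^37 / 100) = 1374389535: a widening multiply plus a right shift by 37 divides any 32-bit value by 100 exactly (the MUL already shifts by 32 by leaving the high half in edx, and shr edx, 5 supplies the rest). In C the same trick looks like this (a sketch; function name is mine):

#include <stdint.h>

// x / 100 for any uint32_t x, via a fixed-point reciprocal:
// the rounded-up constant's error is small enough over the 32-bit range
uint32_t div100(uint32_t x) {
    return (uint32_t)(((uint64_t)x * 1374389535u) >> 37);
}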


See this answer for info about lrint(x) vs. (long)nearbyint(x): some compilers do better with one, some compilers inline the other better.

unsigned scale_fp(unsigned a) {
  return (a * 0.65);
  // nearbyint or lrint to get round-to-nearest,
  // but in the asm it's mostly just cvt instead of cvtt.
  // return lrint(a * 0.65);
}

# clang3.9 -O3 -m32 -msse2

.LCPI0_0:
    .quad   4841369599423283200     # double 4503599627370496
.LCPI0_1:
    .quad   4604029899060858061     # double 0.65000000000000002
.LCPI0_2:
    .quad   4746794007248502784     # double 2147483648
scale_fp:                           # @scale_fp
    movsd   xmm0, qword ptr [.LCPI0_0] # xmm0 = mem[0],zero
    movd    xmm1, dword ptr [esp + 4] # xmm1 = mem[0],zero,zero,zero
    orpd    xmm1, xmm0
    subsd   xmm1, xmm0
    movsd   xmm0, qword ptr [.LCPI0_2] # xmm0 = mem[0],zero
    mulsd   xmm1, qword ptr [.LCPI0_1]
    movapd  xmm2, xmm1
    cvttsd2si       ecx, xmm1
    subsd   xmm2, xmm0
    cvttsd2si       eax, xmm2
    xor     eax, -2147483648
    ucomisd xmm1, xmm0
    cmovb   eax, ecx
    ret

As you can see, converting double to/from the widest available unsigned integer sucks a lot. ICC and gcc use slightly different strategies. I picked clang's output because it looks shorter, and -fverbose-asm puts nice comments to tell you the double value of the FP constants.
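
Re-expressed in C, the two tricks clang is using look roughly like this (a sketch of the strategy with a branch where clang's version is branchless; not clang's actual source, and the names are mine):

#include <stdint.h>

// orpd/subsd pair: build 2^52 + a by ORing `a` into the mantissa of
// the double 2^52, then subtract 2^52 -- no signed convert needed
double u32_to_double_trick(uint32_t a) {
    union { uint64_t u; double d; } v;
    v.u = 0x4330000000000000ull | a;    // bit pattern of (double)2^52, plus a
    return v.d - 4503599627370496.0;    // subtract 2^52
}

// cvttsd2si/xor/cmov tail: convert double -> uint32 using only the
// signed conversion, rebasing values >= 2^31 by 2^31 first
uint32_t double_to_u32_trick(double x) {
    if (x < 2147483648.0)                           // fits in int32: direct
        return (uint32_t)(int32_t)x;
    return (uint32_t)(int32_t)(x - 2147483648.0) ^ 0x80000000u;
}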

This might still be faster than with DIV in 64-bit mode for uint64_t (because div r64 is much slower), but probably not for uint32_t in 32-bit mode. (Although note that double can't exactly represent every uint64_t. x87 fild / fistp do handle 64-bit integers even in 32-bit mode, and its 80-bit internal FP representation has a 64-bit mantissa (so it can exactly represent every int64_t; not sure about uint64_t though.))
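
As a concrete demo of that parenthetical (standalone; names are mine): 2^53 + 1 is the first positive integer a double can't represent, so it doesn't survive a round trip:

#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint64_t big = (1ull << 53) + 1;   // 9007199254740993
    double d = (double)big;            // rounds to 2^53
    printf("%llu -> %llu\n", (unsigned long long)big,
           (unsigned long long)(uint64_t)d);  // low bit is lost
    return 0;
}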

You can see x87 versions of this code by compiling with -m32 without -msse2. Clang enables that by default, it seems, so you could use -mno-sse2. (And then the rounding-mode changes add a lot of noise if you don't use lrint or nearbyint).


Compiler output for a version that converts the inputs to 64-bit unfortunately sucks.

unsigned scale_int_64bit(unsigned a) {
  // gcc, clang and icc don't realize they can do this with one DIV,
  // without risk of #DE, so they call __udivdi3
  return (a * 65ULL + 50) / 100;
}

Instead of using DIV, compilers call the libgcc function for 64b / 64b division. I'm pretty sure my logic is sound, and that 64b / 32b = 32b DIV can't fault, because of how we generated the input relative to the divisor, so the quotient will fit in 32 bits. It's probably just that compilers didn't manage to prove that, or don't have a pattern to look for opportunities to do that. __udivdi3 does a bunch of checks for the upper half being zero, so it probably would only end up doing a single DIV anyway (but only after significant branching).
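
For reference, the single-DIV version the compilers won't emit can be forced with GNU C extended asm (an x86-only sketch under that assumption; names are mine). It's safe because the dividend is at most 65*0xFFFFFFFF + 50 < 2^39, so the quotient fits in 32 bits and DIV can't #DE:

#include <stdint.h>

unsigned scale_one_div(unsigned a) {
    uint64_t prod = a * 65ULL + 50;
    uint32_t quot, rem;
    __asm__ ("divl %[divisor]"          // edx:eax / divisor -> eax, edx
             : "=a"(quot), "=d"(rem)
             : "a"((uint32_t)prod), "d"((uint32_t)(prod >> 32)),
               [divisor] "rm"(100u));
    return quot;
}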

Peter Cordes
  • Don't modern CPUs use a fast algorithm for integer division, something similar to Newton–Raphson for 1/divisor: normalizing, a table lookup, and 4 internal multiplies for 32-bit numbers? As you mentioned, a compiler might optimize dividing by a constant into a single multiply and shift. – rcgldr Oct 22 '16 at 21:12
  • @rcgldr: DIV is still slower than converting to FP and back, unless you have to jump through hoops to go from/to the widest unsigned type the machine supports. I was going to just comment, then decided to update my answer, and then realized I'd just spent ~1.5 hours writing that :P – Peter Cordes Oct 22 '16 at 23:24