
Is the processing time of an integer multiplication the same as that of any other integer binary operation on a modern CPU with pipelining (e.g. Intel, ARM)?

Intel's assembly documentation says that an integer multiplication takes 1 cycle, like any other integer binary operation. Is this one cycle the effective duration per instruction, assuming the operations are pipelined?

asked by chmike
  • What documentation told you that `imul` "takes 1 cycle"? It has 1/clock *throughput*, vs. 4/clock throughput for simple ops like add/sub/and/or/xor. But it also has 3-cycle latency. The cost model for a superscalar out-of-order exec CPU is more complicated than assigning a single number as the cost for an instruction or block; you can't just add up a single number for each instruction to find total cost. [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) (A sketch after these comments demonstrates the latency/throughput difference.) – Peter Cordes Nov 19 '20 at 13:12
  • Also [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/q/692718) similarly debunks the idea that it ever makes sense to say something "takes n cycles" without more context. [x86_64: is IMUL faster than 2x SHL + 2x ADD?](https://stackoverflow.com/a/37925245) covers `imul reg,reg` on x86 CPUs. [RDTSCP in NASM always returns the same value](https://stackoverflow.com/q/54621381) shows how to time a single instruction, like https://uops.info/ does for every x86 instruction on various uarches. – Peter Cordes Nov 19 '20 at 13:14
  • Also http://www.lighterra.com/papers/modernmicroprocessors/ is essential reading. – Peter Cordes Nov 19 '20 at 13:22
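
The latency-vs-throughput distinction is easy to see in a microbenchmark. The following C sketch is hypothetical (not from the thread; POSIX timing): the first loop forms a serial dependency chain, so it runs at the multiplier's latency; the second runs four independent chains, so it approaches the multiplier's throughput. Compile with optimization (e.g. `gcc -O2`); exact numbers vary by compiler and microarchitecture.

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(int argc, char **argv) {
    (void)argv;
    const int N = 100 * 1000 * 1000;
    /* odd multiplier the compiler cannot fold at compile time */
    uint64_t m = 0x9E3779B97F4A7C15ULL + 2u * (uint64_t)argc;
    uint64_t a = 1, b = 1, c = 1, d = 1;

    double t0 = now_sec();
    for (int i = 0; i < N; i++)
        a *= m;                      /* serial: each multiply waits on the last */
    double t1 = now_sec();
    for (int i = 0; i < N; i++) {    /* four independent chains can overlap */
        a *= m; b *= m; c *= m; d *= m;
    }
    double t2 = now_sec();

    printf("latency-bound:    %.3f ns per multiply\n", (t1 - t0) / N * 1e9);
    printf("throughput-bound: %.3f ns per multiply\n", (t2 - t1) / (4.0 * N) * 1e9);
    printf("(keep results live: %llu)\n", (unsigned long long)(a ^ b ^ c ^ d));
    return 0;
}
```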

2 Answers


There is more than the cycle count to consider:

  • latency
  • pipelining

While simple ALU instructions typically produce their results in a single cycle, multiply instructions have to go through the MAC (multiply-accumulate) unit, which usually costs more cycles and delivers its result with a latency of multiple cycles.
And often there is only one MAC unit, which means the core cannot dual-issue two multiply instructions. (The sketch after the ARMv5E numbers below shows one way that latency can be hidden.)

Example (ARMv5E):

  • `smulxy` (16-bit): one cycle, plus three cycles of result latency
  • `mul` (32-bit): two cycles, plus three cycles of result latency
  • `umull` (64-bit): three cycles, plus four (lower half) and five (upper half) cycles of result latency
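
To make the latency point concrete, here is a minimal C sketch (hypothetical, not from the answer) of software-pipelining a 16-bit multiply loop: instead of consuming each product immediately, which would stall for the multiplier's result latency, the loop starts the next multiply first. A good compiler schedules code this way on its own; the source just makes the idea visible.

```c
#include <stdint.h>

/* Dot product of 16-bit values, written so that the result of each multiply
 * is not consumed by the very next operation. On an in-order core with a
 * ~3-cycle multiply result latency, this gives the MAC unit time to finish. */
int32_t dot16(const int16_t *x, const int16_t *y, int n) {
    if (n <= 0) return 0;
    int32_t acc = 0;
    int32_t prod = (int32_t)x[0] * y[0];     /* first multiply in flight */
    for (int i = 1; i < n; i++) {
        int32_t next = (int32_t)x[i] * y[i]; /* issue the next multiply ... */
        acc += prod;                         /* ... while adding the old one */
        prod = next;
    }
    return acc + prod;
}
```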

answered by Jake 'Alquimista' LEE
  • This is a very accessible answer. The multiply implementation can include lookup tables and interpolation methods at a very high cost in silicon, but this brings it near the performance of an adder. For ARM32, there are instruction forms that multiply by fixed constants, e.g. `add r0, r2, r2, lsl #2` is a multiply by 5 (a small C illustration follows). Another issue is power: the integer add will always take less die area and will use less power. Combined with single vs. multiple MAC units, dual issue, etc., it is very much dependent on the details of the specific CPU to answer this question completely. – artless noise Nov 19 '20 at 11:03
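
As a small illustration of that strength reduction (hypothetical C, not from the comment): both functions below compute x * 5, and compilers typically emit the shift-add form for either one (ARM: `add r0, r0, r0, lsl #2`; x86: a single `lea`).

```c
#include <stdint.h>

/* Multiply by the constant 5, written two ways; with optimization enabled
 * a compiler usually lowers both to the same shift-and-add sequence. */
uint32_t times5_mul(uint32_t x)   { return x * 5u; }
uint32_t times5_shift(uint32_t x) { return (x << 2) + x; }
```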

No, multiply is much more complicated than XOR, ADD, OR, NOT, etc. While binary makes it much easier than base 10, you still need a much larger adder array than for a simple two-operand ADD or other two-operand operation.

Take the bits abcd

    abcd
  * 1011
 ========
    abcd
   abcd.
  0000..
+abcd...
=========

In base 10, as in grade school, you had to multiply digit by digit. You are still multiplying here, but only by one or zero, so each partial product is either a shifted copy of the first operand or a shifted row of zeros. And it gets very big: the additions cascade. Look up the full adder (Wikipedia shows it built from XOR and other gates, or just google it). A single-column adder for a simple two-operand add has three inputs and two outputs, and the carry out of one bit is the carry in of the next. No logic is instantaneous; even a single-transistor inversion (NOT) takes a non-zero amount of time. You can start to think about how many gates are chained just to make one 32-bit two-operand ADD, and then think about a 32-bit multiply, where 32 shifted partial products (plus some number of carry bits) all have to be summed, all of it cascaded. The chip real estate and the time to settle grow almost exponentially for multiply, and you then start to worry about whether you can meet timing (whether the msbit of the result can settle within the desired/designed clock period). A C sketch of this shift-and-add process follows.
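
Here is a minimal C sketch (for illustration, not how hardware does it) of the shift-and-add process drawn above; hardware generates all the partial products at once and sums them through cascaded adders, while this loop handles one partial product per iteration.

```c
#include <stdint.h>

/* Shift-and-add multiply: each set bit of b contributes a copy of a,
 * shifted left into that bit's position, exactly like the rows in the
 * abcd * 1011 diagram. */
uint32_t shift_add_mul(uint32_t a, uint32_t b) {
    uint32_t product = 0;
    while (b) {
        if (b & 1u)     /* multiplier bit is 1: add the shifted multiplicand */
            product += a;
        a <<= 1;        /* align the next partial product */
        b >>= 1;        /* move to the next multiplier bit */
    }
    return product;
}
```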

So you will see optimizations, including giving the multiplier multiple pipe stages: not 32 clocks to do a 32-bit multiply, but maybe not one clock either, perhaps two or four. With a pipe a dozen stages deep, though, you can bury that in there and still meet an advertised average of one clock per instruction.

For Intel, ARM, etc., the one-cycle figure is an illusion: the math operation itself might take that long, but execution of the instruction takes a few to a handful of cycles, and pipe depths may be several to a dozen or more. There is limited use in attempting to count cycles these days. And feeding the pipe and handling memory operations tend to dominate performance, not the pipe/instructions themselves, outside a carefully crafted simulation of the core.

For the Cortex-M cores, which are perhaps not what you are asking about but are very much part of our daily lives, you can see in the documentation that it is the chip vendor who chooses between the larger, faster multiplier and the slower, smaller one that helps with overall chip size and perhaps clock rate. (I do not examine the Cortex-A docs as much, as I do not use them as often.) It is a compile-time option when they compile the core, and there are many compile-time options, which is why, for any ARM core (Cortex-M or Cortex-A), you cannot compare, say, two Cortex-M4s from different vendors, or from different chip families within one vendor: they could have been compiled differently and will behave/perform differently (they still execute the enabled instructions in the same functional way, of course).

So no, you cannot assume the "execution time" or "cycle time" of ANY instruction, and in particular instructions like multiply and divide and anything floating-point cannot be assumed to be single-cycle. Yes, like all the other instructions, the advertised one cycle is based on pipeline effects: no instruction takes one cycle start to finish, and depending on the pipe depth of the design, the multiply and divide may take more than one clock yet be hidden by the pipe so as to still average one clock per instruction.

Note that this question is "too broad": there are many Intel and ARM implementations, past and present. And chip implementation details are often unavailable or protected by NDA; all you have, if anything, are public documents that can hide the reality.

answered by old_timer
  • So the conclusion is that a multiplication is more expensive than any other binary operation. – chmike Nov 19 '20 at 10:44
  • divide is more expensive than multiply, but multiply is more expensive than ADD, AND, XOR, yes, both in time to settle and in power consumption. But it is the specific implementation of the core in a chip that determines whether it settles in one clock and therefore actually takes one clock, or whether they chose to make it 2 or 3 or 4, etc., clocks. – old_timer Nov 19 '20 at 10:53
  • You have to examine each core in each chip for a definitive answer. Can I drive from my house to my neighbor's house in one day? Yes. Can I drive from my house to the next town in one day? For most of us, yes. Can I drive from one side of the country to the other in one day? Probably not. So of house-to-house and house-to-town, one is much more expensive (energy cost) but can still be done within the defined time period. Do you consider that more expensive or not, by your definition? – old_timer Nov 19 '20 at 10:56
  • A read from memory can take dozens to hundreds of clock cycles (even with caches), significantly worse than a multiply. Do you consider that more or less expensive than a multiply? I have no idea what you mean by expensive. Even the fastest read should take longer than a multiply on most systems. (A sketch below illustrates measuring that load latency.) – old_timer Nov 19 '20 at 10:57
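
To put a number on that, here is a minimal pointer-chasing sketch in C (hypothetical, not from the thread; POSIX timing). Each load depends on the previous one, so the time per iteration approximates the load-to-use latency of whatever level of the memory hierarchy the chain lands in, typically far more than a multiply once the working set exceeds the caches.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void) {
    const size_t n = (size_t)1 << 24;       /* 16M nodes, ~128 MiB on LP64 */
    size_t *next = malloc(n * sizeof *next);
    size_t *perm = malloc(n * sizeof *perm);
    if (!next || !perm) return 1;

    for (size_t i = 0; i < n; i++) perm[i] = i;
    uint64_t r = 0x9E3779B97F4A7C15ULL;     /* tiny xorshift PRNG */
    for (size_t i = n - 1; i > 0; i--) {    /* Fisher-Yates shuffle */
        r ^= r << 13; r ^= r >> 7; r ^= r << 17;
        size_t j = (size_t)(r % (i + 1));
        size_t t = perm[i]; perm[i] = perm[j]; perm[j] = t;
    }
    for (size_t i = 0; i < n; i++)          /* one big cycle in random order */
        next[perm[i]] = perm[(i + 1) % n];
    free(perm);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t p = 0;
    for (size_t i = 0; i < n; i++)
        p = next[p];                        /* serial chain of dependent loads */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("~%.1f ns per dependent load (p=%zu)\n", ns / (double)n, p);
    free(next);
    return 0;
}
```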