> Does that mean that on real hardware adding any two `int64_t`s will take the same amount of time?
If the CPU has a 64-bit wide ALU, yes.
I qualify it like that because there are "modern" processors with 32-bit or smaller ALUs still being designed, largely for the embedded market.
> it would allow an attacker to gain some information about e.g. cryptographic keys from merely observing the encrypting/decrypting party.
I'm not sure timing-based side-channel attacks work the way your question's premise suggests. If 64-bit math on a given processor requires multiple instructions, as compared to a true 64-bit processor, then all integer math is slowed down uniformly across the entire algorithm, so all an attacker is going to learn is that the code is running on a less capable processor.
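To see why the slowdown is uniform, here is a minimal sketch of how a compiler might synthesize a 64-bit add from 32-bit halves. The function name and the exact decomposition are my illustrative assumptions, not what any particular compiler emits:

```c
#include <stdint.h>

/* Illustrative sketch: a 64-bit add built from 32-bit operations,
 * roughly what a compiler targeting a 32-bit ALU might generate.
 * There are no branches: every input costs the same, so the extra
 * work reveals nothing about the operand values themselves. */
static uint64_t add64_on_32bit_alu(uint64_t a, uint64_t b)
{
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);

    uint32_t lo    = alo + blo;
    uint32_t carry = lo < alo;      /* carry out of the low word */
    uint32_t hi    = ahi + bhi + carry;

    return ((uint64_t)hi << 32) | lo;
}
```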
Where you do get side-channel leakage from instruction timing is where you have if/else branches and one branch takes longer than the other, so that, statistically, the attacker can probe to determine which inputs cause execution of more `if` clauses than `else` clauses, and thus glean some information about the key or whatever.
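The classic example is comparing a guess against a secret. A sketch, with hypothetical function names, contrasting a compare that leaks through early exit with a branch-free one:

```c
#include <stddef.h>
#include <stdint.h>

/* Leaky: returns at the first mismatch, so the run time reveals
 * how many leading bytes of the attacker's guess were correct. */
int leaky_compare(const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])   /* data-dependent branch: timing leak */
            return 0;
    }
    return 1;
}

/* Constant-time: always touches all n bytes and takes no
 * data-dependent branches, so timing is input-independent. */
int ct_compare(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= a[i] ^ b[i];    /* accumulate any differences */
    return diff == 0;
}
```

The branch-free version is the pattern behind real constant-time comparison routines such as OpenSSL's CRYPTO_memcmp.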
> Will adding two `int32_t`s take half as long as adding two `int64_t`s?
Not necessarily. A 64-bit processor is likely to run both additions in the same time.
If you mean to ask if this will happen on a 32-bit processor, then the answer is "maybe yes", but really, that's something you need to look up in the processor data book. That will give you timing information per instruction.
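If you'd rather measure than read data books, a crude microbenchmark sketch follows. The loop count and the clock_gettime-based timing are my assumptions; a real measurement has to contend with compiler optimization, pipelining, and frequency scaling:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

/* volatile keeps the compiler from folding the loops away entirely;
 * this is a rough indicator, not a cycle-accurate count. */
static double time_adds_64(void)
{
    volatile int64_t acc = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < ITERS; i++)
        acc += (int64_t)i;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

static double time_adds_32(void)
{
    volatile int32_t acc = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < ITERS; i++)
        acc += (int32_t)i;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("int64_t adds: %.3f s\n", time_adds_64());
    printf("int32_t adds: %.3f s\n", time_adds_32());
    return 0;
}
```

On a 64-bit processor, expect the two numbers to come out nearly identical, with the loop overhead dominating both.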
Your question specifies four different architectures, but you're missing at least one key arch (32-bit x86, still extant) and several other likely ones (e.g. MIPS). I'm not prepared to go through every likely processor manual and look this up for you.
> The Intel Software Developer manual for the `IMUL` instruction doesn't mention the actual algorithm used
No, but it should give the timing info in number of clock cycles.
It probably won't be stated that simply, because pipelining, caching and such also play into this.
> It'd be interesting if any other architectures (ARM, AArch64, POWER) use some different techniques than x86.
Sure. There are no hard-and-fast rules in this area.
For example, RISC processors like the ARM tend to take at least 4 instructions to do anything like a multiply because they require a read-calculate-store cycle, since all math has to happen in the processor's registers. (Read operand 1, read operand 2, multiply, store product.)
Contrast a CISC processor which often has memory addressing modes, where a multiply instruction may be coded as "multiply memory location A with memory location B and store in memory location C". The operands still have to be loaded into the CPU and multiplied, and the result still has to be stored, but it looks like a single instruction.
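To make the contrast concrete, here's a single C multiply with sketches, in the comments, of the instruction sequences each model might use. The mnemonics are illustrative assumptions, not the output of any real compiler:

```c
#include <stdint.h>

/* One C multiply; the comments sketch how each model might encode it. */
void multiply(const int32_t *a, const int32_t *b, int32_t *c)
{
    /* RISC (e.g. ARM), four instructions, all math in registers:
     *     ldr  r0, [a]        @ read operand 1
     *     ldr  r1, [b]        @ read operand 2
     *     mul  r2, r0, r1     @ multiply in registers
     *     str  r2, [c]        @ store the product
     *
     * CISC (e.g. VAX-style), one memory-to-memory instruction:
     *     MULL3  a, b, c      ; multiply A by B, store in C
     *
     * The loads, multiply, and store still happen either way;
     * the CISC encoding just hides them inside one instruction. */
    *c = *a * *b;
}
```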
The CISC model also masks things like DRAM reading delays, cache timing issues, etc., which the RISC model makes more explicit.
Once upon a time, processors were simple enough that you could answer such a question easily, but we've been past that point for a few decades now.