> Does that mean that on real hardware adding any two `int64_t`s will take the same amount of time?
If the CPU has a 64-bit wide ALU, yes.
I qualify it like that because there are "modern" processors with 32-bit or smaller ALUs still being designed, largely for the embedded market.
> it would allow an attacker to gain some information about e.g. cryptographic keys from merely observing the encrypting/decrypting party.
I'm not sure timing-based side-channel attacks work the way your question's premise suggests. If 64-bit math on a given processor requires multiple instructions, as compared to a true 64-bit processor, then all integer math is slowed down uniformly across the entire algorithm, so all an attacker is going to learn is that the code is running on a less capable processor.
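To see why the slowdown is uniform, here is a minimal sketch of how a compiler might synthesize a 64-bit add from 32-bit halves. The function name and the exact decomposition are my illustrative assumptions, not what any particular compiler emits:

```c
#include <stdint.h>

/* Illustrative sketch: a 64-bit add built from 32-bit operations,
 * roughly what a compiler targeting a 32-bit ALU might generate.
 * There are no branches: every input costs the same, so the extra
 * work reveals nothing about the operand values themselves. */
static uint64_t add64_on_32bit_alu(uint64_t a, uint64_t b)
{
    uint32_t alo = (uint32_t)a, ahi = (uint32_t)(a >> 32);
    uint32_t blo = (uint32_t)b, bhi = (uint32_t)(b >> 32);

    uint32_t lo    = alo + blo;
    uint32_t carry = lo < alo;      /* carry out of the low word */
    uint32_t hi    = ahi + bhi + carry;

    return ((uint64_t)hi << 32) | lo;
}
```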
Where you do get side-channel leakage from instruction timing is where you have if/else branches and one branch takes longer than the other, so that, statistically, the attacker can probe to determine which inputs cause execution of more `if` clauses than `else` clauses, and thus glean some information about the key or whatever.
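The classic example is comparing a guess against a secret. A sketch, with hypothetical function names, contrasting a compare that leaks through early exit with a branch-free one:

```c
#include <stddef.h>
#include <stdint.h>

/* Leaky: returns at the first mismatch, so the run time reveals
 * how many leading bytes of the attacker's guess were correct. */
int leaky_compare(const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])   /* data-dependent branch: timing leak */
            return 0;
    }
    return 1;
}

/* Constant-time: always touches all n bytes and takes no
 * data-dependent branches, so timing is input-independent. */
int ct_compare(const uint8_t *a, const uint8_t *b, size_t n)
{
    uint8_t diff = 0;
    for (size_t i = 0; i < n; i++)
        diff |= a[i] ^ b[i];    /* accumulate any differences */
    return diff == 0;
}
```

The branch-free version is the pattern behind real constant-time comparison routines such as OpenSSL's CRYPTO_memcmp.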
> Will adding two `int32_t`s take half as long as adding two `int64_t`s?
Not necessarily. A 64-bit processor is likely to run both additions in the same time.
If you mean to ask if this will happen on a 32-bit processor, then the answer is "maybe yes", but really, that's something you need to look up in the processor data book. That will give you timing information per instruction.
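If you'd rather measure than read data books, a crude microbenchmark sketch follows. The loop count and the clock_gettime-based timing are my assumptions; a real measurement has to contend with compiler optimization, pipelining, and frequency scaling:

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define ITERS 100000000UL

/* volatile keeps the compiler from folding the loops away entirely;
 * this is a rough indicator, not a cycle-accurate count. */
static double time_adds_64(void)
{
    volatile int64_t acc = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < ITERS; i++)
        acc += (int64_t)i;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

static double time_adds_32(void)
{
    volatile int32_t acc = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 0; i < ITERS; i++)
        acc += (int32_t)i;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    printf("int64_t adds: %.3f s\n", time_adds_64());
    printf("int32_t adds: %.3f s\n", time_adds_32());
    return 0;
}
```

On a 64-bit processor, expect the two numbers to come out nearly identical, with the loop overhead dominating both.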
Your question specifies four different architectures, but you're missing at least one key arch (32-bit x86, still extant) and several other likely ones (e.g. MIPS). I'm not prepared to go through every likely processor manual and look this up for you.
> The Intel Software Developer manual for the `IMUL` instruction doesn't mention the actual algorithm used
No, but it should give the timing info in number of clock cycles.
It probably won't be stated that simply, because pipelining, caching and such also play into this.
> It'd be interesting if any other architectures (ARM, AArch64, POWER) use some different techniques than x86.
Sure. There are no hard-and-fast rules in this area.
For example, RISC processors like the ARM tend to take at least 4 instructions to do anything like a multiply because they require a read-calculate-store cycle, since all math has to happen in the processor's registers. (Read operand 1, read operand 2, multiply, store product.)
Contrast a CISC processor which often has memory addressing modes, where a multiply instruction may be coded as "multiply memory location A with memory location B and store in memory location C". The operands still have to be loaded into the CPU and multiplied, and the result still has to be stored, but it looks like a single instruction.
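To make the contrast concrete, here's a single C multiply with sketches, in the comments, of the instruction sequences each model might use. The mnemonics are illustrative assumptions, not the output of any real compiler:

```c
#include <stdint.h>

/* One C multiply; the comments sketch how each model might encode it. */
void multiply(const int32_t *a, const int32_t *b, int32_t *c)
{
    /* RISC (e.g. ARM), four instructions, all math in registers:
     *     ldr  r0, [a]        @ read operand 1
     *     ldr  r1, [b]        @ read operand 2
     *     mul  r2, r0, r1     @ multiply in registers
     *     str  r2, [c]        @ store the product
     *
     * CISC (e.g. VAX-style), one memory-to-memory instruction:
     *     MULL3  a, b, c      ; multiply A by B, store in C
     *
     * The loads, multiply, and store still happen either way;
     * the CISC encoding just hides them inside one instruction. */
    *c = *a * *b;
}
```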
The CISC model also masks things like DRAM reading delays, cache timing issues, etc., which the RISC model makes more explicit.
Once upon a time, processors were simple enough that you could answer such a question easily, but we've been past that point for a few decades now.