Why benchmark performance of 64bits multiplication is comparable to 32bits multiplication?

Question

This is extension of this question. I am using rust for these benchmarks. Performance of 64bits multiplication is equal to 32bit multiplication. IN previous question people suggested to use benchmarking and after using that I am still getting same performance. Please note I am new with rust so please let me know if I am doing anything wrong. Here is my simple bench marking file

use criterion::{black_box, criterion_group, criterion_main, Criterion};

pub fn test_64_mul(test_num: u64){    
    for _ in 1..20000{
        let mut _prod = test_num as u128 * test_num as u128;
 
    }
}

pub fn test_32_mul(test_num: u32){
    for _ in 1..20000{
        let mut _prod = test_num as u64 * test_num as u64;
 
    }
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("mul 64", |b| b.iter(|| test_64_mul(black_box(12345678653435363454))));
    c.bench_function("mul 32", |b| b.iter(|| test_64_mul(black_box(1234565755))));
}

criterion_group!(benches, criterion_benchmark);
criterion_main!(benches);

Now when I do cargo bench output is:

mul 64 time: [312.47 ps 312.66 ps 312.93 ps]

mul 32 time: [312.56 ps 312.75 ps 312.99 ps]

score 0 · Answer 1 · answered May 25 '22 at 08:36

0

You are using criterion incorrectly. The compiler will still eliminate the multiplication.

For it to work, you need the result of the functions to depend on the arguments to them. The easiest way is to write use one round and let criterion worry about the rest:

use criterion::{black_box, criterion_group, criterion_main, Criterion};

pub fn test_64_mul(test_num: u64) -> u128 {
    test_num as u128 * test_num as u128
}

pub fn test_32_mul(test_num: u32) -> u64 {
    test_num as u64 * test_num as u64
}

fn criterion_benchmark(c: &mut Criterion) {
    c.bench_function("mul 64", |b| b.iter(|| test_64_mul(black_box(12345678653435363454))));
    c.bench_function("mul 32", |b| b.iter(|| test_64_mul(black_box(1234565755))));
}

See the user guide.

answered May 25 '22 at 08:36

Chayim Friedman

47,971
5
48
77

I copy pasted your code into my benchmark file and still got the same result – CryptoKitty May 25 '22 at 08:42
Although I agree it looks like compiler is still just ignoring it because result is same between my file and your file without loop – CryptoKitty May 25 '22 at 08:43
@muhammadharis You need to inspect the assembly. – Chayim Friedman May 25 '22 at 08:44
I am not very good with assembly , there is no other way to turn off these compiler optimizations at least for testing? – CryptoKitty May 25 '22 at 08:46
@muhammadharis Turning off compiler optimizations will give completely meaningless benchmarking. But you can ask us here to help you if you provide the assembly. – Chayim Friedman May 25 '22 at 08:49
1

@muhammadharis Note that your basic assumption, that 128-bit multiplication is slower, may be false. For instance, it is false on x86. – Chayim Friedman May 25 '22 at 08:51
Can you please explain why it could be false because for 64bit muls we have to do something like this https://stackoverflow.com/a/28904636/4612998 and also here is the assembly https://gcc.godbolt.org/z/oa9jEbGrj – CryptoKitty May 25 '22 at 08:54
1

@muhammadharis Your assembly is unoptimized. [This](https://gcc.godbolt.org/z/vjz5YaYqh) is the correct one. As you can see, the difference is `mul` (unsigned) vs `imul` (signed, irrelevant in this case) and 64 vs 32 bit operand width. In x86, the mul instructions has wide results - they return value in **two** registers, causing `i128` multiplication to be just one instruction. The fact that the C code required to do the same thing is longer does not matter: C does not have a 128 bit integer, and the compiler may optimize this out (or may not). – Chayim Friedman May 25 '22 at 08:58
1

ARM does use multiple instructions for wide multiplication, but it is very possible that they are fused into 1μop. I'm not proficient enough in ARM to be sure. – Chayim Friedman May 25 '22 at 08:59
just to confirm that I get it , you saying that in x86 multiplication of two 64 bit numbers has wide results but still one instruction that uses same number of cycles as 32 bit numbers? I feel at least 32 bits multiplication should be slightly cheaper because we multiplying 32 bits vs 64 bits? – CryptoKitty May 25 '22 at 09:14
@muhammadharis This does not make a difference at the hardware level. And yes, [Agner Fog's optimization manual](https://www.agner.org/optimize/instruction_tables.pdf) claim that the `mul` is 2μop, the `imul` is one, but both are three cycles latency (on Ice Lake). – Chayim Friedman May 25 '22 at 09:19
@muhammadharis: It all comes down to how the hardware and microops is implemented. However the general rule is, that as long as the operands do fully fit into native registers, multiplication complexity is independent of the width. On the µOP level the unused bits of shorter multiplications will be masked out. But on the silicon level there's just one multiplication unit that's used for all the different variants and its wiring doesn't care to which length the result is truncated. The bulk of the µOPs stems from implementing the quirky details of the ISA. – datenwolf May 25 '22 at 09:33
1

@muhammadharis: If you want a more hands-on understanding of why that is, I suggest as an exercise to implement the logic of a multiplication unit yourself, targeting an FPGA or a simulator. Look for texts on implementing an ALU in VHDL or Verilog. – datenwolf May 25 '22 at 09:38

Why benchmark performance of 64bits multiplication is comparable to 32bits multiplication?

1 Answers1