39

Are native 64-bit integer arithmetic instructions slower than their 32-bit counterparts (on an x86_64 machine with a 64-bit OS)?

Edit: On current CPUs such as Intel Core 2 Duo, i5/i7, etc.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Cartesius00
  • 23,584
  • 43
  • 124
  • 195
  • 7
    @Cody: Oh really? You claim 64-bit integer divide is as fast as 32-bit integer divide? – Ben Voigt Jan 20 '12 at 23:03
  • 3
    You're both right. Read David Schwartz's explanation below. Executing an instruction in the CPU'S ALU is one thing. Getting the operands into the CPU, and getting the result back out of the CPU, is another thing. – paulsm4 Jan 20 '12 at 23:07
  • Related: [The advantages of using 32bit registers/instructions in x86-64](https://stackoverflow.com/q/38303333) – Peter Cordes Feb 02 '21 at 03:25

3 Answers

55

It depends on the exact CPU and operation. On 64-bit Pentium 4s, for example, multiplication of 64-bit registers was quite a bit slower. Core 2 and later CPUs have been designed for 64-bit operation from the ground up.

Generally, even code written for a 64-bit platform uses 32-bit variables where values will fit in them. This isn't primarily because arithmetic is faster (on modern CPUs, it generally isn't) but because it uses less memory and memory bandwidth.

A structure containing a dozen integers will be half the size if those integers are 32-bit than if they are 64-bit. This means it will take half as many bytes to store, half as much space in the cache, and so on.

64-bit native registers and arithmetic are used where values may not fit into 32 bits. But the main performance benefits come from the extra general purpose registers available in the x86_64 instruction set. And of course, there are all the benefits that come from 64-bit pointers.

So the real answer is that it doesn't matter. Even if you use x86_64 mode, you can (and generally do) still use 32-bit arithmetic where it will do, and you get the benefits of larger pointers and more general purpose registers. When you use 64-bit native operations, it's because you need 64-bit operations, and you know they'll be faster than faking it with multiple 32-bit operations -- your only other choice. So the relative performance of 32-bit versus 64-bit registers should never be a deciding factor in any implementation decision.

David Schwartz
  • 179,497
  • 17
  • 214
  • 278
  • Thank you, but I was here concerned only about CPU cycles. Cache-miss issues and similar are absolutely OK, but that's another story. – Cartesius00 Jan 20 '12 at 23:05
  • Then my first sentence answers your question. But my larger point is that it doesn't matter. 64-bit arithmetic isn't used where 32-bit arithmetic would do. So the relative performance should never be a determining factor in any decision. – David Schwartz Jan 20 '12 at 23:07
  • Do you have any concrete examples? Let's say on Core2 Duo. Or link? – Cartesius00 Jan 20 '12 at 23:08
  • 3
    See particularly page 47 of this [instruction timing table](http://www.agner.org/optimize/instruction_tables.pdf). – David Schwartz Jan 20 '12 at 23:14
  • Nice answer. In general I would expect multiplication and division to be considerably slower, and everything else to be the same speed. – R.. GitHub STOP HELPING ICE Jan 21 '12 at 00:11
  • 1
    I am implementing a multi-threaded prime sieve, and am using a bitfield to represent the results. I arrived at this site wondering what size integers to use as the underlying storage, and came away without a meaningful answer. "Relative performance should never be a determining factor in any decision" is false in this particular case. I chose 32-bit, to reduce the number of potential collisions when multi-threading, and to reduce memory bandwidth. However, 64-bit may have ultimately been more efficient since I am also using popcount and bitscan operations to iterate/locate results. – Jeff G May 29 '17 at 18:47
11

I just stumbled upon this question, but I think one very important aspect is missing here: if you really look down into the assembly code, using the type 'int' for indices will likely slow down the code your compiler generates. This is because 'int' defaults to a 32-bit type on many 64-bit compilers and platforms (Visual Studio, GCC), and doing address calculations with pointers (which are necessarily 64-bit on a 64-bit OS) and 'int' will cause the compiler to emit unnecessary conversions between 32- and 64-bit registers. I've just experienced this in a very performance-critical inner loop of my code. Switching from 'int' to 'long long' as the loop index improved my algorithm's run time by about 10%, which was quite a huge gain considering the extensive SSE/AVX2 vectorization I was already using at that point.
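The pattern described above can be sketched as follows (function names are illustrative; whether the width-extension instructions actually survive depends on the compiler, optimization level, and the index's signedness, since signed-overflow rules often let the compiler hoist the extension out of the loop):

```c
#include <stddef.h>

/* With a 32-bit index, each a[i] access may require extending i to
   64 bits before the address calculation on an x86-64 target. */
double sum_int_index(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)        /* 32-bit index */
        s += a[i];
    return s;
}

/* With a pointer-width index, no extension is needed. */
double sum_wide_index(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)     /* 64-bit index on LP64 */
        s += a[i];
    return s;
}
```

Inspecting the generated assembly (e.g. with `gcc -O2 -S`) is the reliable way to confirm whether the conversions appear in any given loop.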

rvjr
  • 109
  • 1
  • 4
  • Could you provide access to that specific example? I'm currently in the same dilemma (heavy AVX2 vectorization, etc.) and I'm unsure whether it's worth it to, in addition, pay attention to the type of loop indices. – étale-cohomology Aug 31 '16 at 22:40
1

In a primarily 32-bit application (meaning only 32-bit arithmetic is used, and 32-bit pointers are sufficient), the real benefits of the x86-64 architecture are the other "updates" AMD made to the architecture:

  • 16 general-purpose registers, up from 8 in x86
  • RIP-relative addressing mode
  • others...

This is evidenced by the new x32 ABI implemented in Linux.

Jonathon Reinhart
  • 132,704
  • 33
  • 254
  • 328