
Is a 32-bit comparison faster than a 64-bit comparison?

I was looking at this file http://www.netlib.org/fdlibm/s_cos.c

It contains this piece of code:

    /* |x| ~< pi/4 */
    ix &= 0x7fffffff;
    if(ix <= 0x3fe921fb) return __kernel_cos(x,z);

I understand the first line, which clears the sign bit and thus effectively takes the absolute value of x. But why is the comparison so complicated? Is there any performance gain from comparing only the first 32 bits rather than all 64? Could I write

    long unsigned int ix = *(long unsigned int *)(&x);
    ix &= 0x7fffffffffffffff;
    if (ix < 0x3fe921fb54442d18)
        /* What comes next */ ;

and expect the same performance in terms of speed on a 64-bit machine? Though I agree this would consume more memory.

0x3fe921fb54442d18 is the bit pattern of pi/4 as an IEEE-754 double (its high 32 bits are the 0x3fe921fb compared above).
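This constant is easy to verify (a quick sketch; `double_to_bits` is a helper name I made up, using `memcpy` rather than a pointer cast to stay well-defined):

```c
#include <stdint.h>
#include <string.h>

/* Read a double's bit pattern without violating strict aliasing. */
static uint64_t double_to_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits); /* assumes IEEE-754 binary64 */
    return bits;
}

/* double_to_bits(3.141592653589793 / 4.0) == 0x3fe921fb54442d18,
   and its high 32 bits are the 0x3fe921fb compared in fdlibm. */
```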

user16217248
Sayan
  • First thing to note is that `x` is a `double`. The bit pattern is not the same as an integer. Comparisons may or may not be faster, but that has nothing to do with this code. – OrangeDog Apr 16 '21 at 11:40
  • 4
    It depends upon the target architecture and processor. – Basile Starynkevitch Apr 16 '21 at 11:41
  • 1
    Things have likely changed since that code was written (1995). Direct testing of `fabs(x)` will likely be (almost) as efficient on modern chips. – Adrian Mole Apr 16 '21 at 11:45
  • Atomic operations regardless of size probably do not make a difference in speed. Comparing eax,edx would be the same speed as comparing rax,rdx, although rax,rdx might actually be faster because of the internal pipeline of a 64-bit chip. – Irelia Apr 16 '21 at 11:46
  • 2
    As Adrian Mole noted, this is very old code (Copyright 1993). In particular, it's way older than 64 bit CPU's. Comparing 64 bits on a 32 bit CPU is indeed slower, for obvious reasons. – MSalters Apr 16 '21 at 14:29
  • @Nina: Unlikely, any x64 has hardware to handle `eax`. The really interesting alias is `ah`, which is bits 8-15 of `rax`. (not starting at the LSB). – MSalters Apr 16 '21 at 14:32
  • The comparison in the code is against the high 32 bits of the double representation of pi/4. It is used in the range reduction of the input to the cosine function, so `__kernel_cos` is only called directly for inputs with |x| at most about pi/4. Comparing 32 bits can be done in a single instruction with a 32-bit immediate on most processors, while a full 64-bit comparison may need extra instructions to materialize the constant. The comparison you proposed would work on a 64-bit machine, but it would use more code space and may not be faster. – Kerasiotis Ioannis Feb 08 '23 at 08:49

1 Answer


On my 64-bit Intel machine with Apple clang version 12.0.0, I timed `ix <= 0x3fe921fb` with an `int`-typed `ix`, executed 2^30 (about a billion) times, repeating the whole run several times. I then did the same for `ix < 0x3fe921fb54442d18` with an `unsigned long`-typed `ix`. Here are the results in seconds:

No optimization 32-bit:

1.470922
1.448247
1.449718
1.446084
1.450020
1.453608

No optimization 64-bit:

1.567637
1.561653
1.565024
1.575094
1.567794
1.564141

-O3 32-bit:

0.421903
0.419469
0.425281
0.419894
0.425790
0.424800

-O3 64-bit:

0.636965
0.640522
0.637279
0.634344
0.634989
0.633755

The 32-bit comparisons, at least on my machine, are consistently slightly faster.

I would go for the first option. It is faster on my machine regardless of optimization setting, even though it is a 64-bit machine, the 32-bit constant takes less code space, and it may use marginally less energy per comparison. Please also note that `long unsigned int ix = *(long unsigned int *)(&x);` with a `double`-typed `x` is technically undefined behavior: it violates strict aliasing, so read the bits with `memcpy` (or a union) instead.
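To illustrate the strict-aliasing point, here is a sketch of your 64-bit test rewritten without the pointer cast (`double_bits` and `small_enough` are names I made up; compilers optimize the `memcpy` into a single register move):

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a double's bits without undefined behavior. */
static uint64_t double_bits(double x) {
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits); /* assumes 64-bit IEEE-754 double */
    return bits;
}

/* The 64-bit variant of the range test, written safely. */
static int small_enough(double x) {
    uint64_t ix = double_bits(x) & 0x7fffffffffffffffULL; /* clear sign bit */
    return ix <= 0x3fe921fb54442d18ULL; /* |x| <= pi/4 */
}
```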

Test Code:

#include <stdio.h>
#include <time.h>

int main(void) {
    {
        volatile int ix = 0; // volatile so the comparison is not optimized away
        clock_t before = clock();
        for (int i = 1 << 30; i--;) {
            volatile int sink = ix <= 0x3fe921fb;
            (void)sink;
        }
        printf("%f\n", (double)(clock() - before) / CLOCKS_PER_SEC);
    }
    {
        volatile unsigned long ix = 0;
        clock_t before = clock();
        for (int i = 1 << 30; i--;) {
            volatile int sink = ix < 0x3fe921fb54442d18;
            (void)sink;
        }
        printf("%f\n", (double)(clock() - before) / CLOCKS_PER_SEC);
    }
    return 0;
}
user16217248
  • 1
    What test code did you use, compiled how? Or did you write by hand in asm so you could compare what an optimizing compiler might do once? The biggest difference is that it takes more / larger instructions to put a 64-bit constant in a register, like `mov rax, 0x3fe921fb54442d18` (10 bytes on x86-64), vs. a 32-bit immediate being directly usable as an immediate for `cmp ecx, 0x3fe921fb` (or against a 32-bit memory operand). – Peter Cordes May 23 '23 at 05:48
  • 1
    But yeah, 32-bit operand-size is generally better when it's possible, on x86-64. Slightly smaller code-size even when you don't have arbitrary constants. [The advantages of using 32bit registers/instructions in x86-64](https://stackoverflow.com/q/38303333) . That's not t rue on AArch64 or most other 64-bit ISAs, where code-size doesn't differ except when you need to materialize a 64-bit constant. I assume however you compiled your test, the compiler couldn't put the constant in a register once outside a loop? – Peter Cordes May 23 '23 at 05:52
  • @PeterCordes I have now provided the test code, and tried with both optimizations, and without, and got faster speeds, but the 32-bit remains faster. – user16217248 May 23 '23 at 06:04
  • 1
    Ok, that lets the compiler hoist the 64-bit constant out of the loop; https://godbolt.org/z/oes86r5W4 shows the work per compare is loading the 64-bit volatile `ix`, compare, and booleanizing the result into a 32-bit 0 or 1 (which is non-trivial on x86, taking 2 extra instructions beyond making an 8-bit 0/1 unfortunately). – Peter Cordes May 23 '23 at 06:13
  • 1
    Anyway, since it can hoist the constant out of the loop, and there's no I-cache pressure from code size, I'd expect both loops to run at the same speed except for quirks of code alignment (especially on Skylake if you don't use [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646)) – Peter Cordes May 23 '23 at 06:13
  • @PeterCordes So even though in C booleans are just `int`, in x86 they're status flags, and require instructions to read the status flags into an `int`, right? – user16217248 May 23 '23 at 06:20
  • 1
    Yes, only a few architectures (notably MIPS and RISC-V) have compare instructions that compare into 0/1 integers in general-purpose integer registers. x86 has an instruction (`setcc`) to write an 8-bit register or memory location with a 0/1 according to a FLAGS condition, but widening that to 32-bit takes yet another instruction, preferably xor-zeroing the full register before you even do the compare. AArch64 is much better, able to materialize a compare result into a 0/1 in one more instruction with the flexible and cleverly designed `csinc` (conditional-select inc) insn on the zero reg. – Peter Cordes May 23 '23 at 06:24
  • 1
    @PeterCordes So on x86 I should **not** use `int` to store booleans then, because that would cause extra widening instructions. I should probably just use `bool`/`_Bool` but if not that then `uint_fast8_t`. – user16217248 May 23 '23 at 06:26
  • 1
    Right, that would save at least one uop. I checked on my desktop (i7-6700k Skylake at 3.9GHz), with `gcc -O3 -Wa,-mbranches-within-32B-boundaries comp.c` (GCC13.1) for this code wrapped in a function (https://godbolt.org/z/nE9M5GTrP), and as expected I got identical times for 32-bit and 64-bit compare, best case 0.413106 sec for 64-bit, 0.413146 sec for 32-bit compares. The throughput cost would be different for a case like the OP's where the function isn't done in a loop, so the integer constant needs to be loaded or generated from immediates somehow once per use. – Peter Cordes May 23 '23 at 06:32
  • Of course on a modern system you'd just do an FP compare, like x86-64 `ucomisd` (after an `andps` to do absolute value), especially if you didn't already need parts of the FP bit-pattern in integer registers. That's faster than `movq rax, xmm0` / `shr rax, 32` / `and eax, 0x7fffffff` / `cmp eax, 0x3fe921fb`. – Peter Cordes May 23 '23 at 06:35
  • 1
    https://godbolt.org/z/a3EjKWhqo shows the loop with `volatile _Bool sink = ...`. Clang chooses `setb byte ptr [rsp + 7]` (which decodes to 2 uops for the front-end (fused-domain) on most CPUs, or even 3 on some more recent Intel. https://uops.info/) GCC does `setb` into an 8-bit register and stores that. It has a false dependency on the old value of the register (since 8-bit register writes leave the rest of the reg unmodified), but on Skylake at least the loop bottlenecks on front-end throughput since the pipeline is 4-wide at its narrowest, and the loop is 5 uops. https://uica.uops.info/ – Peter Cordes May 23 '23 at 06:38
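As the comments note, on a modern system the integer bit tricks can often be skipped entirely: comparing `fabs(x)` directly compiles to an `andps` plus `ucomisd` on x86-64. A minimal sketch under that assumption (`in_kernel_range` is a name I made up; the hex float literal is exactly the double nearest pi/4; note fdlibm must still handle NaN inputs separately, since this compare is false for NaN):

```c
#include <math.h>

/* Modern alternative: compare |x| as floating point instead of
   extracting integer words from the representation. */
static int in_kernel_range(double x) {
    return fabs(x) <= 0x1.921fb54442d18p-1; /* pi/4 as a C99 hex float */
}
```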