Why is GCC subtracting 1 and comparing <= 2? Is cmp faster with powers of two in assembly?

Question

I was writing some code to clear the screen to a particular color. C++ code:

void clear_screen(unsigned int color, void *memory, int height, int width) {
  unsigned int *pixel = (unsigned int *)memory;
  for (auto y = 0; y < height; y++)
    for (auto x = 0; x < width; x++)
      *pixel++ = color;
}

I used g++ and objconv to generate the corresponding assembly. This is what I got, and I've commented what I think some of the lines do too.

renderer_clear_screen:
        push    r13                                     
        push    r12                                     
        push    rbp                                     
        push    rdi                                     
        push    rsi                                     
        push    rbx                                     
        mov     r11d, ecx            ; move the color into r11d
        mov     ebx, r8d             ; move the height into ebx
        mov     rcx, rdx             ; 000E _ 48: 89. D1
        test    r8d, r8d             ; 
        jle     _cls_return          ; basically, return if width or height is 0
        test    r9d, r9d             ; ( window minimized )
        jle     _cls_return          ;
        mov     r8d, r9d             ; height = width
        mov     esi, r9d             ; esi = width
        mov     edi, r9d             ; edi = width
        xor     r10d, r10d           ; r10d = 0
        shr     esi, 2               ; esi = width / 4
        movd    xmm1, r11d           ; move the lower 32-bits of the color into xmm1
        lea     r12d, [r9-1]         ; r12d = width - 1
        shl     rsi, 4               ; 003F _ 48: C1. E6, 04
        mov     ebp, r8d             ; 0043 _ 44: 89. C5
        shl     rdi, 2               ; 0046 _ 48: C1. E7, 02
        pshufd  xmm0, xmm1, 0        ; 004A _ 66: 0F 70. C1, 00
        shl     rbp, 2               ; 004F _ 48: C1. E5, 02

ALIGN   8
?_001:  cmp     r12d, 2                                
        jbe     ?_006                ; if (width - 1 <= 2) { ?_006 }
        mov     rax, rcx             ; 005E _ 48: 89. C8
        lea     rdx, [rcx+rsi]       ; 0061 _ 48: 8D. 14 31

ALIGN   8
?_002:  movups  oword [rax], xmm0    ; 0068 _ 0F 11. 00
        add     rax, 16              ; 006B _ 48: 83. C0, 10
        cmp     rdx, rax             ; 006F _ 48: 39. C2
        jnz     ?_002                ; 0072 _ 75, F4
        lea     rdx, [rcx+rbp]       ; 0074 _ 48: 8D. 14 29
        mov     eax, r8d             ; 0078 _ 44: 89. C0
        cmp     r9d, r8d             ; 007B _ 45: 39. C1
        jz      ?_004                ; 007E _ 74, 1C
?_003:  lea     r13d, [rax+1H]       ; 0080 _ 44: 8D. 68, 01
        mov     dword [rdx], r11d    ; 0084 _ 44: 89. 1A
        cmp     r13d, r9d            ; 0087 _ 45: 39. CD
        jge     ?_004                ; 008A _ 7D, 10
        add     eax, 2               ; 008C _ 83. C0, 02
        mov     dword [rdx+4H], r11d ; 008F _ 44: 89. 5A, 04
        cmp     r9d, eax             ; 0093 _ 41: 39. C1
        jle     ?_004                ; 0096 _ 7E, 04
        mov     dword [rdx+8H], r11d ; 0098 _ 44: 89. 5A, 08
?_004:  add     r10d, 1              ; 009C _ 41: 83. C2, 01
        add     rcx, rdi             ; 00A0 _ 48: 01. F9
        cmp     ebx, r10d            ; 00A3 _ 44: 39. D3
        jnz     ?_001                ; 00A6 _ 75, B0
_cls_return: 
        pop     rbx                  ;
        pop     rsi                  ;
        pop     rdi                  ;
        pop     rbp                  ;
        pop     r12                  ;
        pop     r13                  ; pop all the saved registers
        ret                          ; 

?_006:  ; Local function
        mov     rdx, rcx             ; 00B1 _ 48: 89. CA
        xor     eax, eax             ; 00B4 _ 31. C0
        jmp     ?_003                ; 00B6 _ EB, C8

Now, in ?_001, the compiler compares width - 1 to 2, which is the same thing as comparing the width to 3. My question is: with -O3, why did the compiler choose two instead of three and waste a lea (to move width - 1 into r12d)?
The only thing which makes sense to me is that powers of two are somehow faster to compare. Or maybe it's a compiler quirk?

```g++ (tdm64-1) 10.3.0```, compiled like this: ```g++ -c clear.cpp -O3``` and ```objconv -fnasm clear.o```. You won't be able to reproduce it in compiler explorer, because my question doesn't use Intel asm, it's NASM asm. — avighnac, Jun 02 '22 at 20:48
OK, compiler explorer doesn't have a Windows target for GCC I think, but the one for Linux gives a similar looking result: https://godbolt.org/z/fzbv9TzdK The function entry and exit are a bit different, but otherwise similar. (The assembly dialect isn't so important. Just that I can reproduce the relevant parts.) — user17732522, Jun 02 '22 at 20:51
Right, that is pretty similar. BTW, how did you make it generate nasm like syntax? — avighnac, Jun 02 '22 at 21:03
That's just intel syntax. In compiler explorer you can choose it under "Output". GCC allows choosing the syntax to provide assembly output in. You don't need to disassemble the generated object file for that. See e.g. [this question](https://stackoverflow.com/questions/199966/how-do-you-use-gcc-to-generate-assembly-code-in-intel-syntax). — user17732522, Jun 02 '22 at 21:05
To test with powers of 2, you still need at least 2 instructions: bit test and a jump on condition. The numeric compare requires two instructions. At the microcode level, the bit-test may be faster because it can bypass the ALU. However, we're talking in the realm of nanoseconds here. A cache miss or an instruction cache reload would take (waste) more time. — Thomas Matthews, Jun 02 '22 at 22:18
There doesn't seem to be a reason. Sometimes the compiler just does things because that is how the code ends up being produced and it isn't cheaper to do it another way. The lea might just give a nice alignment for the loop and removing it would add a NOP and gain nothing. — Goswin von Brederlow, Jun 02 '22 at 23:38
No, it's not faster; `cmp` performance is not data-dependent at all. (No integer instructions are, except `[i]div`, or on some old CPUs `mul`/`imul`. And on AMD CPUs before Zen3, `pext` / `pdep`. But anyway, not simple integer stuff. See https://uops.info/). — Peter Cordes, Jun 03 '22 at 03:13

score 4 · Accepted Answer · answered Jun 03 '22 at 03:50

The usual reason for GCC tweaking compare constants is to create smaller immediates, which helps the constant fit in an immediate of whatever width is available. See the related questions "Understanding gcc output for if (a>=3)" and "GCC seems to prefer small immediate values in comparisons. Is there a way to avoid that?" (It always does it, instead of checking whether it's actually useful with this constant on the target ISA.) This heuristic works well for most ISAs, but sometimes not for AArch64 or ARM Thumb, which can encode some immediates as a bit-range / bit-pattern, so it's not always the case that a smaller-magnitude number is better.

The width-1 is not part of that. The -1 is part of a range check to skip the auto-vectorized loop (16 bytes at a time with movups) and go straight to the cleanup, 1..3 scalar stores.

It seems to be checking width >= 1 && width <= 3, i.e. cleanup needed but total size is less than a full vector width. It's not equivalent to signed or unsigned width <= 3 for width=0. Note the unsigned compare: 0 - 1 is above 2U, because -1U is UINT_MAX.

But it already excluded width <= 0 with test r9d, r9d / jle _cls_return, so it would have been better for GCC to just check width <= 3U instead of doing extra work to exclude zero from the range-check. (An lea, and save/restore of R12 which isn't otherwise used!)
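To make the range-check argument concrete, here is a minimal C++ sketch of the three forms being compared. The function names are hypothetical and exist only for illustration; the first function is meant to mirror the lea / cmp / jbe sequence, not to reproduce GCC's internals.

```cpp
// What the emitted code computes: lea r12d,[r9-1] ; cmp r12d,2 ; jbe
// The subtraction is done in unsigned arithmetic, so width == 0 wraps
// around to UINT_MAX instead of becoming -1.
constexpr bool gcc_style_check(int width) {
    return static_cast<unsigned>(width) - 1u <= 2u;
}

// The range that check is equivalent to.
constexpr bool range_check(int width) {
    return width >= 1 && width <= 3;
}

// The cheaper check, correct only after width <= 0 has been excluded.
constexpr bool simple_check(int width) {
    return static_cast<unsigned>(width) <= 3u;
}

static_assert(!gcc_style_check(0), "0 - 1 wraps to UINT_MAX, which is above 2");
static_assert(simple_check(0), "width <= 3U on its own would accept 0");

// Hypothetical helper: returns true if the forms agree on [lo, hi].
// simple_check is only compared for positive widths, matching the
// earlier test r9d,r9d / jle that already rejected width <= 0.
inline bool forms_agree(int lo, int hi) {
    for (int w = lo; w <= hi; ++w) {
        if (gcc_style_check(w) != range_check(w)) return false;
        if (w > 0 && gcc_style_check(w) != simple_check(w)) return false;
    }
    return true;
}
```

Calling forms_agree(-8, 64) should return true: the subtract-then-compare form matches the two-sided range everywhere, and once zero and negative widths are ruled out, the single compare against 3 gives the same answer without the extra lea.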

(The cleanup also looks over-complicated: it could, e.g., use movq [rdx], xmm0 if more than 1 uint is needed, instead of some weird branching around for various cases. And even better, if the total size is >= 4 uints, just do another movups that ends at the end of the range, possibly overlapping with previous stores.)
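The overlapping final store can be sketched in portable C++ as follows. This is only an illustration under stated assumptions: fill_row_overlapping is a hypothetical name, std::memcpy of a 16-byte block stands in for movups, and the function assumes at least 4 pixels per row (shorter rows would still need a scalar path).

```cpp
#include <cstddef>
#include <cstring>

// Fill `count` 32-bit pixels with `color`, four pixels (16 bytes) at a time.
// Assumes count >= 4.
inline void fill_row_overlapping(unsigned int *row, std::size_t count,
                                 unsigned int color) {
    const unsigned int block[4] = {color, color, color, color};
    std::size_t i = 0;
    // Main loop: whole 16-byte blocks, standing in for the movups loop.
    for (; i + 4 <= count; i += 4)
        std::memcpy(row + i, block, sizeof block);
    // Tail: one more 16-byte store that ends exactly at the end of the row.
    // It may rewrite pixels the loop already stored, which is harmless
    // because every pixel receives the same value.
    if (i < count)
        std::memcpy(row + count - 4, block, sizeof block);
}
```

With count = 7, the loop writes pixels 0..3 and the tail store writes pixels 3..6, so the 1..3 leftover pixels are handled without any per-pixel branching.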

Yes, this is a missed optimization; you can report it on https://gcc.gnu.org/bugzilla/enter_bug.cgi?product=gcc (now that you know it's a missed optimization; it's good that you asked here first instead of filing a bug without first figuring out if the instruction could be avoided).

> The only thing which makes sense to me is that powers of two are somehow faster to compare.

No, it's not faster; cmp performance is not data-dependent at all. (No integer instructions are, except sometimes [i]div. And on AMD CPUs before Zen3, pext / pdep. But anyway, not simple integer add/compare/shift stuff. See https://uops.info/).

And BTW, we can reproduce your GCC asm output on Godbolt by telling it this function is __attribute__((ms_abi)), or there's a command-line option to set the calling convention default. (It's really only useful for looking at the asm; it's still using GNU/Linux headers and x86-64 System V type widths like 64-bit long. Only a proper MinGW (cross-)compiler could show you what GCC would really do when targeting Windows.)
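For reference, this is roughly what the attribute looks like on the function from the question. It is a sketch, not the exact source behind any Godbolt link: CLEAR_SCREEN_ABI is a hypothetical macro added here so the file still compiles on compilers or targets without ms_abi, and the cast and loop variables are written out explicitly.

```cpp
// ms_abi is a GCC/Clang function attribute for x86-64 targets. The macro
// expands to nothing elsewhere (hypothetical name, for illustration only).
#if defined(__GNUC__) && defined(__x86_64__)
#define CLEAR_SCREEN_ABI __attribute__((ms_abi))
#else
#define CLEAR_SCREEN_ABI
#endif

// With ms_abi, the four arguments arrive in rcx, rdx, r8 and r9, and
// rdi/rsi are call-preserved, which matches the listing in the question.
CLEAR_SCREEN_ABI
void clear_screen(unsigned int color, void *memory, int height, int width) {
    unsigned int *pixel = static_cast<unsigned int *>(memory);
    for (int y = 0; y < height; y++)
        for (int x = 0; x < width; x++)
            *pixel++ = color;
}
```

Callers in the same translation unit need no changes; the compiler handles the calling-convention switch at each call site.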

The Godbolt output is GAS .intel_syntax noprefix, which is MASM-like, not NASM, but the difference would only be obvious with addressing modes involving global variables.
