#include <stdio.h>
#include <iostream>
#include <cstdlib>   // rand()
#include <random>
using namespace std;

volatile int res = 0;

void copy(char* __restrict__ dst, char* __restrict__ src) {
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
}

void copyOffset(char* __restrict__ dst, char* __restrict__ src, size_t offset) {
    dst[0] = src[offset + 0];
    dst[1] = src[offset + 1];
    dst[2] = src[offset + 2];
    dst[3] = src[offset + 3];
}

void copyAsInt(char *dst, char *src) {
    *((int*)dst) = *((int*)src);
}

//----
void copy16(char* __restrict__ dst, char* __restrict__ src) {
    dst[0] = src[0];
    dst[1] = src[1];
    dst[2] = src[2];
    dst[3] = src[3];
    dst[4] = src[4];
    dst[5] = src[5];
    dst[6] = src[6];
    dst[7] = src[7];
    dst[8] = src[8];
    dst[9] = src[9];
    dst[10] = src[10];
    dst[11] = src[11];
    dst[12] = src[12];
    dst[13] = src[13];
    dst[14] = src[14];
    dst[15] = src[15];    
}

void copyOffset16(char* __restrict__ dst, char* __restrict__ src, size_t offset) {
    dst[0] = src[offset + 0];
    dst[1] = src[offset + 1];
    dst[2] = src[offset + 2];
    dst[3] = src[offset + 3];
    dst[4] = src[offset + 4];
    dst[5] = src[offset + 5];
    dst[6] = src[offset + 6];
    dst[7] = src[offset + 7];
    dst[8] = src[offset + 8];
    dst[9] = src[offset + 9];
    dst[10] = src[offset + 10];
    dst[11] = src[offset + 11];
    dst[12] = src[offset + 12];
    dst[13] = src[offset + 13];
    dst[14] = src[offset + 14];
    dst[15] = src[offset + 15];    
}

int main() {
    char *a = new char[1001], *b = new char[16];
    
    //--- which pair of statements below is unsafe or not equal each other?
    copyOffset(b, a, 20);
    res = b[rand() % 4]; // use b[] for something to prevent optimization

    copy(b, &a[20]);
    res = b[rand() % 4];

    //--- non 4 bytes aligned
    copyOffset(b, a, 18);
    res = b[rand() % 4];

    copy(b, &a[18]);
    res = b[rand() % 4];

    //---
    copyOffset16(b, a, 26);
    res = b[rand() % 16];

    copy(b, &a[26]);
    res = b[rand() % 16];

    return 1;
}

I'm trying to copy 4 bytes (both source and destination are guaranteed to be allocated). However, the source address might not be 4-byte aligned. To copy the 4 bytes, I expect the compiler to emit a single DWORD move, like in copyAsInt(). I'm compiling with -O3 -mavx and using Godbolt with GCC 11.2 to inspect the assembly.
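
For comparison, the same copy can also be expressed with a fixed-size memcpy. Below is a minimal sketch (copyMemcpy is not part of the test program above), which GCC and clang typically compile to one 4-byte load and one 4-byte store, regardless of alignment:

#include <cstring>   // std::memcpy
#include <cstddef>   // std::size_t

// Hypothetical helper: copy 4 bytes from a possibly unaligned offset.
// A fixed-size memcpy is strict-aliasing-safe and alignment-safe, and
// optimizing compilers lower it to a single DWORD load/store.
void copyMemcpy(char* __restrict__ dst, const char* __restrict__ src, std::size_t offset) {
    std::memcpy(dst, src + offset, 4);
}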

The function copy() is translated to the same code as copyAsInt(), as expected. However, for some reason, copyOffset() is translated into copying each byte separately.

copy(char*, char*):
        mov     eax, DWORD PTR [rsi]
        mov     DWORD PTR [rdi], eax
        ret
copyOffset(char*, char*, unsigned long):
        movzx   eax, BYTE PTR [rsi+rdx]
        mov     BYTE PTR [rdi], al
        movzx   eax, BYTE PTR [rsi+1+rdx]
        mov     BYTE PTR [rdi+1], al
        movzx   eax, BYTE PTR [rsi+2+rdx]
        mov     BYTE PTR [rdi+2], al
        movzx   eax, BYTE PTR [rsi+3+rdx]
        mov     BYTE PTR [rdi+3], al
        ret

Meanwhile, the functions copy16() and copyOffset16() are both vectorized as expected.

copy16(char*, char*):
        vmovdqu xmm0, XMMWORD PTR [rsi]
        vmovdqu XMMWORD PTR [rdi], xmm0
        ret
copyOffset16(char*, char*, unsigned long):
        vmovdqu xmm0, XMMWORD PTR [rsi+rdx]
        vmovdqu XMMWORD PTR [rdi], xmm0
        ret

So why doesn't the compiler optimize copyOffset() to use a DWORD mov? Also, is there any pair of statements in main() that is unsafe or might behave unexpectedly?

Edit: switching to x86-64 GCC (trunk) makes GCC emit the expected instruction, so I guess this behavior is just down to a compiler heuristic.

  • TL:DR: it's a missed optimization in GCC 11, fixed in trunk. And also a missed optimization in clang, which makes it worse by using byte-`mov` into AL, causing a false dependency. The currently-accepted answer is wrong about almost every separate point it makes, including numbers from Agner's instruction table it links; IMO you should un-accept it to not mislead future readers. – Peter Cordes Dec 24 '21 at 14:54

1 Answer


Because it is more efficient that way. XMM (SSE) or any other SIMD instructions in general tend to be very heavy and as such have very high latency. The compiler takes this into account.

This obsession with SSE/AVX is really overblown. SSE/AVX can be useful in very specific optimizations, but in 99% of the places where compilers use them they are actually ineffective.

Anecdotal evidence, but I once recompiled a very large optimized binary with "-march=x86_64", which removes almost all SIMD extensions, and the performance was actually better than with "-march=native".

A simple MOVB (byte) operation has a latency of 1 cycle, and up to 4 of them can execute at the same time on Intel Skylake. So all 8 instructions there can effectively execute in 2 cycles.

The MOVUPS instruction has a latency of about 5 cycles with a maximum throughput of 2 (at the same time), depending on the platform, and you need two of them: one from memory into %xmm0 and one from %xmm0 to memory, for a total of 5 to 10 cycles.

Of course, things are far more complex than this because of pipeline behavior, micro-ops, ports, the L1/L2/L3 caches, etc., but this is a first-order attempt to explain what the compiler is doing.

Reference: https://www.agner.org/optimize/instruction_tables.pdf

Code: https://godbolt.org/z/3KjxfdW3W

  • But it doesn't need to use a big SIMD instruction. Why doesn't it use a simple move DWORD instruction, like in copyAsInt()? – Huy Le Dec 23 '21 at 04:45
  • If you look back at the table, a `mov 32/64,m` will have a latency of 2 with throughput of 2 against a latency of 1 and throughput of 4 for `mov 8,m`. So it more than makes up. And you got to understand that the data is not really being moved from and to memory with each of these instructions. They are at maximum being moved in and out the L1 cache, if at all. –  Dec 23 '21 at 04:50
  • Actually GCC does exactly what you are saying. https://godbolt.org/z/zf5h9xrqz This is a matter of different heuristic formulas, that's my bet. –  Dec 23 '21 at 05:01
  • Okay, so both functions are perfectly safe (even if the source is not 4-byte aligned). Is that correct? – Huy Le Dec 23 '21 at 05:32
  • Yes, that is correct - for x86, not for other architectures like ARM –  Dec 23 '21 at 10:22
  • Which section of Agner's table has wrong data for `mov r32/64, m` vs `mov r8, m` loads? Merging into the low byte of a register is always worse than writing a whole register. Throughput should be 0.5 for both, with latency at best equal (on P6-family that renames partial registers), or slower for byte loads on most. That's why GCC uses `movzx` not `mov byte`. The load latencies in Agner's tables are basically made up, he says in the intro section he arbitrarily divides the store/reload store-forwarding latency between the load and store instructions, so it doesn't reflect load-use latency. – Peter Cordes Dec 23 '21 at 11:10
  • On Skylake, Agner's tables clearly show that scalar stores (`mov m, r`) on Skylake have 1/clock throughput, regardless of width (1,2,4,8 bytes). And yeah, `mov r8l, m` is 1 uop fused-domain, micro-fused load+ALU-merge. Agner doesn't list a latency for that or `movzx r,m`, but he does list 0.5c throughput. (See [How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent](//stackoverflow.com/q/45660139), although that's not really relevant because GCC correctly uses movzx byte loads when it fails to merge (@HuyLe) – Peter Cordes Dec 23 '21 at 11:13
  • See https://uops.info/ for automated testing, which gets minimum latency by putting it in a loop with a dep chain. They can get load/use latency (when the dep chain couples to the load address), but when doing a store to couple back to the load data, they just choose to assume other operations in the loop are at least 1 cycle, and show latency as <= 2, instead of trying to arbitrarily divide store-forwarding latency between store and load. – Peter Cordes Dec 23 '21 at 11:17
  • @HuyLe: to safely express an unaligned strict-aliasing-safe load or store in C, either use `memcpy` or in GNU C `typedef uint32_t aliasing_unaliged_uint32 __attribute__((aligned(1), may_alias))`, as in [Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?](https://stackoverflow.com/q/47510783) and [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/a/57676035) – Peter Cordes Dec 23 '21 at 11:20 (a sketch of the typedef approach is shown after these comments)
  • And BTW, in case my earlier comments weren't clear, this answer is nonsense, based on a misreading of Agner's instruction tables and a misconception about SIMD; x86 SIMD is very tightly coupled to the integer part of the core, with relatively tight latency, unlike some ARM microarchitectures which stall when getting stuff to/from the SIMD part of the chip. This is a GCC missed optimization that's fixed in trunk. Unaligned loads/stores are cheap enough that it's worth merging even if that introduces a misaligned load or store, especially if it's not too wide so likely won't split a cache line – Peter Cordes Dec 23 '21 at 11:23
  • This answer wouldn't even make sense if it were true: the code that GCC acted weirdly on copied 4 bytes, so it doesn't make sense to use a vector load/store there to begin with; it makes sense to do a DWORD load/store, and *that's* what GCC failed to do. – harold Dec 23 '21 at 11:33
  • Oh, the Godbolt link in this answer is for clang trunk, which unfortunately misses the load/store merging optimization. And uses `movb` to create a false dependency through RAX on all the loads. (Just the ALU merging part; the actual load uops can still overlap). – Peter Cordes Dec 23 '21 at 11:36
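
To make that last suggestion from the comments concrete, here is a minimal sketch of the GNU C typedef approach (the names aliasing_unaligned_u32 and copyMayAlias are only illustrative, and the attributes are GCC/clang extensions, not standard C++):

#include <cstdint>
#include <cstddef>

// Illustrative typedef (name made up): may_alias tells GCC the type may
// alias any other type, and aligned(1) says the object may be unaligned,
// so dereferencing a pointer to it is a safe unaligned 4-byte access.
typedef uint32_t aliasing_unaligned_u32 __attribute__((aligned(1), may_alias));

void copyMayAlias(char* dst, const char* src, size_t offset) {
    // Compiles to a single 4-byte load and store, like copyAsInt(),
    // but without the strict-aliasing and alignment problems.
    *(aliasing_unaligned_u32*)dst = *(const aliasing_unaligned_u32*)(src + offset);
}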