I noticed a weird thing today. When copying a long double [1], all of gcc, clang and icc generate fld and fstp instructions with TBYTE memory operands.
That is, the following function:
void copy_prim(long double *dst, long double *src) {
    *src = *dst;
}
Generates the following assembly:
copy_prim(long double*, long double*):
    fld TBYTE PTR [rdi]
    fstp TBYTE PTR [rsi]
    ret
Now according to Agner's tables this is a poor choice for performance, as fld takes four uops (none fused) and fstp takes a whopping seven uops (none fused), versus, say, a single fused uop each for movaps to/from an xmm register.
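One way to probe whether the x87 round trip is tied to the long double type itself is to copy the same storage as raw bytes and compare the generated assembly. This is only a minimal sketch: copy_bytes is a hypothetical helper, not part of the code above, and whether a particular compiler emits SSE moves for it is something to check in its output rather than something assumed here.

```cpp
#include <cstring>

// Hypothetical helper (not from the original code): copies the whole object
// representation of a long double, padding bytes included, instead of
// loading and storing it as a floating point value.
void copy_bytes(long double *dst, const long double *src) {
    std::memcpy(dst, src, sizeof(long double));
}
```

Calling copy_bytes(&b, &a) leaves b comparing equal to a, so it is a drop-in replacement for the plain assignment as far as the value is concerned.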
Interestingly, clang starts using movaps as soon as you put the long double in a struct. The following code:
struct long_double {
    long double x;
};
void copy_ld(long_double *dst, long_double *src) {
    *src = *dst;
}
Compiles to the same assembly with fld/fstp as previously shown for gcc and icc, but clang now uses:
copy_ld(long_double*, long_double*):
    movaps xmm0, xmmword ptr [rdi]
    movaps xmmword ptr [rsi], xmm0
    ret
Oddly, if you stuff an additional int member into the struct (which doubles its size to 32 bytes due to alignment), all compilers generate SSE-only copy code:
copy_ldi(long_double_int*, long_double_int*):
    movdqa xmm0, XMMWORD PTR [rdi]
    movaps XMMWORD PTR [rsi], xmm0
    movdqa xmm0, XMMWORD PTR [rdi+16]
    movaps XMMWORD PTR [rsi+16], xmm0
    ret
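For reference, the source for this last case is not shown above, so here is a sketch of what it presumably looks like. The struct layout and the member names x and i are assumptions; only the function name and parameter types come from the assembly label. The helper ldi_size is hypothetical and exists just to make the size easy to inspect.

```cpp
#include <cstddef>

// Assumed definition: a long double plus one int. On x86-64 targets where
// long double is 16 bytes with 16-byte alignment, tail padding rounds the
// struct up to 32 bytes.
struct long_double_int {
    long double x;
    int i;
};

// The size is always a whole multiple of the alignment, so that every
// element of an array of these structs stays aligned.
static_assert(sizeof(long_double_int) % alignof(long_double_int) == 0,
              "struct size must be a multiple of its alignment");

// Same shape and copy direction as copy_prim and copy_ld: the value read
// through the first parameter is stored through the second.
void copy_ldi(long_double_int *dst, long_double_int *src) {
    *src = *dst;
}

// Hypothetical helper for checking the size on a given target.
constexpr std::size_t ldi_size() { return sizeof(long_double_int); }
```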
Is there any functional reason to copy floating point values with fld and fstp, or is it just a missed optimization?
[1] Although a long double (i.e., x86 extended precision float) is nominally 10 bytes on x86, it has sizeof == 16 and alignof == 16 since alignments have to be a power of two and the size must usually be at least as large as the alignment.
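The size and alignment claims in this footnote can be checked on any given target with a few constants. This is a sketch only: the names ld_size, ld_align, ld_value_bytes and ld_padding are made up for illustration, and the 10-byte figure is applied only when LDBL_MANT_DIG reports the 64-bit mantissa of the x87 extended format.

```cpp
#include <cfloat>
#include <cstddef>

// Storage size and alignment of long double on the current target.
constexpr std::size_t ld_size = sizeof(long double);
constexpr std::size_t ld_align = alignof(long double);

// The x87 extended format (64-bit mantissa) occupies 10 bytes of value.
// For any other long double format, assume no padding at all.
constexpr std::size_t ld_value_bytes = (LDBL_MANT_DIG == 64) ? 10 : ld_size;

// Bytes of storage that carry no value information.
constexpr std::size_t ld_padding() { return ld_size - ld_value_bytes; }

// Holds on every conforming target: size is a multiple of alignment.
static_assert(ld_size % ld_align == 0,
              "sizeof must be a multiple of alignof");
```

On a target matching the footnote (sizeof == 16 with the x87 format), ld_padding() comes out as 6, which is the part a 16-byte SSE copy moves in addition to the value itself.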