3

Is there exist a syntax to force C compiler to use memory operand directly ?

In the good old asm time we simply write in the instruction where to take operand - 'real' register or memory pointer (location pointed by address).

But in the intrinsics pseudo-asm for C I do not see the way to force compiler to use memory pointer in the instruction (reject to load data from memory (cache) to 'register' i.e. trash register file loaded content to cache and cause reloading with penalty).

I understand it is easy to programmer to simply write 'variable' operand to instinsic and let compiler decide if load from memory first or use it directly (if possible).

Current task: I want to calculate SAD of a sequence of 8x8 8bit blocks at AVX2 CPU with 512 byte register file (16 ymm 'registers' of 32bytes each). So it can load 8 8x8 8bit source blocks to fully fill available AVX2 register file.

I want to load source blocks in all register file and test different 'ref' locations from memory against these source blocks and each ref location only once. So I want to prevent CPU from loading ref blocks from cache to register file and use 'memory operand' in sad instruction.

With asm we simply write something like

(load all 16 ymm registers with src)
vpsadbw ymm0, ymm0, [ref_base_address_register + some_offset...]

But at the C-text with intrinsic it is

__m256i src = load_src(src_pointer);
__m256i ref = load_ref(ref_pointer); 
__m256i sad_result= _mm256_sad_epu8(src, ref)

It do not have ways to point compiler to use valid memory operand like

__m256i src = load_src(src_pointer);
__m256i sad_result= _mm256_sad_epu8(src, *ref_pointer)

Or depend on the 'task size' if compiler will run out of available registers it will automatically switched to memory operand version and programmer can write

__m256i sad_result=_mm256_sad_epu8(*(__m256i*)src_pointer, *(__m256i*)ref_pointer)

and expect compiler will load one of 2 operands to register file and use next from memory ?

DTL2020
  • 71
  • 3

1 Answers1

5

No, there isn't, except for a few specific intrinsics that have a pointer operand even though they aren't pure load or pure store1.

Part of the purpose of intrinsics is to abstract away register allocation details, just like it does for int or double, so it's up to the compiler to keep stuff in registers when that's a good thing. This does normally happen, so check the asm output if you're worried that the optimizer failed to fold a load intrinsic into a memory source operand (e.g. on https://godbolt.org/ or locally). AVX (VEX encoding) allows folding even unaligned loads, because unlike legacy-SSE, alignment isn't required by default.

This can suck when compilers fail at it, like many used to for _mm256_cvtepu8_epi32( _mm_loadl_epi64(p) ) - GCC used to emit an actual movq load and a reg-reg vpmovzxbd. Only in GCC9 and later do we get a memory-source vpmovzxbd. (Loading 8 chars from memory into an __m256 variable as packed single precision floats)

Or for your case, if the compiler is spilling the wrong things, the only fix is to file a missed-optimization bug report and wait for a new compiler version. Or to write a version in asm (inline or stand-alone).


The designers of the intrinsics model also wanted to provide load/loadu and store/storeu intrinsics to communicate alignment info to the compiler. (And for float/double, to cast between float* and __m128* or whatever.) _mm_load_si128((__m128i*)foo) is exactly identical to *(__m128i*)foo and pretty much the same as accessing an element of an array of __m128i, if the compiler can't see through the array and keep it in registers. See Is `reinterpret_cast`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?

The load intrinsics confusingly look like asm loads/store, but they're actually fundamentally different when optimizations are enabled.


Footnote 1: AVX-512 has some special instructions that have correspondingly interesting intrinsics like VPMOVDB mem128 {k}, zmm2 - void _mm512_mask_cvtepi32_storeu_epi8(void * d, __mmask16 k, __m512i a);. Being able to store to memory gave Xeon Phi (Knight's Landing) a way to do byte-masked stores without AVX-512BW for vmovdqu8.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847