
_mm_stream_load_si128 is declared as

__m128i _mm_stream_load_si128 (__m128i * mem_addr)

while _mm_load_si128 is declared as

__m128i _mm_load_si128 (__m128i const* mem_addr)

Does the former modify the contents of what mem_addr points to? If not, what's the motivation for the non-const declaration?

MWB
  • @J...: Actually NT *loads* do *not* override the memory-order semantics of the memory region. `movntdqa` on normal WB memory is just a slow `movdqa` on current CPUs; they don't do anything with the NT hint to try to minimize cache pollution (e.g. by only loading into one "way" of L3 like `prefetchnta` does). See also [Non-temporal loads and the hardware prefetcher, do they work together?](https://stackoverflow.com/q/32103968) - HW prefetch isn't NT-aware, so it wouldn't really make sense to actually bypass cache. It only does anything on WC memory, which is already weakly ordered. – Peter Cordes Sep 08 '21 at 21:13
  • @J... What you said sounds like it was talking specifically about NT stores (not SSE4.1 NT loads this question is about), especially with the mention of `sfence`. Like other loads, `movntdqa` loads are not ordered wrt. `sfence`. [`movntdqa`](//www.felixcloutier.com/x86/movntdqa) is specifically a load. The SSE2 SIMD-integer NT store is [`movntdq`](//www.felixcloutier.com/x86/movntdq), and like other NT *stores* does do what you say, overriding the strong mem-order semantics of WB memory regions to be more like WC. That's why libc memset / memcpy use it for large copies. (But not NT loads). – Peter Cordes Sep 09 '21 at 00:05
  • @PeterCordes Quite right, that's exactly where my mind was. I missed, oddly enough, that this was about loads, lol. Must be getting tired. – J... Sep 09 '21 at 00:10

1 Answer


I think there is no reason for it to be declared this way. Compare _mm256_stream_load_si256 and _mm512_stream_load_si512, which are the same operation for wider operands: both take a const argument.

Also in <smmintrin.h> that comes with Visual Studio 2015 it is const:

/*
 * Load double quadword using non-temporal aligned hint
 */

extern __m128i _mm_stream_load_si128(const __m128i*);
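If your compiler's header still uses the non-const signature, one way to keep calling code const-correct is a thin wrapper that casts the qualifier away. The following is only a sketch: `stream_load_const` and `loads_agree` are hypothetical names, not part of any header, and the `target("sse4.1")` attribute assumes GCC or Clang on x86-64 (with MSVC it is not needed). The cast is safe only because the intrinsic reads from the pointer and never writes through it.

```c
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 */
#include <string.h>     /* memcmp */

/* Hypothetical const-correct wrapper around the non-const declaration. */
#if defined(__GNUC__)
__attribute__((target("sse4.1")))
#endif
static __m128i stream_load_const(const __m128i *p)
{
    /* Casting away const is fine here: the pointee is only read. */
    return _mm_stream_load_si128((__m128i *)p);
}

/* Hypothetical check: returns 1 if the NT load and the ordinary
   aligned load yield the same 16 bytes from a const source. */
#if defined(__GNUC__)
__attribute__((target("sse4.1")))
#endif
static int loads_agree(void)
{
    const __m128i src = _mm_set_epi32(4, 3, 2, 1); /* 16-byte aligned */
    __m128i a = stream_load_const(&src);           /* movntdqa */
    __m128i b = _mm_load_si128(&src);              /* movdqa   */
    return memcmp(&a, &b, sizeof a) == 0;
}
```

With a header that already declares the parameter as const, as in the Visual Studio 2015 excerpt above, the wrapper is unnecessary and the intrinsic can be called directly on a const pointer.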
Yun
Alex Guteniev
  • The underlying instruction is a pure load with no side-effects, not even overriding the memory-order semantics of the memory region. (i.e. it only does anything special on WC memory; on current CPUs the NT hint is fully ignored on WB (normal write-back cacheable) so it's just a slow `movdqa` costing an extra uop.) – Peter Cordes Sep 08 '21 at 21:15