Does _mm_stream_load_si128 (movntdqa) modify the memory its argument points to?

Question

_mm_stream_load_si128 is declared as

__m128i _mm_stream_load_si128 (__m128i * mem_addr)

while _mm_load_si128 is declared as

__m128i _mm_load_si128 (__m128i const* mem_addr)

Does the former modify the contents of what mem_addr points to? If not, what's the motivation for the non-const declaration?

@J...: Actually NT *loads* do *not* override the memory-order semantics of the memory region. `movntdqa` on normal WB memory is just a slow `movdqa` on current CPUs; they don't do anything with the NT hint to try to minimize cache pollution (e.g. by only loading into one "way" of L3 like `prefetchnta` does). See also [Non-temporal loads and the hardware prefetcher, do they work together?](https://stackoverflow.com/q/32103968) - HW prefetch isn't NT-aware, so it wouldn't really make sense to actually bypass cache. The NT hint only does anything on WC memory, which is already weakly ordered. – Peter Cordes Sep 08 '21 at 21:13
@J... What you said sounds like it was talking specifically about NT stores (not the SSE4.1 NT loads this question is about), especially with the mention of `sfence`. Like other loads, `movntdqa` loads are not ordered wrt. `sfence`. [`movntdqa`](https://www.felixcloutier.com/x86/movntdqa) is specifically a load. The SSE2 SIMD-integer NT store is [`movntdq`](https://www.felixcloutier.com/x86/movntdq), and like other NT *stores* it does do what you say, overriding the strong mem-order semantics of WB memory regions to be more like WC. That's why libc memset / memcpy use NT stores for large copies. (But not NT loads.) – Peter Cordes Sep 09 '21 at 00:05
@PeterCordes Quite right, that's exactly where my mind was. I missed, oddly enough, that this was about loads, lol. Must be getting tired. – J... Sep 09 '21 at 00:10
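
For contrast with the NT load the question asks about, here is a minimal sketch of the NT *store* case the comments above describe: `_mm_stream_si128` (`movntdq`) followed by `_mm_sfence`. The helper name `fill_nt` and its parameters are hypothetical, made up for this illustration; only the two intrinsics are real. It assumes an x86-64 target, where SSE2 is part of the baseline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_set1_epi32, _mm_sfence */

/* Hypothetical helper: fill n 16-byte blocks with a 32-bit pattern using
   non-temporal stores. dst must be 16-byte aligned. */
static void fill_nt(__m128i *dst, size_t n, int32_t value)
{
    assert(((uintptr_t)dst & 15u) == 0);   /* movntdq faults on misaligned addresses */

    const __m128i v = _mm_set1_epi32(value);
    for (size_t i = 0; i < n; ++i)
        _mm_stream_si128(dst + i, v);      /* weakly ordered NT store */

    /* NT stores are not ordered like normal stores, so fence before any
       later store that publishes this buffer to another thread. */
    _mm_sfence();
}
```

Nothing comparable is needed after `_mm_stream_load_si128`: as noted above, `sfence` does not order loads, and the NT load keeps the ordering of the memory region it reads from.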

score 4 · Accepted Answer · edited Sep 08 '21 at 19:11

I think there is no real reason for this declaration. Compare _mm256_stream_load_si256 and _mm512_stream_load_si512, which are the same operation for wider operands: both take a const argument.

Also, in the <smmintrin.h> that ships with Visual Studio 2015, the parameter is const:

/*
 * Load double quadword using non-temporal aligned hint
 */

extern __m128i _mm_stream_load_si128(const __m128i*);
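
To illustrate the point, here is a minimal sketch of a const-correct wrapper for headers that declare the parameter as non-const, plus a check that the source bytes are unchanged after the load. The names `stream_load_const`, `stream_load_leaves_source_unchanged`, and `SSE41_TARGET` are hypothetical, introduced only for this example. The `target("sse4.1")` attribute is a GCC/Clang feature and an assumption of this sketch; it should be unnecessary with MSVC or when building with `-msse4.1`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <smmintrin.h>  /* SSE4.1: _mm_stream_load_si128 */

/* Assumption: GCC/Clang per-function target selection, so the file builds
   without -msse4.1. Running it still needs a CPU with SSE4.1. */
#if defined(__GNUC__)
#define SSE41_TARGET __attribute__((target("sse4.1")))
#else
#define SSE41_TARGET
#endif

/* Hypothetical wrapper: takes a const pointer and casts const away only to
   satisfy headers that declare the parameter as __m128i *. */
SSE41_TARGET
static __m128i stream_load_const(const __m128i *src)
{
    assert(((uintptr_t)src & 15u) == 0);   /* movntdqa requires 16-byte alignment */
    return _mm_stream_load_si128((__m128i *)src);
}

/* Returns 1 if the loaded value matches the source and the source bytes are
   identical before and after the load, 0 otherwise. */
SSE41_TARGET
static int stream_load_leaves_source_unchanged(void)
{
    static _Alignas(16) const uint8_t data[16] = {
        0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15
    };
    uint8_t before[16], loaded[16];

    memcpy(before, data, sizeof before);
    __m128i v = stream_load_const((const __m128i *)data);
    _mm_storeu_si128((__m128i *)loaded, v);

    return memcmp(before, data, sizeof before) == 0
        && memcmp(loaded, data, sizeof loaded) == 0;
}
```

The source array is `static const`, which typical toolchains place in read-only memory, so an instruction that wrote through the pointer would be expected to fault rather than pass this check.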

edited Sep 08 '21 at 19:11 by Yun
answered Sep 08 '21 at 18:42 by Alex Guteniev

The underlying instruction is a pure load with no side effects; it does not even override the memory-order semantics of the memory region. (I.e. it only does anything special on WC memory; on current CPUs the NT hint is fully ignored on WB (normal write-back cacheable) memory, so it's just a slow `movdqa` costing an extra uop.) – Peter Cordes Sep 08 '21 at 21:15
