
Consider massively SIMD-vectorized loops over very large amounts of floating-point data (hundreds of GB) that, in theory, should benefit from non-temporal ("streaming", i.e. cache-bypassing) loads/stores.

Using a non-temporal store (_mm256_stream_ps) actually does improve throughput significantly, by roughly 25% over a plain store (_mm256_store_ps).

However, I could not measure any difference when using _mm256_stream_load_si256 instead of _mm256_load_ps.

Does anyone have an example where _mm256_stream_load_si256 can be used to actually improve performance?

(Instruction set & hardware: AVX2 on AMD Zen 2, 64 cores)

for (size_t i = 0; i < 1000000000 /* larger than L3 cache size */; i += 8)
{
  #ifdef USE_STREAM_LOAD
  __m256 a = _mm256_castsi256_ps(_mm256_stream_load_si256((__m256i *)(source + i)));
  #else
  __m256 a = _mm256_load_ps(source + i);
  #endif

  a *= a;   // GNU vector-extension multiply

  #ifdef USE_STREAM_STORE
  _mm256_stream_ps(destination + i, a);
  #else
  _mm256_store_ps(destination + i, a);
  #endif
}

1 Answer


stream_load (vmovntdqa) is just a slower version of a normal load (it costs an extra ALU uop) unless you use it on a WC memory region (uncacheable, write-combining).

The non-temporal hint is ignored by current CPUs, because unlike NT stores, the instruction doesn't override the memory ordering semantics. We know that's true on Intel CPUs, and your test results suggest the same is true on AMD.

Its purpose is copying from video RAM back to main memory, as in an Intel whitepaper. It's useless (on current CPUs) unless you're copying from some kind of uncacheable device memory.
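For reference, here's a heavily simplified sketch of that pattern (not the whitepaper's actual code): NT loads pull data from a WC-mapped region through a small cacheable bounce buffer, then an ordinary copy writes it out to normal write-back memory. The function name, buffer size, and the assumption that wc_src really points at 32-byte-aligned WC memory (with bytes a multiple of the buffer size) are all made up for illustration; on plain WB memory this is no faster than normal loads.

#include <immintrin.h>
#include <string.h>

void copy_from_wc(void *dst, void *wc_src, size_t bytes)
{
    _Alignas(32) char bounce[4096];                  // small, stays hot in L1d
    for (size_t off = 0; off < bytes; off += sizeof(bounce)) {
        __m256i *src = (__m256i *)((char *)wc_src + off);
        for (size_t j = 0; j < sizeof(bounce) / 32; j++) {
            __m256i v = _mm256_stream_load_si256(src + j);       // vmovntdqa from WC memory
            _mm256_store_si256((__m256i *)(bounce + j * 32), v); // into the cacheable buffer
        }
        memcpy((char *)dst + off, bounce, sizeof(bounce));       // copy out to WB memory
    }
}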

See also What is the difference between MOVDQA and MOVNTDQA, and VMOVDQA and VMOVNTDQ for WB/WC marked region? for more details. As my answer there points out, what can sometimes help, if tuned carefully for your hardware and workload, is NT prefetch (prefetchnta) to reduce cache pollution. But tuning the prefetch distance is pretty brittle: too far ahead and the data will be fully evicted by the time you read it, instead of just missing L1 and hitting in L2.
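As a rough illustration (not code from that answer), NT prefetch in a loop like the question's might look like the following; the prefetch distance PF_DIST is a made-up starting point that would need per-machine tuning:

#include <immintrin.h>
#include <stddef.h>

#define PF_DIST 512   // distance in floats; placeholder value, tune per machine

void square_stream(float *destination, const float *source, size_t n)
{
    for (size_t i = 0; i < n; i += 8) {
        // Hint: bring a line in ahead of time while minimizing cache pollution.
        // (Prefetching a little past the end of the array is harmless; prefetch never faults.)
        _mm_prefetch((const char *)(source + i + PF_DIST), _MM_HINT_NTA);
        __m256 a = _mm256_load_ps(source + i);
        a = _mm256_mul_ps(a, a);
        _mm256_stream_ps(destination + i, a);   // NT store, as in the question
    }
    _mm_sfence();   // NT stores are weakly ordered; fence before other threads read the result
}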

There wouldn't be much, if anything, to gain in bandwidth anyway. Normal stores cost a read plus an eventual write-back on eviction for each cache line. The Read For Ownership (RFO) is required by cache coherency and by the fact that write-back caches only track dirty status on a whole-line basis. NT stores can increase bandwidth by avoiding those RFO reads.

But plain loads aren't wasting anything; the only downside is evicting other data as you loop over huge arrays, generating boatloads of cache misses, if you can't change your algorithm to have any locality.


If cache-blocking is possible for your algorithm, there's much more to gain from that, so you don't just bottleneck on DRAM bandwidth: do multiple steps over a subset of your data while it's hot in cache, then move on to the next subset, as in the sketch below.
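A hypothetical sketch of that idea (step1, step2, and the block size are placeholders, not from the original answer): run every pass over one cache-sized block before moving to the next block, instead of making each pass sweep the whole array.

#include <stddef.h>

// Placeholder per-element work, just for illustration
static inline float step1(float x) { return x * x; }
static inline float step2(float x) { return x + 1.0f; }

#define BLOCK (64 * 1024 / sizeof(float))   // ~64 KiB of floats, comfortably under typical L2 size

void two_passes_blocked(float *data, size_t n)
{
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t end = (start + BLOCK < n) ? start + BLOCK : n;
        for (size_t i = start; i < end; i++) data[i] = step1(data[i]);
        for (size_t i = start; i < end; i++) data[i] = step2(data[i]);   // block is still hot in cache
    }
}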

See also How much of ‘What Every Programmer Should Know About Memory’ is still valid? - most of it; go read Ulrich Drepper's paper.

Anything you can do to increase computational intensity helps (ALU work done each time the data is loaded into L1d cache, or into registers).

Even better, make a custom loop that combines multiple steps that you were going to do on each element. Avoid stuff like for(i) A[i] = sqrt(B[i]) if there is an earlier or later step that also does something simple to each element of the same array.
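For example (sqrtf and the scaling factor here are invented stand-ins for whatever the real steps are), fusing two trivial passes means each element of B only comes in from DRAM once:

#include <math.h>
#include <stddef.h>

// Instead of two separate sweeps,
//     for (i) A[i] = sqrtf(B[i]);
//     for (i) A[i] *= scale;      // re-reads the whole array from memory
// do both steps in one pass over the data:
void sqrt_and_scale(float *A, const float *B, float scale, size_t n)
{
    for (size_t i = 0; i < n; i++)
        A[i] = sqrtf(B[i]) * scale;
}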

If you're using NumPy or something, and just gluing together optimized building blocks that operate on large arrays, it's kind of expected that you'll bottleneck on memory bandwidth for algorithms with low computational intensity (like STREAM add or triad type of things).

If you're using C with intrinsics, you should be aiming higher. You might still bottleneck on memory bandwidth, but your goal should be to saturate the ALUs, or at least bottleneck on L2 cache bandwidth.

Sometimes it's hard, or you haven't gotten around to all the optimizations on your TODO list that you can think of, so NT stores can be good for memory bandwidth if nothing is going to re-read this data any time soon. But consider that a sign of failure, not success. CPUs have large, fast caches; use them.



  • "The non-temporal hint is ignored by current CPUs" thanks a lot for this information. This explains a lot. (I thought that completely bypassing all caches must _somehow_ improve performance also for _loading_ from RAM. But good to know that this is not the case. Spares a lot of time futile experimenting & tuning. – zx-81 Aug 27 '22 at 10:54
  • @zx-81: Cheers. After posting this, I realized I had more to say; see my update. – Peter Cordes Aug 27 '22 at 10:59
  • The actual code uses cache blocking and the other strategies you suggested most of the time. But there are still low-arithmetic-intensity "sweep-like" operations that inherently use each input array slot exactly once (for example, a fast-sweeping front-propagation algorithm, and a 7-point Jacobi stencil sweep as part of a multigrid smoother). Those are algorithmically bottlenecked on RAM bandwidth (120 GB/sec in my case). – zx-81 Aug 27 '22 at 11:01
  • @zx-81: Yup, then that would be one of those cases where you just have to suck it up, and it's usually fine as long as it's not where your code spends most of its time. Do note that you'll have a hard time getting 120GB/sec memory bandwidth from a single core (due to limited LFBs and *higher* latencies on server chips, single-core bandwidth on a big Xeon is [ironically lower than on a desktop](https://stackoverflow.com/questions/39260020/why-is-skylake-so-much-better-than-broadwell-e-for-single)). But if you can split it up across threads, mem B/W can be saturated with most of the cores. – Peter Cordes Aug 27 '22 at 11:08
  • @zx-81: And glad to hear you didn't need that explanation on boosting computational intensity, but hopefully it'll benefit some future readers. – Peter Cordes Aug 27 '22 at 11:09
  • Thanks again for your extremely helpful answers. You hit the (often missing) right spot between whole technical papers (too detailed, no time to read) and casual answers (short but often too general). – zx-81 Aug 27 '22 at 12:08
  • @zx-81: Thanks, yeah, that's exactly what I aim for. :) A few other people write some technical-enough CPU-architecture answers these days (although BeeOnRope isn't very active on SO any more), but a good fraction of older performance-related questions have very general answers with details often wrong if present at all. Over the years I've been able to improve that some, and upvote some existing old answers with good stuff. – Peter Cordes Aug 27 '22 at 12:12