`movntss` is AMD-only (SSE4A), supported from K10 onwards. It's slower than `movntps` on Bulldozer-family and Ryzen, though (one per 4c throughput vs. one per 1c for Ryzen's `movntps xmm`).

`movnti` (from an integer register) has the same throughput as `movntps xmm` on AMD Piledriver (2c), Steamroller (1c), and Ryzen (1c). And `movnti` is part of SSE2, so it's available (and efficient) on Intel CPUs, too.

Your numbers are integers (and you need them in integer registers anyway to use the low bits as an array index), so if you were going to use NT stores for this, you'd use `movnti`, not `movntss`.
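If you're writing this in C rather than asm, `_mm_stream_si32` is the SSE2 intrinsic that compiles to `movnti`. A minimal sketch (the helper name is mine, not from your code):

```c
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical helper: NT store of one int, compiles to a movnti
 * instruction.  The store bypasses the cache hierarchy. */
static inline void store_nt(int *p, int v) {
    _mm_stream_si32(p, v);
}
```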
> on a CPU with 24KB 6-way set associative data caches
All CPUs with SSE2 have much larger L2 caches, which you need to consider. An L2 hit is much, much faster than RAM.
That's a very distinctive size. You have an Intel Silvermont or in-order Atom (Bonnell or Saltwell) with 24kiB L1d and at least 512kiB of L2 cache (per core, or shared between a pair of cores).
But anyway, that's not an AMD at all, so `movntss` was never an option. AMD's low-power Bobcat / Jaguar have normal 32kiB L1d caches, and their mainstream cores have 64kiB (K8/K10), 16kiB (Bulldozer-family), or 32kiB (Ryzen) L1d caches, and all have much larger L2 caches.
More importantly, write-back L1d + L2 caches will effectively give you write-combining for your output buckets. I don't think you want NT stores at all.
You do need your `int *x[]` array to stay hot in L1d, because you're read-modify-writing it inside the loop. But I think that will happen naturally with ordinary (pseudo-)LRU replacement.
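To make that concrete, here's a sketch of the kind of bucketing loop I'm assuming (the names and signature are hypothetical, not from your code). With plain stores, the write-back caches absorb and combine the writes to each bucket, and `x[]` / `counts[]` stay hot because every iteration touches them:

```c
#include <stddef.h>

/* Hypothetical bucketing loop with plain stores: L1d/L2 do the
 * write-combining for each output bucket, and x[] / counts[] stay hot
 * in L1d because every iteration read-modify-writes them. */
void scatter(const int *data, size_t n, int *x[], size_t counts[],
             unsigned mask)   /* mask = nbuckets - 1, nbuckets a power of 2 */
{
    for (size_t i = 0; i < n; i++) {
        unsigned b = (unsigned)data[i] & mask;  /* low bits pick the bucket */
        x[b][counts[b]++] = data[i];            /* plain store: L1d absorbs it */
    }
}
```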
NT stores are terrible with too many output streams. They work best when you can store a complete cache line before the line-fill buffer gets flushed, which happens if the memory subsystem needs the buffer for other lines coming into / out of L1d.
On mainstream Intel, each core has had 10 LFBs since Nehalem; see *Where is the Write-Combining Buffer located? x86*. (With hyperthreading, they're shared between the two logical cores of a physical core, but IDK if it's static partitioning like the store buffer or competitive sharing like L1d itself.)
On mainstream cores (IDK about Atom/Silvermont), NT stores have higher latency before handing the cache line off to outer levels of the memory subsystem (see *Enhanced REP MOVSB for memcpy*), but avoiding an RFO might be an advantage. You'd have to measure.
My biggest concern is that NT stores would be terrible if there's any pattern in your data that leads to multiple not-quite-consecutive stores to the same bucket. A pattern that L1d could have absorbed cheaply could be terrible with NT stores that flush before the next store can join the line in a write-combining buffer.
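If you did want NT stores despite all that, the usual workaround is software write-combining: batch one full 64-byte line per bucket in a small cached staging buffer, then flush it with back-to-back NT stores so the fill buffer sees a complete line. A sketch (sizes and names are my assumptions; the destination must be at least 16-byte aligned for `_mm_stream_si128`):

```c
#include <emmintrin.h>  /* SSE2 */

#define LINE_INTS 16    /* 16 ints = one 64-byte cache line */

/* Software write-combining sketch: flush one full line of staged ints
 * with NT stores, so the fill buffer gets 64 consecutive bytes and can
 * hand off a complete line.  dst must be 16-byte (ideally 64-byte)
 * aligned; staged may be unaligned. */
static void flush_line_nt(int *dst, const int *staged) {
    for (int i = 0; i < LINE_INTS; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(staged + i));
        _mm_stream_si128((__m128i *)(dst + i), v);
    }
}
```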
> so this code would lead to lots of cache misses
You might be better off doing two passes; the first pass using few enough buckets that the output bins stay hot in cache most of the time (at least if you skew them so they aren't all hitting the same set in your cache).
Then sort each bucket separately; ideally each bucket fits in L1d cache.
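A minimal sketch of that two-pass idea (the bucket count, the top-bits bucket function, and the use of `qsort` are my assumptions, not from your code; it assumes non-negative 32-bit keys so the top bits give ascending bucket order, and it omits the set-skewing trick):

```c
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64   /* few enough streams that every bucket stays hot */

static int cmp_int(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Pass 1: histogram + scatter into scratch by the top 6 bits.
 * Pass 2: sort each (hopefully L1d-sized) bucket in place. */
void two_pass_sort(int *data, size_t n, int *scratch) {
    size_t counts[NBUCKETS] = {0}, pos[NBUCKETS];

    for (size_t i = 0; i < n; i++)                    /* histogram */
        counts[(unsigned)data[i] >> 26]++;            /* top 6 bits: 0..63 */

    for (size_t b = 0, sum = 0; b < NBUCKETS; b++) {  /* prefix sum */
        pos[b] = sum;
        sum += counts[b];
    }

    for (size_t i = 0; i < n; i++)                    /* scatter */
        scratch[pos[(unsigned)data[i] >> 26]++] = data[i];

    for (size_t b = 0, off = 0; b < NBUCKETS; b++) {  /* sort each bucket */
        qsort(scratch + off, counts[b], sizeof(int), cmp_int);
        off += counts[b];
    }
    memcpy(data, scratch, n * sizeof(int));           /* sorted result */
}
```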