
To do a radix sort for numbers in [0, 2^20) on a CPU with 24KB 6-way set associative data caches: if base 2^10 is chosen, only 24B of cache can be provided per digit bucket, so this code would lead to lots of cache misses:

    int *x[1024], c[1024] = {0};                       // bucket write pointers and counts
    for (int i = 0; i < n; i++) c[A[i] & 1023]++;      // histogram of the low 10 bits
    for (int i = 0, s = 0; i < 1024; i++) { x[i] = B + s; s += c[i]; }  // prefix sums -> bucket starts in B
    for (int i = 0; i < n; i++) *(x[A[i] & 1023]++) = A[i];  // scatter: each live pointer needs its own 64B+ cache line

So I'm thinking of bypassing the cache and storing values directly to memory with MOVNTSS, or of simulating a 16B buffer per bucket and flushing it with MOVNTPS, roughly as sketched below. How much performance is lost with MOVNTSS, or with the simulated buffering? Or what does it depend on?
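
Here's roughly what I mean by the second option (just a sketch continuing the code above; I use movntdq via `_mm_stream_si128` instead of movntps since the data is integer, and it assumes each bucket's start in B is 16-byte aligned, which my code above doesn't guarantee):

    #include <emmintrin.h>  // SSE2: _mm_stream_si128 (movntdq), _mm_load_si128

    _Alignas(16) int buf[1024][4];  // staging buffers: 4 ints per bucket, 16KB total
    int fill[1024] = {0};           // ints currently staged per bucket

    for (int i = 0; i < n; i++) {
        int b = A[i] & 1023;
        buf[b][fill[b]++] = A[i];
        if (fill[b] == 4) {         // group complete: one 16B NT store
            _mm_stream_si128((__m128i *)x[b], _mm_load_si128((const __m128i *)buf[b]));
            x[b] += 4;
            fill[b] = 0;
        }
    }
    // leftover partial groups still need plain stores, plus an sfence before reading B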

l4m2
  • Do the pointers in the x array point to (mostly) sequential locations? – Hadi Brais Jul 05 '18 at 12:14
  • @HadiBrais https://paste.ubuntu.com/p/wF5Rrf27W2/ – l4m2 Jul 05 '18 at 15:04
  • I don't quite know the distribution – l4m2 Jul 05 '18 at 15:05
  • 1
    You only benefit from NT stores when the stores are mostly sequential in the virtual address space. Your processor has 5 write-combining buffers per logical core, each can hold only a single cache line. If sequential writes are not to the same cache line, then the core has to wait to free a WC buffer even if it was only partially written to. This requires writing to main memory directly, which is very slow. If you have something like a "likely sequential access", then it might be a good idea to use NT stores. Overall, I can't say for sure. You have to measure it. – Hadi Brais Jul 05 '18 at 15:31
  • @HadiBrais Does an NT store buffer need to wait for a [write success] before it's ready for the next write? – l4m2 Jul 05 '18 at 16:16
  • I'm not sure I understand your question. See the link from Peter's answer for basic information on how these buffers work. – Hadi Brais Jul 05 '18 at 16:29

1 Answer


movntss is AMD-only (SSE4A), and supported from K10 onwards. It's slower than movntps, though, on Bulldozer-family and Ryzen. (One per 4c throughput vs. one per 1c for Ryzen's movntps xmm.)

movnti (from an integer register) has the same throughput as movntps xmm on AMD Piledriver (2c), Steamroller (1c), and Ryzen (1c). movnti is part of SSE2, so it's available (and efficient) on Intel CPUs.

Your numbers are integers (and you need them in integer registers anyway to use the low bits as an array index), so if you were going to use NT stores for this, you'd use movnti not movntss.
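
For reference, here's what the scatter loop would look like with movnti, via the SSE2 intrinsic `_mm_stream_si32` (just a sketch reusing the question's A / x / n; as argued below, I don't think you actually want this with 1024 output streams):

    #include <emmintrin.h>  // SSE2: _mm_stream_si32 compiles to movnti

    for (int i = 0; i < n; i++) {
        int v = A[i];
        _mm_stream_si32(x[v & 1023]++, v);  // NT store straight to the bucket, bypassing cache
    }
    _mm_sfence();  // NT stores are weakly ordered: fence before anything reads B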


on a CPU with 24KB 6-way set associative data caches

All CPUs with SSE2 have much larger L2 caches which you need to consider. An L2 hit is much much faster than RAM.

That's a very distinctive size. You have an Intel Silvermont or in-order Atom (Bonnell or Saltwell) with 24kiB L1D and at least 512 KiB of L2 cache (per core or shared between a pair of cores).

But anyway, it's not an AMD at all, so movntss was never an option. AMD's low-power Bobcat / Jaguar have normal 32k L1d caches, and their mainstream cores have 64kiB (K8/K10), 16kiB (Bulldozer-family) or 32kiB (Ryzen) L1d caches, and all have much larger L2 caches.


More importantly, write-back L1d + L2 caches will effectively give you write-combining for your output buckets. I don't think you want NT stores at all.

You do need your int *x[] array to stay hot in L1d because you're read-modify-writing it inside the loop. But I think that will happen anyway under a normal LRU-style replacement policy.


NT stores are terrible with too many output streams. They work best by far when you can store a complete cache line before the line-fill buffer gets flushed, which happens when the memory subsystem needs it for other lines coming into / going out of L1d.

On mainstream Intel, each core has had 10 LFBs since Nehalem; see Where is the Write-Combining Buffer located? x86. (With hyperthreading, they're shared between the logical cores, but IDK if it's static partitioning like the store buffer or competitive sharing like L1d itself.)

On mainstream cores (IDK about Atom/Silvermont), NT stores have higher latency before handing off the cache line to outer levels of the memory subsystem (see Enhanced REP MOVSB for memcpy), but avoiding an RFO might be an advantage. You'd have to measure.

My biggest concern is that this would be terrible if there is any pattern in your data that leads to multiple not-quite-consecutive stores to the same bucket. A pattern that L1d could have absorbed could be terrible with NT stores that flush before the next store can join it in a write-combining buffer.


so this code would lead to lots of cache misses

You might be better off doing two passes; the first pass using few enough buckets that the output bins stay hot in cache most of the time (at least if you skew them so they aren't all hitting the same set in your cache).

Then sort each bucket separately; ideally it will fit in L1d cache.
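
A minimal sketch of that two-pass structure (the 64-bucket split on the top 6 bits, the function names, and the qsort placeholder for the per-bucket sort are all illustrative choices, not tuned recommendations):

    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)  // hypothetical helper
    {
        return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
    }

    void two_pass_sort(const int *A, int *B, int n)   // hypothetical: sorts A into B
    {
        // Pass 1: partition 20-bit keys on their top 6 bits -> 64 output streams,
        // few enough that each bucket's tail line stays hot in L1d while scattering.
        int c[64] = {0}, start[64], s = 0;
        int *x[64];
        for (int i = 0; i < n; i++) c[(A[i] >> 14) & 63]++;
        for (int i = 0; i < 64; i++) { start[i] = s; s += c[i]; }
        for (int i = 0; i < 64; i++) x[i] = B + start[i];
        for (int i = 0; i < n; i++) *(x[(A[i] >> 14) & 63]++) = A[i];

        // Pass 2: each bucket is ~n/64 elements and hopefully L1d-sized; sort it
        // in place with any cache-friendly method (a further radix pass, or the
        // stdlib qsort used here purely as a placeholder).
        for (int i = 0; i < 64; i++)
            qsort(B + start[i], c[i], sizeof(int), cmp_int);
    }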

Peter Cordes
  • 1
    MOVNTSS is part of AMD's SSE4a instruction set, first implemented in K10 and supported by all later AMD processors. Intel never supported SSE4a. MOVNTSS performs a 32-bit NT store from an XMM register to memory. SSE4a also offers MOVNTSD, which is a 64-bit NT store from an XMM. – Hadi Brais Jul 05 '18 at 12:11
  • 10 LFBs per physical core. However, if that core is hyperthreaded, the LFBs are statically partitioned (IIRC), and so each logical core would have only 5 LFBs. – Hadi Brais Jul 06 '18 at 07:46
  • Note that the OP's CPU is mentioned at the end of the [file](https://stackoverflow.com/questions/51186224/how-much-performance-loss-for-movntss#comment89370739_51186224) they shared. I also find this part `on a CPU with 24KB 6-way set associative data caches, if base 2^10 is chosen, only 24B cache can be provided for each digit` very confusing. – Hadi Brais Jul 06 '18 at 08:02
  • @HadiBrais: Oh, you were talking about a Haswell i3. If the OP wants to update the question, that's fine, but until then I'm going to answer it as written, and it's about an Atom or Silvermont. I couldn't find anything about the number of LFBs on Atom / Silvermont. Also, I'm not sure if the LFBs are statically partitioned on Haswell. L1d cache itself isn't, and they're used for pending lines to/from L1d. But Agner's uarch guide doesn't mention them, and Intel's optimization manual doesn't say how they're shared between logical cores. :/ – Peter Cordes Jul 06 '18 at 08:11
  • I guess static is plausible for fairer B/W, but it would cut the single-thread bandwidth in ~half if one core is memory intensive and the other isn't. – Peter Cordes Jul 06 '18 at 08:13
  • 1
    For Atom, it is mentioned in D.2 of the Intel optimization manual that the memory execution sub-system has 8 WC buffers. Older uarchs such as Intel Core has also 8 fill buffers as mentioned in 2.5.4 of the manual. I can't find right now a source that says they are statically partitioned. But why does Intel say `Only four write- combining buffers are guaranteed to be available for simultaneous use` in 3.6.10? I think this indicates that they are statically partitioned when hyperhreading is enabled and also considers uarchs that have only 8 LFBs. – Hadi Brais Jul 06 '18 at 08:23
  • Yes, the L1d is shared. – Hadi Brais Jul 06 '18 at 08:25
  • @HadiBrais: Ah good find, I was only searching on words, not reading sections. I guess I could test by pinning a `pause` or `divps` loop to one core, and a memcpy or 256b-vector triad loop to another core, and see how much throughput suffers. – Peter Cordes Jul 06 '18 at 08:40
  • PAUSE does NOT make statically partitioned resources free for use by the other logical core. HLT, MWAIT/MONITOR, or completely disabling hyperthreading achieve that. – Hadi Brais Jul 06 '18 at 08:45
  • 1
    @HadiBrais: Yeah, so it's perfect for testing whether a resource is statically partitioned or competitively shared, by comparing with the normal situation where a program has the whole core to itself. (BTW, there's a `cpu_clk_thread_unhalted.one_thread_active` perf counter you can use to verify that your test had the core to itself the whole time.) – Peter Cordes Jul 06 '18 at 08:58
  • 1
    I would be pretty surprised if LFBs were statically partitioned rather than competitively shared. It would be a big blow to certain hyperthreaded scenarios and it's not obvious to me why it would be necessary. – BeeOnRope Jul 07 '18 at 01:42