
TL;DR: I understand that MOVNTI stores are not ordered relative to the rest of the program, so SFENCE/MFENCE is needed. But are MOVNTI stores also unordered relative to other MOVNTI stores from the same thread?


Suppose I have a producer-consumer queue, and I want to use MOVNTI on the producer side to avoid cache pollution.

(I haven't actually observed the cache-pollution effect yet, so this is probably a theoretical question for now.)

So I'm replacing the following producer:

std::atomic<std::size_t> producer_index;
QueueElement queue_data[MAX_SIZE];
...
void producer()
{
    for (;;)
    {
        ...

        queue_data[i].d1 = v1;
        queue_data[i].d2 = v2;
        ...
        queue_data[i].dN = vN;

        producer_index.store(i, std::memory_order_release);
    }
}

With the following:

void producer()
{
    for (;;)
    {
        ...

        _mm_stream_si64(&queue_data[i].d1, v1);
        _mm_stream_si64(&queue_data[i].d2, v2);
        ...
        _mm_stream_si64(&queue_data[i].dN, vN);

        _mm_sfence();

        producer_index.store(i, std::memory_order_release);
    }
}

Notice I added _mm_sfence, which waits until the "non-temporal" store results become observable. If I don't add it, the consumer may observe the new producer_index before the queue_data changes.

But what if I write index with _mm_stream_si64 too?

std::size_t producer_index_value;
std::atomic_ref<std::size_t> producer_index { producer_index_value };

void producer()
{
    for (;;)
    {
        ...

        _mm_stream_si64(&queue_data[i].d1, v1);
        _mm_stream_si64(&queue_data[i].d2, v2);
        ...
        _mm_stream_si64(&queue_data[i].dN, vN);

        _mm_stream_si64(&producer_index_value, i);
    }
}

According to my reading of the Intel manuals, this shouldn't work, since non-temporal stores have relaxed ordering.

But did they say "relaxed" only to mean that non-temporal stores are not ordered against the rest of the program? Maybe they are ordered among themselves, so the producer would still work as expected?

And if MOVNTI is truly relaxed, so that this last version is incorrect, what is the reason for these memory writes to be reordered?

Alex Guteniev

1 Answer


movnti stores are weakly ordered relative to each other as well. In asm you definitely need sfence after storing the data to get release semantics for the store to producer_index, whether you do that store with movnti or a plain mov.

It might happen to work most of the time: the separate index store often wouldn't become visible to other threads until after the full-line writes using NT stores. That's even likely in practice, because completing a cache line triggers a flush of the WC buffer to DRAM (bypassing / evicting cache). But the index store will definitely not complete a full line on its own, unless it happens to be contiguous with the end of the data written.

In C++ that means using _mm_sfence() before whatever you do to store to producer_index.
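A minimal sketch of what that looks like, assuming a hypothetical two-field QueueElement and a fixed-size array (both invented here for illustration): NT stores for the payload, a fence, then an ordinary cached release store for the control variable.

```cpp
#include <atomic>
#include <cstddef>
#include <emmintrin.h>  // _mm_stream_si64, _mm_sfence

// Hypothetical element type and queue storage, for illustration only.
struct QueueElement { long long d1; long long d2; };

constexpr std::size_t MAX_SIZE = 1024;
QueueElement queue_data[MAX_SIZE];
std::atomic<std::size_t> producer_index{0};

void publish(std::size_t i, long long v1, long long v2)
{
    // Non-temporal stores for the bulk data: bypass the cache hierarchy.
    _mm_stream_si64(&queue_data[i].d1, v1);
    _mm_stream_si64(&queue_data[i].d2, v2);

    // Drain the write-combining buffers: all NT stores above become
    // globally visible before any store that follows the fence.
    _mm_sfence();

    // Plain (cached) release store for the control variable, so the
    // consumer's copy of this line can stay hot in shared cache.
    producer_index.store(i, std::memory_order_release);
}
```

The fence is what pairs the relaxed NT stores with the release store; dropping either the fence or the release ordering reopens the visibility race described above.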


Note that using movnti for a single scalar is a really bad idea: it forces the cache line to be evicted from cache, so the reader has to fetch it all the way from DRAM. I.e. it will increase inter-core latency for that control variable, which would otherwise probably hit in L3.

Only use NT stores when you expect to complete a whole cache line, and when you don't expect another thread to be reloading the data soon.
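To illustrate the "complete a whole cache line" point, here is a sketch (with an invented LinePayload type) that writes one 64-byte aligned line with eight 8-byte NT stores, so the write-combining buffer can flush the line in a single burst instead of a partial-line write:

```cpp
#include <emmintrin.h>  // _mm_stream_si64

// Illustrative: 64 bytes, one cache line on current x86, aligned so the
// eight 8-byte NT stores below cover exactly one full line.
struct alignas(64) LinePayload { long long q[8]; };

void stream_line(LinePayload* dst, const long long (&src)[8])
{
    for (int j = 0; j < 8; ++j)
        _mm_stream_si64(&dst->q[j], src[j]);
    // The caller still needs _mm_sfence() before publishing an index
    // that tells another thread this line is ready.
}
```

Writing the line back-to-back like this is what lets the WC buffer complete and flush; interleaving other stores between the eight would risk partial-line evictions.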

Peter Cordes
  • "_Only use NT stores when [...] you don't expect another thread to be reloading the data soon._" I've noticed that the first version (without any `MOVNTI`) is faster, and I didn't find it unexpected. So is the use of `MOVNTI` in a producer-consumer queue never reasonable? – Alex Guteniev Jun 06 '20 at 19:03
    @AlexGuteniev: only if the data is larger than L3 cache size, and/or the consumer doesn't tend to read it any time soon. NT stores force eviction / invalidation of the cache line, if it was already hot in this core or another, or in L3. Or possibly if cache misses for the data in the reader aren't important, compared to not polluting other stuff in cache that's more scattered and would be handled less well by HW prefetch. – Peter Cordes Jun 07 '20 at 00:26
    Producer-consumer performance will depend on implementation and configuration details of the system. If the producer and consumer share a cache, then the data should be written to that cache. If the producer and consumer do not share a cache, then it is often (not always) faster to push the data to memory. https://patents.google.com/patent/US20090216950 – John D McCalpin Jun 08 '20 at 23:01
  • @JohnDMcCalpin, am I understanding correctly that the patent covers an instruction similar to `clflush`, but which evicts data to a shared cache instead of DRAM? And I wouldn't run into patent issue by just implementing a SPSC queue on x64 processor? – Alex Guteniev Jun 09 '20 at 07:22
  • @AlexGuteniev: I didn't read the patent, but an anti-prefetch hint to write back sounds like `cldemote` (Tremont (Atom) and later). I described the difference between that vs. movnt or clflushopt vs. clwb in [How to force cpu core to flush store buffer in c?](https://stackoverflow.com/q/54067605). Also [Is there any way to write for Intel CPU direct core-to-core communication code?](https://stackoverflow.com/q/58741806) / [Does hardware memory barrier make visibility of atomic operations faster in addition to providing necessary guarantees?](https://stackoverflow.com/q/61591287) – Peter Cordes Jun 09 '20 at 08:13
    @AlexGuteniev The "push for sharing" instruction is intended to allow the implementation to define what location in memory is "best" for inter-processor communication, given the topology and cache sharing in the system, without specifying a particular target. The Tensilica processor architecture allows for various FIFO configurations, and some Cray vector machines included "globally shared registers" that could be used for communication. Some more context on the "cache hint" approach: https://sites.utexas.edu/jdm4372/2019/02/18/intels-future-cldemote-instruction/ – John D McCalpin Jun 10 '20 at 13:32