TL;DR: I understand that MOVNTI stores are not ordered relative to the rest of the program, so SFENCE/MFENCE is needed. But are MOVNTI stores also unordered relative to other MOVNTI stores from the same thread?
Assume I have a producer-consumer queue and want to use MOVNTI on the producer side to avoid cache pollution.
(I have not actually observed a cache pollution effect yet, so this is probably a theoretical question for now.)
So I'm replacing the following producer:
    std::atomic<std::size_t> producer_index;
    QueueElement queue_data[MAX_SIZE];
    ...
    void producer()
    {
        for (;;)
        {
            ...
            queue_data[i].d1 = v1;
            queue_data[i].d2 = v2;
            ...
            queue_data[i].dN = vN;
            producer_index.store(i, std::memory_order_release);
        }
    }
With the following:
    void producer()
    {
        for (;;)
        {
            ...
            _mm_stream_si64(&queue_data[i].d1, v1);
            _mm_stream_si64(&queue_data[i].d2, v2);
            ...
            _mm_stream_si64(&queue_data[i].dN, vN);
            _mm_sfence();
            producer_index.store(i, std::memory_order_release);
        }
    }
Notice that I added _mm_sfence, which makes the preceding "non-temporal" stores globally observable before any store that follows it.
Without it, the consumer may observe the new producer_index before the queue_data changes.
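For concreteness, the fenced version can be written as a small compilable unit. This is only a minimal sketch under assumptions: the two-field QueueElement layout, the helper names publish and consume_sum, and storing i + 1 (a count of published slots) rather than i are all illustrative and not taken from my real queue. It builds only on x86-64, where _mm_stream_si64 is available.

```cpp
#include <immintrin.h>  // _mm_stream_si64, _mm_sfence (x86-64 only)
#include <atomic>
#include <cstddef>

// Hypothetical element layout; the real QueueElement is not shown above.
struct QueueElement {
    long long d1;
    long long d2;
};

constexpr std::size_t MAX_SIZE = 1024;
QueueElement queue_data[MAX_SIZE];
std::atomic<std::size_t> producer_index{0};  // assumed: number of published slots

// Producer side: non-temporal data stores, then a fence, then the index.
void publish(std::size_t i, long long v1, long long v2) {
    _mm_stream_si64(&queue_data[i].d1, v1);
    _mm_stream_si64(&queue_data[i].d2, v2);
    _mm_sfence();  // order the NT stores above before the index store below
    producer_index.store(i + 1, std::memory_order_release);
}

// Consumer side: spin until slot i is published, then read it.
long long consume_sum(std::size_t i) {
    while (producer_index.load(std::memory_order_acquire) <= i) {
        // busy-wait; a real queue would back off or park here
    }
    return queue_data[i].d1 + queue_data[i].d2;
}
```

In the intended use, publish runs on the producer thread and consume_sum on the consumer thread; the acquire load pairs with the release store, and the _mm_sfence is what covers the non-temporal data stores.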
But what if I write the index with _mm_stream_si64 too?
    std::size_t producer_index_value;
    std::atomic_ref<std::size_t> producer_index { producer_index_value };
    void producer()
    {
        for (;;)
        {
            ...
            _mm_stream_si64(&queue_data[i].d1, v1);
            _mm_stream_si64(&queue_data[i].d2, v2);
            ...
            _mm_stream_si64(&queue_data[i].dN, vN);
            _mm_stream_si64(&producer_index_value, i);
        }
    }
According to my reading of the Intel manuals, this shouldn't work, because non-temporal stores have relaxed ordering.
But did they say "relaxed" only to mean that non-temporal operations are not ordered against the rest of the program?
Maybe they are ordered among themselves, so the producer would still work as expected?
And if MOVNTI is truly relaxed, so that the last version is incorrect, what is the reason for the memory writes to be reordered?