
I wanted to try to atomically reset 256 bits using something like this:

#include <x86intrin.h>
#include <iostream>
#include <array>
#include <atomic>

int main() {

    // An array of atomic 256-bit vectors - this is the part that fails to build.
    std::array<std::atomic<__m256i>, 10> updateArray;

    __m256i allZeros = _mm256_setzero_si256();

    // error: std::atomic<__m256i> has no member named 'fetch_and'
    updateArray[0].fetch_and(allZeros);
}

but I get compiler errors saying the element has no fetch_and(). Is this not possible because a 256-bit type is too large to guarantee atomicity?

Is there any other way I can implement this? I am using GCC.

If not, what is the largest type I can reset atomically? 64 bits?

EDIT: Could any AVX instructions perform the fetch-AND atomically?

user997112
  • I believe 64-bit is the largest on a 64-bit x64 platform – Severin Pappadeux Jun 20 '15 at 00:00
  • 1
    Surely an AVX vector-AND operation must be inherently atomic? – user997112 Jun 20 '15 at 00:01
  • 2
    256 bits are one half cache line, so it's certainly possible on x86 (it's certainly always atomic, on that platform coincidence). Whether the implementation of `std::atomic` supports it is another question... most people won't need that. There is a difference between what the hardware factually supports and what the C++ implementation supports logically. – Damon Jun 20 '15 at 00:02
  • This is highly platform dependent. On an 8-bit platform it will be 8 bits. On a 32-bit platform, it would be 32-bits. Usually the size is that of the processor's word size. Also has to do with the width of the data bus and address bus (inside and outside the processor). – Thomas Matthews Jun 20 '15 at 00:08
  • A native AVX instruction has two choices: a) it can go through the caches, which necessarily makes the operation atomic since only complete cache lines can be read and written or b) it can crash because of an unaligned memory access. Everything except write-combining writes and unaligned access is atomic on X86 (but by "coincidence" since the CPU works that way, not by contract). – Damon Jun 20 '15 at 00:10
  • @ThomasMatthews: It's not platform dependent. AVX implies "X86_64". – Damon Jun 20 '15 at 00:11
  • 1
    Note that "going through caches" does not guarantee exclusive access, and thus race conditions of who gets there first can happen - if two CPU's have the same data in cache, and each makes a different modification, "which one wins"? This is the main reason for having the atomic operations and the LOCK prefix. – Mats Petersson Jun 20 '15 at 08:42
  • @ThomasMatthews although Intel CPUs are 64-bit, they must have a 256-bit data bus to implement AVX? – user997112 Jun 20 '15 at 15:48
  • @user997112: Sorry, I don't remember the internal architecture of a 64-bit Intel processor. You should be able to look it up. One example I'm referring to are architectures that would have 32-bit internal bus and require 64-bit data to travel in 2 packets. So, look up the data sheet on a 64-bit Intel processor. – Thomas Matthews Jun 20 '15 at 17:28
  • @Damon - it is plainly false that 128-bit and 256-bit accesses are atomic. The Intel and AMD guides are explicit that only 64-bit accesses are atomic, and plenty of [real hardware](https://stackoverflow.com/a/7647825/149138) splits it up even when aligned. – BeeOnRope Sep 09 '17 at 20:58
  • @MatsPetersson - one of them would win. The cache itself is coherent and at most one CPU can have exclusive access to write a value at one time, but this applies only to operations of 64 bits and smaller. Of course, to do a meaningful "read then act" operation, either an RMW or a series of instructions, you'll need more guarantees than just single read/write atomicity, which is where `lock` comes in. For greater than 64 bits you don't even have single read/write atomicity, so the answer could be "both win" and the resultant value could be one never written by any CPU! – BeeOnRope Sep 11 '17 at 20:12

2 Answers


So there are a few different things that need to be solved:

  1. What can the processor do?
  2. What do we mean by atomically?
  3. Can you make the compiler generate code for what the processor can do?
  4. Does the C++11/14 standard support that?

For #1 and #2:

In x86, there are instructions to do 8-, 16-, 32-, 64-, 128-, 256- and 512-bit operations. One processor will [at least if the data is aligned to its own size] perform that operation atomically. However, for an operation to be "true atomic", it also needs to prevent race conditions within the update of that data [in other words, prevent some other processor from reading, modifying and writing back that same location]. Aside from a small number of "implied lock" instructions, this is done by adding a "lock prefix" to a particular instruction - this will perform the right kind of cache-talk [technical term] with the other processors in the system to ensure that ONLY THIS processor can update this data.
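
As an illustrative sketch (not part of the original question), this is the 64-bit case the hardware does support: with GCC on x86-64, a fetch_and on a 64-bit atomic whose result is unused typically compiles to a single lock-prefixed and instruction.

#include <atomic>
#include <cstdint>

std::atomic<std::uint64_t> flags{~0ULL};   // hypothetical 64-bit bitmask

void clearLowByte() {
    // One locked read-modify-write instruction, e.g. lock and qword ptr [flags], -256
    flags.fetch_and(~0xFFull, std::memory_order_acq_rel);
}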

We can't use VEX-encoded instructions with a LOCK prefix (from Intel's manual):

Any VEX-encoded instruction with a LOCK prefix preceding VEX will #UD

You need a VEX prefix to use AVX instructions, and #UD means "undefined instruction" - in other words, the code will cause a processor exception if we try to execute it.

So, it is 100% certain that the processor can not do an atomic operation on 256 bits at a time. This answer discusses SSE instruction atomicity: SSE instructions: which CPUs can do atomic 16B memory operations?

#3 is pretty meaningless if the instruction isn't valid.

#4 - well, the standard supports std::atomic<uintmax_t>, and if uintmax_t happens to be 128 or 256 bits, then you could certainly do that. I'm not aware of any processor supporting 128 or higher bits for uintmax_t, but the language doesn't prevent it.
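
As a quick sanity check (an illustrative test program, not from the original answer; the Bits256 struct is just a stand-in, and GCC may need -latomic), you can ask the implementation whether a given width is handled lock-free or falls back to a lock inside libatomic:

#include <atomic>
#include <cstdint>
#include <iostream>

struct Bits256 { std::uint64_t w[4]; };   // 256 bits as a trivially copyable struct

int main() {
    std::atomic<std::uint64_t> a64;
    std::atomic<Bits256>       a256;

    std::cout << "64-bit  lock-free: " << a64.is_lock_free()  << '\n';   // 1 on x86-64
    std::cout << "256-bit lock-free: " << a256.is_lock_free() << '\n';   // expect 0, i.e. lock-based
}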

If the requirement for "atomic" isn't as strong as "need to ensure with 100% certainty that no other processor updates this at the same time", then using regular SSE, AVX or AVX512 instructions would suffice - but there will be race conditions if you have two processors (cores) doing read/modify/write operations on the same bit of memory simultaneously.
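
For illustration (the helper name is made up, the integer AND intrinsic needs AVX2, and 32-byte alignment is assumed), this is what that weaker, non-atomic option looks like; each step is a separate instruction, so another core can write in between them and no lock prefix can glue them together:

#include <immintrin.h>

void and256(__m256i* p, __m256i mask) {
    __m256i v = _mm256_load_si256(p);   // aligned 256-bit load
    v = _mm256_and_si256(v, mask);      // AND in a register
    _mm256_store_si256(p, v);           // separate store: the whole read/modify/write is not atomic
}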

The largest atomic operation on x86 is CMPXCHG16B, which will swap two 64-bit integer registers with the content in memory if the value in two other registers MATCHES the value in memory. So you could come up with something that reads one 128-bit value, ANDs out some bits, and then stores the new value back atomically if nothing else got in there first - if that happened, you have to repeat the operation, and of course, it's not a single atomic AND operation either.
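
A minimal sketch of that retry loop (the names and the choice of unsigned __int128 are illustrative; assumes GCC on a CPU with cmpxchg16b, compiled with -mcx16 and possibly linked with -latomic):

#include <atomic>

using u128 = unsigned __int128;

// 128-bit "fetch_and" built from a compare-exchange retry loop.
u128 fetchAnd128(std::atomic<u128>& target, u128 mask) {
    u128 expected = target.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads 'expected' with the value some
    // other processor stored, and the loop simply tries again.
    while (!target.compare_exchange_weak(expected, expected & mask,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
    }
    return expected;   // the old value, like fetch_and would return
}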

Of course, on other platforms than Intel and AMD, the behaviour may be different.

Mats Petersson
  • Thanks for your answer. Did you mean to say "CMPXCHG16B swaps two 128-bits", rather than 64-bits? – user997112 Jun 20 '15 at 15:47
  • cmpxchg16b swaps two 64-bit registers with two (consecutive) 64-bit memory locations, so one 128-bit unit at a time. I'm fairly sure the intention is to swap the forward and backward pointers in a doubly-linked list with 64-bit pointers without using a mutex or similar locks. But of course, any 128-bit data item CAN be swapped with this. – Mats Petersson Jun 20 '15 at 18:55
  • @user997112 cmpxchg16b is not universally supported; some older CPUs have no such instruction, and the only guaranteed atomicity is for 64-bit types. Though if you limit yourself to CPUs with AVX, all of them have cmpxchg16b, I believe – Severin Pappadeux Jun 21 '15 at 19:33
  • @SeverinPappadeux: Good point. I "grew up" on 64-bit AMD processors, so I keep forgetting that the first generation (or two) of Intel's didn't implement this instruction. – Mats Petersson Jun 22 '15 at 07:21
  • @MatsPetersson - even simple 128-bit or 256-bit read or write operations (not RMW) aren't guaranteed to be atomic, even if fully aligned. This occurs in practice for reasons such as (a) these reads or writes are often implemented internally as multiple reads or writes of a smaller size (this is common on x86 for a few generations when a larger size is introduced) or (b) the inter-socket transport may use a transport granule smaller than 128-bits (e.g., 64-bits) which can expose torn reads/writes even if the path to memory is full-width. See [here](https://stackoverflow.com/a/7647825/149138). – BeeOnRope Sep 09 '17 at 21:01
  • @BeeOnRope: That's what I was trying to say with the "However ..." sentence just after the discussion on operand sizes of instructions. – Mats Petersson Sep 10 '17 at 05:50
  • @MatsPetersson - I thought you were talking about RMW operations there. I think it is important to be clear that 8, 16, 32 and 64-bit read and write operations are atomic (even between processors), but the same guarantee doesn't apply to 128-bit or 256-bit reads or writes. For example, at the end you mention that you could read a 128-bit value and then use `CMPXCHG16B` to try to atomically swap it, but it is important to note that _even the initial read_ has to be done with `lock CMPXCHG16B` or you risk a torn value. That's not true with 64-bit values, though. – BeeOnRope Sep 11 '17 at 20:08

The operation can only be atomic if the memory read/modify/write all happens as a single operation, e.g. `lock and [mem], rax` is atomic. (Intel's insn ref manual explicitly says that the lock prefix does work with `and` to make it atomic.)

Since typical AVX instructions like VPAND can have memory source operands (combining a memory read with modifying a register), but not memory destination operands (read/modify/write), this whole idea isn't going to work.

Mats Petersson's answer does a good job explaining what you can do, but I just wanted to point out why normal AVX can't possibly be used as single-instruction atomic operations. You have to load, modify, and cmpxchg, and then try again if something else modified the memory between the load and the cmpxchg.

Peter Cordes