Are there any guarantees about how card to host writes from a PCIe device targeting regular memory are implemented from a software process' perspective, where a single TLP write is fully contained within a single CPU cache-line?
I'm wondering about a case where my device may write some number of words of data followed by a byte to indicate that the structure is now valid (for example an event completion), for example:
struct PCIE_COMPLETION_T {
uint64_t data_a;
uint64_t data_b;
uint64_t data_c;
uint64_t data_d;
uint8_t valid;
} alignas(SYSTEM_CACHE_LINE_SIZE);
Can I use a single TLP to write this structure, such that when software sees the valid member change to 1 (having been previously cleared to zero by software), then will the other data members will also reflect the values that I had written and not a previous value?
Currently I'm performing 2 writes, first writing the data and secondly marking it as valid, which doesn't have any apparent race conditions but does of course add unwanted overhead.
The most relevant question I can see on this site seems to be Are writes on the PCIe bus atomic? although this appears to relate to the relative ordering of TLPs.
Perusing the PCIe 3.0 specification, I didn't find anything that seemed to explicitly cover my concerns, I don't think that I need AtomicOps particularly. Given that I'm only concerned about interactions with x86-64 systems, I also dug through the Intel architecture guide but also came up no clearer.
Instinctively it seems that it should be possible for such a write to be perceived atomically -- especially as it is said to be a transaction -- but equally I can't find much in the way of documentation explicitly confirming that view (nor am I quite sure what I'd need to look at, probably the CPU vendor?). I also wonder if such a scheme can be extended over multiple cachelines -- ie if the valid sits on a second cacheline written from the same TLP transaction can I be assured that the first will be perceived no later than the second?