
Background

I am doing parallel operations on rows and columns in images. My images have 8-bit or 16-bit pixels and I'm on a 64-bit machine. When I operate on columns in parallel, two adjacent columns may share the same 32-bit int or 64-bit long. Basically, I want to know whether I can safely operate on individual bytes of the same quadword in parallel.

Minimal Test

I wrote a minimal test function that I have not been able to make fail. For each byte in a 64-bit long, I concurrently perform successive multiplications in a finite field of order p. By Fermat's little theorem, a^(p-1) ≡ 1 (mod p) when p is prime and a is not divisible by p. I vary the values of a and p for each of my 8 threads, and I perform k*(p-1) multiplications by a. When the threads finish, each byte should be 1. And in fact, my test cases pass. Each time I run, I get the following output:

8
101010101010101
101010101010101

My system is Linux 4.13.0-041300-generic x86_64 with an 8 core Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz. I compiled with g++ 7.2.0 -O2 and examined the assembly. I added the assembly for the "INNER LOOP" and commented it. It seems to me that the code generated is safe because the stores are only writing the lower 8 bits to the destination instead of doing some bitwise arithmetic and storing to the entire word or quadword. g++ -O3 generated similar code.

Question

I want to know if this code is always thread-safe, and if not, in what conditions would it not be. Maybe I am being very paranoid, but I feel that I would need to operate on quadwords at a time in order to be safe.

#include <iostream>
#include <pthread.h>

class FermatLTParams
{
public:
    FermatLTParams(unsigned char *_dst, unsigned int _p, unsigned int _a, unsigned int _k)
        : dst(_dst), p(_p), a(_a), k(_k) {}

    unsigned char *dst;
    unsigned int p, a, k;
};

void *PerformFermatLT(void *_p)
{  
    unsigned int j, i;
    FermatLTParams *p = reinterpret_cast<FermatLTParams *>(_p);
    for(j=0; j < p->k; ++j)
    {    
        //a^(p-1) == 1 mod p

        //...BEGIN INNER LOOP
        for(i=1; i < p->p; ++i)
        {
            p->dst[0] = (unsigned char)(p->dst[0]*p->a % p->p);
        }
        //...END INNER LOOP

        /* gcc 7.2.0 -O2  (INNER LOOP)

        .L4:
            movq    (%rdi), %r8             # r8 = dst
            xorl    %edx, %edx              # edx = 0
            addl    $1, %esi                # ++i
            movzbl  (%r8), %eax             # eax (lower 8 bits) = dst[0]
            imull   12(%rdi), %eax          # eax =  a * eax
            divl    %ecx                    # eax = edx:eax / ecx;   edx = edx:eax % ecx
            movb    %dl, (%r8)              # dst[0] = edx (lower 8 bits)
            movl    8(%rdi), %ecx           # ecx = p
            cmpl    %esi, %ecx              # if (i < p)
            ja      .L4                     #   goto L4
        */

    }
    return NULL;
}

int main(int argc, const char **argv)
{
    int i;
    unsigned long val = 0x0101010101010101; //a^0 = 1
    unsigned int k = 10000000;
    std::cout << sizeof(val) << std::endl;
    std::cout << std::hex << val << std::endl;
    unsigned char *dst = reinterpret_cast<unsigned char *>(&val);
    pthread_t threads[8];
    FermatLTParams params[8] = 
    { 
        FermatLTParams(dst+0, 11, 5, k),
        FermatLTParams(dst+1, 17, 8, k),
        FermatLTParams(dst+2, 43, 3, k),
        FermatLTParams(dst+3, 31, 4, k),
        FermatLTParams(dst+4, 13, 3, k),
        FermatLTParams(dst+5, 7, 2, k),
        FermatLTParams(dst+6, 11, 10, k),
        FermatLTParams(dst+7, 13, 11, k)
    };

    for(i=0; i < 8; ++i)
    {
        pthread_create(threads+i, NULL, PerformFermatLT, params+i);
    }
    for(i=0; i < 8; ++i)
    {
        pthread_join(threads[i], NULL);
    }

    std::cout << std::hex << val << std::endl;
    return 0;
}
  • C++ standardese doesn't particularly care about the word-size of the system. As long as the different threads are never racing over the same byte, it is fine (defined behavior). The casting and type-punning you're doing is also fine, since the strict-aliasing rule has an exception for `char`. However, the fact that you have multiple threads writing to adjacent bytes in memory is going to lead to *performance* issues even if there is no race condition that would affect the *correctness* of the code. – Mysticial Oct 24 '17 at 17:48
  • Thanks @Mysticial. I'll make sure to keep the performance issues in mind, and when appropriate for performance I'll ensure the threads operate on distinct words. – MFisherKDX Oct 24 '17 at 18:43
  • @geza. Ok. So there's not much point in doing intensive column operations in parallel then. Try to rewrite by iterating over the rows and saving, for each column, a "work in progress" state in an accumulator array -- which may fit in cache? – MFisherKDX Oct 24 '17 at 19:00
  • @MFisherKDX: I've retracted what I said. First I thought about atomic operations, but this problem doesn't have them. So please benchmark your solution against the single-threaded version, as you may find that you don't have too much to worry about. Plus, I don't know about your intended algorithm, but `div` is a slow operation, so maybe it hides some memory latency caused by cache line sharing. – geza Oct 24 '17 at 19:06
  • @MFisherKDX: I've created a little benchmark, and asked a question: https://stackoverflow.com/questions/46919032/why-does-using-the-same-cache-line-from-multiple-threads-not-cause-serious-slowd – geza Oct 24 '17 at 20:04
  • @geza. I am learning something new. Thank you. – MFisherKDX Oct 24 '17 at 20:11
  • @MFisherKDX: I'm glad to help :) If you want to speed up `div`, check out this: https://stackoverflow.com/questions/45353629/repeated-integer-division-by-a-runtime-constant-value – geza Oct 24 '17 at 20:22
  • Or better, use a template function so the compiler can generate 8 different functions for 8 different threads, with the parameters hard-coded for each. – Peter Cordes Oct 30 '17 at 06:15
  • Good question, but it is an exact duplicate: As Mysticial points out, `char*` can alias anything, so what you're doing with `val` is exactly identical to declaring it as `char dst[8]` in the first place. ISO C++ requires that `char` has no padding, and that array elements are contiguous. ISO C++11 specifies a memory model that explicitly requires this to be safe (not UB for separate threads to write adjacent elements of a `char []`). (And gcc implements it correctly for x86). – Peter Cordes Oct 30 '17 at 06:21
  • As I explained in my answer on [the 2nd duplicate I added](https://stackoverflow.com/questions/46721075/can-modern-x86-hardware-not-store-a-single-byte-to-memory/46818162#46818162), x86 definitely can do byte-stores without non-atomic RMW of the whole 32 or 64 bits. Alpha AXP can't (early versions of the ISA have only 32 and 64-bit stores), so a conforming C++ implementation would have to work around it with atomic RMW or by making `char` a 32-bit type. (You're confusing "byte" with `char`; ISO C++ doesn't guarantee this. It's true if talking about x86, but the x86-only answer is less fun.) – Peter Cordes Oct 30 '17 at 06:23
  • And BTW, yes your idea of striding over columns is very bad. Can you transpose your image first in one thread before handing it out to multiple threads to do something expensive? If not, can you have each thread at least work on contiguous chunks of columns? e.g. thread1 does columns 0-63, thread2 does columns 64-127, so each thread eventually uses all the data in each cache line it touches. (Still potentially very bad if the image stride is a power of 2; you'll get conflict misses so the first row will be evicted when you touch the 9th row, or the 17th row (associativity + 1). – Peter Cordes Oct 30 '17 at 06:26
  • Also, you should *really* assign `p->p` to a local, so the compiler can hoist it out of the loop. The compiler has to assume the `char *` writes could alias `*p`, so it has to reload all your parameters every time. This is why using `char *__restrict dst` helps, but it's also a good idea to assign things to private variables instead of referencing things that the compiler can't know aren't modified by function calls or pointed to by other pointers. – Peter Cordes Oct 30 '17 at 06:28
  • @PeterCordes - the dupes are frankly terrible, they seem to be focusing on hardware issues or "that thing Bjarne said about machines without small loads/stores". The very simple answer is _Yes, the standard guarantees this in is safe in C++11 and beyond. Before that, it wasn't addressed by the standard but it will mostly work_. Neither dupe apparently even gives that answer? – BeeOnRope Oct 30 '17 at 08:07
  • @BeeOnRope: KerrekSB's answer on the first dupe says that, doesn't it? – Peter Cordes Oct 30 '17 at 08:47
  • @BeeOnRope Maybe I shouldn't have closed it as a duplicate if it required that much explanation in comments of why it's a dup. Especially because https://stackoverflow.com/questions/47008183/racing-condition-of-adjacent-data-in-shared-struct-in-c is also kind of a duplicate of this, more so than the two I linked. – Peter Cordes Oct 30 '17 at 08:49
  • @PeterCordes It's the least bad, yup. Still it's the lowest rated answer (I UVed it tho), limits itself to `char` and provides wrong reasoning and no reference to the standard. The question is also unclear and refers to the confusing quote. This question is clear and has an exact answer. – BeeOnRope Oct 30 '17 at 08:50
  • @PeterCordes - don't you dare imply that C and C++ somehow have anything in common. You will be lynched by the _which language do you mean_ and _C is not C++_ and _C++ is not C_ crowds :). In fact, I had to google around to check that `C11` apparently provides similar guarantees to C++11 to answer that. – BeeOnRope Oct 30 '17 at 08:52
  • @BeeOnRope: good point, the other questions were C++11. I was confident that C11 and C++11 agreed on this part of the memory model, but I didn't check. I reopened this question. – Peter Cordes Oct 30 '17 at 09:08
  • @PeterCordes frankly I'm totally fine with reasonable cross-pollination between C and C++ answers, but my impression is that most people who monitor these tags are not. Indeed, it might be good to keep them separate here: apparently the language in their respective standards are quite different in this area, even if the effect in this case is the same. – BeeOnRope Oct 30 '17 at 09:14
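To illustrate the hoisting advice from the comments, here is a sketch of the question's function with `dst`, `p`, `a`, and `k` copied into locals, so the compiler does not have to reload them through the pointer after every byte store (a `char` store may alias `*params`). The name `PerformFermatLT_hoisted` is mine, not from the original code:

```cpp
struct FermatLTParams
{
    unsigned char *dst;
    unsigned int p, a, k;
};

// Same computation as PerformFermatLT, but the parameters are copied into
// locals up front. In the original, the unsigned char store could alias
// *params, forcing a reload of p->p and p->a on every iteration; locals
// let the compiler keep them in registers.
void *PerformFermatLT_hoisted(void *arg)
{
    FermatLTParams *params = static_cast<FermatLTParams *>(arg);
    unsigned char *const dst = params->dst;
    const unsigned int p = params->p, a = params->a, k = params->k;

    for (unsigned int j = 0; j < k; ++j)
        for (unsigned int i = 1; i < p; ++i)
            dst[0] = (unsigned char)(dst[0] * a % p);
    return nullptr;
}
```

As in the question, each pass of the inner loop multiplies the byte by a^(p-1) ≡ 1 (mod p), so a byte that starts at 1 ends at 1.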

1 Answer


The answer is YES, you can safely operate on individual bytes of a 64-bit quadword in parallel, by different threads.

It is amazing that it works, but it would be a disaster if it did not. All hardware acts as if a core writing a byte in its own cache marks not just that the cache line is dirty, but which bytes within it are. When that cache line (64, 128, or even 256 bytes) eventually gets written to main memory, only the dirty bytes actually modify main memory. This is essential, because otherwise, when two threads were working on independent data that happened to occupy the same cache line, they would trash each other's results.

This can be bad for performance, because it works partly through the magic of "cache coherency": when one thread writes a byte, every cache in the system that holds that same line of data is affected. If the line is dirty, it needs to be written to main memory, and then the cache either drops the line or captures the changes from the other thread. There are all kinds of different implementations, but it is generally expensive.

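A minimal sketch of how to sidestep the performance cost described above: give each thread's output its own cache line. The 64-byte line size and the `PaddedByte` name are my assumptions (typical for x86, not guaranteed by the standard):

```cpp
#include <thread>

// Hypothetical per-thread slot: alignas(64) pads each element out to a full
// 64-byte line (an assumed, x86-typical line size), so two threads writing
// their own slots never contend for the same cache line.
struct alignas(64) PaddedByte
{
    unsigned char value;
};

static_assert(sizeof(PaddedByte) == 64, "one slot per (assumed) cache line");

// Each thread repeatedly updates only its own slot; wrapping is modulo 256.
void touch(PaddedByte *slot, unsigned int n)
{
    for (unsigned int i = 0; i < n; ++i)
        slot->value = (unsigned char)(slot->value + 1);
}
```

Writing adjacent `unsigned char`s from different threads is still *correct* without this, as discussed above; the padding only buys back the throughput lost to the line bouncing between cores.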
  • You're right that it's safe, but your 2nd paragraph is *not* the hardware mechanism for cache coherency on x86 or any other common arch. What actually happens is that a CPU can't modify a cache line until it "owns" the line (Exclusive/Modified state of [MESI](https://en.wikipedia.org/wiki/MESI_protocol)), so no other cores have copies of the line, dirty or otherwise, while the store commits to L1D. Related: [Can modern x86 hardware not store a single byte to memory?](https://stackoverflow.com/questions/46721075/can-modern-x86-hardware-not-store-a-single-byte-to-memory). – Peter Cordes Dec 01 '17 at 01:39
  • HW that uses MESI doesn't need to track dirty/clean on a per-byte basis, and in fact most don't. DDR1/2/3/4 DRAM transfers in 64-byte bursts, same as the cache line size. Your last paragraph seems to be describing invalidation of other copies of the cache line before you can modify. This is the essential part: just tracking dirtyness on a per-byte basis would leave you with inconsistent caches if multiple cores *did* write the same byte. **Caches being coherent means that two caches can't have conflicting data for any of the bytes in the same line, so any valid copy is safe to read.** – Peter Cordes Dec 01 '17 at 01:46
  • Thanks. I think your comments explain better how the mechanism actually works. In any case, it is still essential that it work at a byte level, or we'd never get any SMP code to work. – CHKingsley Dec 02 '17 at 03:17