7

Microsoft's documentation of Parallel.For contains the following method:

static void MultiplyMatricesParallel(double[,] matA, double[,] matB, double[,] result)
{
    int matACols = matA.GetLength(1);
    int matBCols = matB.GetLength(1);
    int matARows = matA.GetLength(0);

    // A basic matrix multiplication.
    // Parallelize the outer loop to partition the source array by rows.
    Parallel.For(0, matARows, i =>
    {
        for (int j = 0; j < matBCols; j++)
        {
            double temp = 0;
            for (int k = 0; k < matACols; k++)
            {
                temp += matA[i, k] * matB[k, j];
            }
            result[i, j] = temp;
        }
    }); // Parallel.For
}

In this method, potentially multiple threads read values from matA and matB, which were both created and initialized on the calling thread, and potentially multiple threads write values to result, which is later read by the calling thread. Within the lambda passed to Parallel.For, there is no explicit locking around the array reads and writes. Because this example comes from Microsoft, I assume it's thread-safe, but I'm trying to understand what's going on behind the scenes to make it thread-safe.

To the best of my understanding from what I've read and other questions I've asked on SO (for example this one), several memory barriers are needed to make this all work. Those are:

  1. a memory barrier on the calling thread after creating and initializing matA and matB,
  2. a memory barrier on each non-calling thread before reading values from matA and matB,
  3. a memory barrier on each non-calling thread after writing values to result, and
  4. a memory barrier on the calling thread before reading values from result.

Have I understood this correctly?

If so, does Parallel.For do all of that somehow? I went digging in the reference source but had trouble following the code. I didn't see any lock blocks or MemoryBarrier calls.
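To make my understanding concrete, here is a sketch of where I imagine the four barriers would sit if I wrote the threading by hand. (This is my own illustration, not TPL internals; the explicit `Thread.MemoryBarrier()` calls are almost certainly redundant here because `Thread.Start` and `Thread.Join` are documented to imply the needed ordering, but the sketch shows where the four points fall.)

```csharp
using System;
using System.Threading;

class BarrierSketch
{
    static double[,] matA, matB, result;

    static void Main()
    {
        matA = new double[2, 2] { { 1, 2 }, { 3, 4 } };
        matB = new double[2, 2] { { 5, 6 }, { 7, 8 } };
        result = new double[2, 2];

        Thread.MemoryBarrier();          // (1) after initializing matA/matB

        var worker = new Thread(() =>
        {
            Thread.MemoryBarrier();      // (2) before reading matA/matB
            for (int i = 0; i < 2; i++)
                for (int j = 0; j < 2; j++)
                {
                    double temp = 0;
                    for (int k = 0; k < 2; k++)
                        temp += matA[i, k] * matB[k, j];
                    result[i, j] = temp;
                }
            Thread.MemoryBarrier();      // (3) after writing result
        });
        worker.Start();                  // Start itself implies (1)
        worker.Join();                   // Join itself implies (4)

        Thread.MemoryBarrier();          // (4) before reading result
        Console.WriteLine(result[0, 0]); // 1*5 + 2*7 = 19
    }
}
```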

adv12
  • I **expect** that Parallel.For() has a memory barrier at the start and end of it; have you looked at the source code for Parallel.For()? – Ian Ringrose Jan 14 '17 at 19:58
  • @IanRingrose, yes, as mentioned and linked to in my question. I didn't find any `MemoryBarrier` calls or lock blocks. – adv12 Jan 14 '17 at 19:59
  • But what about in the methods it calls, I expect it sits on top of the task system, and that somewhere in the task system are memory barriers where task start and end. – Ian Ringrose Jan 14 '17 at 20:00
  • Yes, I expect so. Haven't fully traced the code yet. – adv12 Jan 14 '17 at 20:01
  • Implicit in using a framework method is trusting that it gets it right. This does get tested millions of times every day. Those barriers exist; you'll just never find them. They're buried in the thread pool implementation and the OS. – Hans Passant Jan 14 '17 at 22:59
  • @HansPassant I do trust that MS gets it right, but I've written threaded code that is wrong, and understanding exactly what is required to get it right helps me when I have to do it myself. More importantly, it helps me know exactly what I can count on MS to do for me versus what locking, etc., I still need to implement when using Microsoft's patterns and libraries. – adv12 Jan 14 '17 at 23:05

4 Answers

6

Since the array is already created, writing or reading from it won't cause any resizes. Also, the code itself prevents reading/writing the same position in the array.

The bottom line is that the code can always calculate the position to read and write in the array, and those accesses never cross each other. Hence, it is thread-safe.
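To make the disjointness concrete, here is a small sketch (my own example, not from the question) that runs the same row-partitioned multiplication in parallel and checks it against a sequential run. Each value of `i` is handed to exactly one lambda invocation, so only that invocation ever writes `result[i, *]`: the write sets never overlap.

```csharp
using System;
using System.Threading.Tasks;

class RowPartitionDemo
{
    static void Main()
    {
        var a = new double[3, 2] { { 1, 2 }, { 3, 4 }, { 5, 6 } };
        var b = new double[2, 3] { { 7, 8, 9 }, { 10, 11, 12 } };
        var parallel = new double[3, 3];
        var sequential = new double[3, 3];

        // Each i goes to exactly one invocation, so no two
        // threads ever touch the same cell of `parallel`.
        Parallel.For(0, 3, i =>
        {
            for (int j = 0; j < 3; j++)
                for (int k = 0; k < 2; k++)
                    parallel[i, j] += a[i, k] * b[k, j];
        });

        // Reference computation on a single thread.
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                for (int k = 0; k < 2; k++)
                    sequential[i, j] += a[i, k] * b[k, j];

        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++)
                if (parallel[i, j] != sequential[i, j])
                    throw new Exception("mismatch");
        Console.WriteLine("parallel == sequential");
    }
}
```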

VMAtm
Patrick Hofman
  • I understand that the threads don't step on each other, and I agree that this means you don't have to worry about mutual exclusivity, but I'm still confused (as ever) about memory visibility. If you read Jon Skeet's answer to the linked question, particularly the code block, you'll see that if a method does reads and writes without acquiring a lock, it doesn't necessarily see "fresh" data, and other methods, even if they *do* lock on something, don't necessarily see writes from the "bad" method. (Continued in next comment...) – adv12 Jan 14 '17 at 19:15
  • Here, the arrays are initialized on the calling thread without any sort of locking, so I'm wondering what guarantees that they can be read safely from one of the threads in the `Parallel.For`. But I may just have a very deep misunderstanding of memory visibility concerns. – adv12 Jan 14 '17 at 19:16
  • The memory read doesn't change anywhere, so it is static. There is no old and new data. – Patrick Hofman Jan 14 '17 at 19:17
  • @adv12 Since the `i` var is a parameter on the local stack, and all other iteration values are declared inside the method, everything happens locally; there can be no crossover, as Patrick Hofman says – Icepickle Jan 14 '17 at 19:17
  • @Icepickle, see my comments. I'm not worried about cross over; I'm worried about memory visibility. – adv12 Jan 14 '17 at 19:18
  • @Icepickle, yes, each cell in the array was filled by one thread or another, but as I understand it, without proper locking/memory fences/whatever, cells written by one thread are not necessarily visible to others. – adv12 Jan 14 '17 at 19:22
  • Who cares? We can assume only the calling thread will ever read the array. – Patrick Hofman Jan 14 '17 at 19:23
  • @PatrickHofman, right, and what I'm asking is, what guarantees that the calling thread sees the values written by all the parallel threads? I think, at some level, there must be locking or memory barriers going on; I want to know where and how. – adv12 Jan 14 '17 at 19:25
  • I don't think that will ever be the case. [See here](http://stackoverflow.com/a/3556877/993547). And I think it can only happen when running multiple threads. – Patrick Hofman Jan 14 '17 at 19:36
  • @adv12 The parallel library waits for all the tasks to complete, and all writes are synchronous, so the CLR guarantees that all writes are final. In scenarios where you need a memory barrier, you don't know when the operation happens; here you know exactly when. – VMAtm Jan 15 '17 at 13:06
  • I believe the memory barrier you are looking for is the pendingTask.Wait() call from System.Threading.Tasks.TaskReplicator.Replica.Wait. According to https://www.albahari.com/threading/part4.aspx that is a Full Fence and thus ensures thread safety. – andresp Jun 21 '23 at 10:52
6

The memory barriers you are looking for are inside the task scheduler.

Parallel.For breaks the work up into tasks, and a work-stealing scheduler executes those tasks. The minimal memory barriers required for a work-stealing scheduler are:

  1. A "release" fence after a task is created.
  2. An "acquire" fence when a task is stolen.
  3. A "release" fence when a stolen task completes.
  4. An "acquire" fence for the thread that is waiting on the task.

Look here for where 1 is implied by the atomic ("Interlocked") operations used to enqueue a task. Look here for where 2 is implied by atomic operations, volatile reads, and/or locks when a task is stolen.

I have not been able to track down where 3 and 4 are; they are likely implied by some kind of atomic join counter.
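As a toy model (my own sketch, not the actual TPL code) of how such a fence pair publishes data, here is a hand-rolled handoff using `Volatile.Read`/`Volatile.Write`, whose release/acquire semantics stand in for the scheduler's atomic enqueue and steal operations:

```csharp
using System;
using System.Threading;

class HandoffSketch
{
    static int payload;   // plain task state, no volatile needed
    static int ready;     // flag carrying the release/acquire pair

    static void Main()
    {
        var worker = new Thread(() =>
        {
            // Acquire side (fence 2): spin until the flag is visible...
            while (Volatile.Read(ref ready) == 0) { }
            // ...after which everything written before the release
            // (here, `payload`) is guaranteed visible too.
            Console.WriteLine(payload); // prints 42
        });
        worker.Start();

        payload = 42;                   // task state written first
        Volatile.Write(ref ready, 1);   // Release side (fence 1): publishes payload

        worker.Join();                  // completion handoff back to the caller
    }
}
```

The point of the model is only the pairing: a release write on the producing thread matched by an acquire read on the consuming thread creates the visibility guarantee, without any explicit `MemoryBarrier` call appearing in the code.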

Arch D. Robison
  • Thanks for the research effort and for answering the actual question! I already accepted Henk's answer, but you have my upvote. I'll admit it's a bit over my head, but each time I ask one of these questions I come away understanding threading a little better. – adv12 Jan 15 '17 at 19:38
  • @adv12 there is no reason why you could not switch the accepted answer, definitely not if the new answer is more appropriate, also think of future readers who might be interested in the correct answer for your question – Icepickle Jan 16 '17 at 16:54
3

Inside the threads (actually: tasks), access to matA and matB is read-only, and result is write-only.

Parallel reading is inherently thread-safe; the writing is thread-safe because the i variable is unique for each task.

There is no need for memory barriers in this piece of code (other than before/after the entire Parallel.For, but those can be assumed).

From your numbered items:
1) and 4) are implied by the Parallel.For();
2) and 3) are simply not needed.
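Concretely, the implied ordering means the caller never needs an explicit fence. The sketch below (my example) relies only on the documented guarantee that Parallel.For does not return until all iterations have completed:

```csharp
using System;
using System.Threading.Tasks;

class ImpliedFences
{
    static void Main()
    {
        var data = new double[4];
        for (int i = 0; i < 4; i++) data[i] = i + 1; // writes before the fork...

        var squares = new double[4];
        // ...are ordered before every iteration by the fork (your item 1),
        // and every iteration's writes are ordered before Parallel.For
        // returns by the join (your item 4).
        Parallel.For(0, 4, i => squares[i] = data[i] * data[i]);

        // The caller can read the results without any explicit fence.
        Console.WriteLine(string.Join(",", squares)); // 1,4,9,16
    }
}
```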

H H
  • Thanks for the answer. If (2) and (3) aren't needed, what guarantees that the reads done on a parallel thread see the initialized values (rather than some old data they happen to have in a cache line, for instance), and what guarantees that the writes done on parallel threads become visible to the calling thread (other than the implementation detail that .NET on Windows happens to treat writes as volatile)? Basically, I was under the impression that memory visibility between two threads depends on memory barriers on *both* threads, one after writing and one before reading. – adv12 Jan 14 '17 at 21:04
  • The scope of the sample is that the Parallel.For() is the only parallelism going on. If you're thinking of 'outside' threads that write to matA or matB : there is no scenario in which that could yield useful results. – H H Jan 14 '17 at 21:12
  • I don't think I'm thinking of 'outside' threads, just under the impression that the two threads (calling thread and thread pool thread in `Parallel.For`) need to work together to guarantee that the reading thread sees what was written by the writing thread. I thought the process went: writing thread writes, writing thread uses memory barrier, reading thread uses memory barrier, reading thread reads. I think you're saying that's more memory barriers than are necessary; I'm trying to understand why. – adv12 Jan 14 '17 at 21:19
  • matA and matB are filled before the .For() even starts and they never change. Why would they need a barrier or any kind of synchronization? – H H Jan 14 '17 at 21:23
  • It would be very encouraging to know that they don't. But I was under the impression that they might appear filled to the calling thread but not to the parallel threads, due to CPU caches, etc. and that locking or memory barriers were required to guarantee that the parallel threads saw what the calling thread wrote to them, even though that happened before the threading. Like I say, I'd love to be convinced that's not the case. If it is, threaded code is just way too complicated for someone of my intelligence to be writing. – adv12 Jan 14 '17 at 21:28
  • _"due to CPU caches, etc"_ - start by trusting your hardware, your OS and your Framework. If this was the case nothing would ever work. – H H Jan 15 '17 at 09:45
2

I think you're really taken with the idea of memory barriers, but I can't quite understand your concerns. Let's take a look at the code you've investigated:

  1. Three arrays are initialized and filled with values on the main thread. It's like assigning a variable a value and then calling a method: the CLR ensures that the method sees fresh values for its arguments. The only possible issue here would arise if the initialization were done in the background and/or concurrently by some other thread. In that case you're right: you would need some synchronization construct, whether a memory barrier, a lock statement, or another technique.

  2. The parallel code takes all values from 0 to matARows and creates tasks for them. You need to understand two different ways to parallelize code: by operations and by data. Here we have multiple rows with the same lambda operation applied to each. The assignments to the temp variable are not shared, so they are thread-safe and no memory barrier is needed, as there are no old and new values. Again, as in the first point, if some other thread updated the initial matrices, you would need a synchronization construct here.

  3. Parallel.For ensures that all the tasks are done (have run to completion, been cancelled, or faulted) before it proceeds to the next statement, so the code inside the loop executes as if it were a normal method. Why don't you need a barrier here? Because all the write operations target different rows and there is no intersection between them: it is data parallelism. However, as in the other cases, if another thread needed a new value from one of the loop iterations mid-run, you would still need synchronization. So this code is thread-safe because it is partitioned over the data and creates no race conditions.

This example is very simple, and a real algorithm in general needs more complicated logic. You can investigate various methods for proving code to be thread-safe without using synchronization, i.e. for writing lock-free code.
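For contrast, here is a small sketch (my own example) showing where the disjoint-write argument breaks down and synchronization really is needed:

```csharp
using System;
using System.Threading.Tasks;

class SharedAccumulatorPitfall
{
    static void Main()
    {
        // Disjoint writes (as in the question) are safe without locks:
        var perRow = new double[100];
        Parallel.For(0, 100, i => perRow[i] = i); // each cell has exactly one writer

        // A shared accumulator is NOT safe: += is a read-modify-write race.
        double racy = 0;
        Parallel.For(0, 100, i => racy += i);     // may lose updates

        // The safe pattern uses a lock (or Parallel.For's thread-local overload):
        double safe = 0;
        object gate = new object();
        Parallel.For(0, 100, i => { lock (gate) safe += i; });

        Console.WriteLine(safe);                  // always 4950
        Console.WriteLine(racy <= 4950);          // racy can be anything up to 4950
    }
}
```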

VMAtm