
I wrote this SOR solver code. Don't bother too much about what this algorithm does; it is not the concern here. But just for the sake of completeness: it may solve a linear system of equations, depending on how well conditioned the system is.

I run it with an ill-conditioned sparse matrix of 2097152 rows (which never converges), with at most 7 non-zero columns per row.

Translating: the outer do-while loop will perform 10000 iterations (the value I pass as max_iters), the middle for will perform 2097152 iterations, split into chunks of work_line, divided among the OpenMP threads. The innermost for loop will have 7 iterations, except in very few cases (less than 1%) where it can be fewer.

There is a data dependency among the threads in the values of the sol array. Each iteration of the middle for updates one element but reads up to 6 other elements of the array. Since SOR is not an exact algorithm, when reading, a thread can get either a previous or the current value at that position (if you are familiar with solvers, this is a Gauss–Seidel that tolerates Jacobi behavior in some places for the sake of parallelism).
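For reference, the update that the loops below apply to each unknown is the usual SOR relaxation step (a sketch, with ω = sor_omega and a_ij the matrix entries):

$$x_i \leftarrow (1 - \omega)\,x_i + \frac{\omega}{a_{ii}}\Big(b_i - \sum_{j \neq i} a_{ij}\,x_j\Big)$$

except that, because of the concurrent updates, each x_j read on the right-hand side may hold either its old or its already-updated value.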

typedef struct {
    size_t size;

    unsigned int *col_buffer;  // column index of each stored element
    unsigned int *row_jumper;  // offset of the start of each row in col_buffer/elements (CSR)
    real *elements;            // the non-zero values themselves
} Mat;

int work_line;

// Assumes there are no null elements on main diagonal
unsigned int solve(const Mat* matrix, const real *rhs, real *sol, real sor_omega, unsigned int max_iters, real tolerance)
{
    real *coefs = matrix->elements;
    unsigned int *cols = matrix->col_buffer;
    unsigned int *rows = matrix->row_jumper;
    int size = matrix->size;
    real compl_omega = 1.0 - sor_omega;
    unsigned int count = 0;
    bool done;

    do {
        done = true;
        #pragma omp parallel shared(done)
        {
            bool tdone = true;

            #pragma omp for nowait schedule(dynamic, work_line)
            for(int i = 0; i < size; ++i) {
                real new_val = rhs[i];
                real diagonal;
                real residual;
                unsigned int end = rows[i+1];

                for(int j = rows[i]; j < end; ++j) {
                    unsigned int col = cols[j];
                    if(col != i) {
                        real tmp;
                        #pragma omp atomic read
                        tmp = sol[col];

                        new_val -= coefs[j] * tmp;
                    } else {
                        diagonal = coefs[j];
                    }
                }

                residual = fabs(new_val - diagonal * sol[i]);
                if(residual > tolerance) {
                    tdone = false;
                }

                new_val = sor_omega * new_val / diagonal + compl_omega * sol[i];
                #pragma omp atomic write
                sol[i] = new_val;
            }

            #pragma omp atomic update
            done &= tdone;
        }
    } while(++count < max_iters && !done);

    return count;
}
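Just to make the data layout concrete, here is a small usage sketch on a hypothetical 3x3 system (the names and values are made up for illustration; it assumes real is typedef'd to double somewhere above, and that row_jumper/col_buffer/elements follow the usual CSR convention):

/* Hypothetical 3x3 system stored in CSR form:
 *   [ 4 -1  0 ]   [x0]   [1]
 *   [-1  4 -1 ] * [x1] = [2]
 *   [ 0 -1  4 ]   [x2]   [3]
 */
unsigned int cols[] = {0, 1,   0, 1, 2,   1, 2};  // column of each stored element
unsigned int rows[] = {0, 2, 5, 7};               // row i spans elements rows[i] .. rows[i+1]-1
real vals[] = {4, -1,   -1, 4, -1,   -1, 4};
real rhs[]  = {1, 2, 3};
real sol[]  = {0, 0, 0};                          // initial guess

Mat m = {3, cols, rows, vals};
work_line = 8;                                    // chunk size for the OpenMP schedule
unsigned int iters = solve(&m, rhs, sol, 1.0, 10000, 1e-9);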

As you can see, there is no lock inside the parallel region, so, from what they always teach us, it is the kind of 100% parallel problem. That is not what I see in practice.

All my tests were run on an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 2 processors, 10 cores each, with hyper-threading enabled, for a total of 40 logical cores.

On my first set of runs, work_line was fixed at 2048, and the number of threads varied from 1 to 40 (40 runs in total). This is the graph with the execution time of each run (seconds x number of threads):

Time x Threads, work line: 2048

The surprise was the logarithmic curve, so I thought that, since the work line was so large, the shared caches were not being used very well, so I dug up the virtual file /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size, which told me this processor's L1 cache synchronizes updates in groups of 64 bytes (8 doubles in the array sol). So I set work_line to 8:

Time x Threads, work line: 8
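(As an aside, the same cache line size can also be queried programmatically; a small sketch, assuming Linux with glibc, where sysconf() exposes it:)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* L1 data cache line size in bytes; 64 on this machine, i.e. 8 doubles of sol */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    printf("L1 line: %ld bytes = %ld doubles\n", line, line / (long)sizeof(double));
    return 0;
}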

Then I thought 8 was too low to avoid NUMA stalls and set work_line to 16:

Time x Threads, work line: 16

While running the above, I thought "Who am I to predict what work_line is good? Let's just see...", and scheduled to run every work_line from 8 to 2048, in steps of 8 (i.e. every multiple of the cache line, from 1 to 256). The results for 20 and 40 threads (seconds x size of the split of the middle for loop, divided among the threads):

Time x Work Line, threads: 40, 20

I believe the cases with low work_line suffer badly from cache synchronization, while a bigger work_line offers no benefit beyond a certain number of threads (I assume because the memory pathway is the bottleneck). It is very sad that a problem that seems 100% parallel presents such bad behavior on a real machine. So, before I am convinced multi-core systems are a very well sold lie, I am asking you here first:

How can I make this code scale linearly with the number of cores? What am I missing? Is there something in the problem that makes it not as good as it seems at first?

Update

Following suggestions, I tested with both static and dynamic scheduling, but with the atomic read/write on the array sol removed. For reference, the blue and orange lines are the same as in the previous graph (just up to work_line = 248). The yellow and green lines are the new ones. From what I could see: static makes a significant difference for low work_line, but after 96 the benefits of dynamic outweigh its overhead, making it faster. The atomic operations make no difference at all.

Time x Work Line, no atomic vs atomic, dynamic vs static
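Concretely, the only changes were the scheduling clause and the removal of the two atomic directives; a minimal sketch of the static, non-atomic variant (everything else in solve() unchanged):

#pragma omp for nowait schedule(static, work_line)  // instead of schedule(dynamic, work_line)
for(int i = 0; i < size; ++i) {
    // ... same body as before, but with plain accesses to sol:
    //         tmp = sol[col];      // no "#pragma omp atomic read"
    //         sol[i] = new_val;    // no "#pragma omp atomic write"
}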

lvella
  • I'm not so familiar with the SOR/Gauss–Seidel method but with matrix multiplication or with Cholesky Decomposition the only way you're going to get good scaling is by using loop tiling in order to reuse data while it's still in the cache. See http://stackoverflow.com/questions/22479258/cholesky-decomposition-with-openmp. Otherwise it's memory bound. – Z boson Oct 06 '14 at 19:50
  • While I'm not familiar with the algorithm, a quick glance at that inner loop suggests that you probably have some very poor spatial memory locality (as is typically the case for sparse linear algebra). In that case, you're probably limited by memory access. – Mysticial Oct 06 '14 at 19:58
  • What's the time complexity of SOR? http://www.cs.berkeley.edu/~demmel/cs267/lecture24/lecture24.html#link_4 O(N^3/2)? With matrix multiplication the computations go as N^3 whereas the reads go as N^2, so that's why it can scale well. So unless the number of computations is much larger than the number of reads, it will be memory bound. Many basic algorithms appear to scale well if you ignore the fact that the cores are fast and the main memory is slow. BLAS level 2 (e.g. matrix*vec) would scale well ignoring slow memory. It's only BLAS level 3 (O(N^3), e.g. GEMM, Cholesky, ...) that scales well with slow memory. – Z boson Oct 06 '14 at 20:07
  • This specific case: O(N*7*10000) = O(N), where N is the side of the square matrix. This is so because the matrix is sparse. – lvella Oct 06 '14 at 20:10
  • Out of curiosity, what happens if you use `schedule(static)`? – Z boson Oct 06 '14 at 20:13
  • About the cache locality, the matrix came from a uniform finite difference discretization of a 3D partial differential equation. In the middle loop, the thread has to access 5 different "places" of the array `sol`. 1st: `sol[i-1]`, `sol[i]` and `sol[i+1]`; 2nd: `sol[i-128]`; 3rd: `sol[i+128]`; 4th: `sol[i-1024]`; 5th: `sol[i+1024]`. So, it is not a mess of scatter-gather memory access. – lvella Oct 06 '14 at 20:18
  • I think Mark Lakata's suggestion is good. Run a profiler and find out if you're memory bound. Then ask yourself if it's possible to improve the caching, for example with loop tiling. Another suggestion is to do this with the Intel MKL. If MKL does not scale, then that's a good indication that it can't be done. – Z boson Oct 06 '14 at 22:02
  • This is a really nicely written question I think, and I shudder to imagine how awesome the answer could be. +1 for a good question, and here's looking forward to a potentially great explanation :) – BrianH Oct 07 '14 at 02:14
  • Realize that the hyper-threads do not offer additional parallelism. They offer a very fast context switch, which is useful if your threads block on system calls. But, your threads aren't making any system calls. – jxh Oct 07 '14 at 04:09
  • I've heard they also offer parallelism on the superscalar pipelines, so if both threads have low instruction level parallelism, they can run simultaneously on different instruction pipelines. – lvella Oct 07 '14 at 13:31
  • But that is useless here; you should employ some loop tiling. – Vladimir F Героям слава Oct 07 '14 at 17:37
  • Isn't the `work_line` thing a kind of loop tiling? – lvella Oct 07 '14 at 21:01
  • The default topology on Linux with Intel is scattered. This means in your case even threads correspond to one node and odd threads to another. I think if you tried `export GOMP_CPU_AFFINITY="0 2 4 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62"` and `export OMP_NUM_THREADS=20` your code would run on one node (one socket). – Z boson Oct 08 '14 at 08:41
  • @Zboson, that's `export GOMP_CPU_AFFINITY="0-62:2"` for short. As for the topology, core numbering is set by the BIOS and the Linux kernel finds it by parsing the corresponding MP ACPI table(s) (MADT? I won't bet on it though). Most of our dual-socket Intel machines from Bull have cores in a single package numbered consecutively. – Hristo Iliev Oct 10 '14 at 07:15
  • @HristoIliev, I was not aware that the topology is set by the BIOS. I'm glad you let me know that some of your dual-socket machines don't have a scattered topology. This means I can't just assume the topology is scattered. I have to ask what the topology is for each machine. I wish OpenMP had a function to find the topology so I did not have to use an external tool to find the topology to bind the threads. – Z boson Oct 10 '14 at 07:24
  • OpenMP 4.0 has the notion of _places_ which abstracts the system topology, but no API is exposed that allows one to query them. Therefore, the `hwloc` library is your best friend in the pursuit of platform-independent topology discovery. – Hristo Iliev Oct 10 '14 at 08:34

5 Answers

6

The sparse matrix-vector multiplication is memory bound (see here) and it could be shown with a simple roofline model. Memory bound problems benefit from the higher memory bandwidth of multisocket NUMA systems, but only if the data initialisation is done in such a way that the data is distributed among the two NUMA domains. I have some reasons to believe that you are loading the matrix in serial and therefore all its memory is allocated on a single NUMA node. In that case you won't benefit from the double memory bandwidth available on a dual-socket system and it really doesn't matter if you use schedule(dynamic) or schedule(static). What you could do is enable the memory interleaving NUMA policy in order to have the memory allocation spread among both NUMA nodes. Thus each thread would end up with 50% local memory access and 50% remote memory access, instead of having all threads on the second CPU being hit by 100% remote memory access. The easiest way to enable the policy is by using numactl:

$ OMP_NUM_THREADS=... OMP_PROC_BIND=1 numactl --interleave=all ./program ...

OMP_PROC_BIND=1 enables thread pinning and should improve the performance a bit.
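An alternative to interleaving (not shown in the question, so just a sketch of the idea) is to rely on Linux's first-touch page placement: if the arrays are initialised in parallel, with the same static chunking that the solver threads will later use, each page ends up on the NUMA node of the thread that owns it. Note that this pairs naturally with schedule(static, ...); with dynamic scheduling the chunk-to-thread mapping changes every sweep, which is why interleaving is the simpler fix here.

/* First-touch initialisation sketch: each thread writes the chunks of
 * sol and rhs it would later own under schedule(static, work_line),
 * so those pages are allocated on that thread's NUMA node.
 * The matrix arrays (elements, col_buffer, row_jumper) should be
 * initialised the same way for full effect. */
#pragma omp parallel for schedule(static, work_line)
for(int i = 0; i < size; ++i) {
    sol[i] = 0.0;   /* initial guess */
    rhs[i] = 0.0;   /* overwritten with the real right-hand side afterwards */
}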

I would also like to point out that this:

done = true;
#pragma omp parallel shared(done)
{
    bool tdone = true;

    // ...

    #pragma omp atomic update
    done &= tdone;
}

is probably a not very efficient re-implementation of:

done = true;
#pragma omp parallel reduction(&:done)
{
    // ...
        if(residual > tolerance) {
            done = false;
        }
    // ...
}

There won't be a notable performance difference between the two implementations because of the amount of work done in the inner loop, but it is still not a good idea to reimplement existing OpenMP primitives, for the sake of portability and readability.

Hristo Iliev
  • Thanks for the tip. I am just learning OpenMP and had trouble understanding the reduction thing. – lvella Oct 08 '14 at 19:43
  • The `numactl` thing made a huge difference. I will take some time later to use libnuma to split the work properly between NUMA sockets and set the threads' affinity accordingly. – lvella Oct 08 '14 at 19:54
  • @lvella, could you please update your question again with the results after using `numactl`? I am very curious to see the results. – Z boson Oct 09 '14 at 06:00
5

Try running the IPCM (Intel Performance Counter Monitor). You can watch memory bandwidth, and see if it maxes out with more cores. My gut feeling is that you are memory bandwidth limited.

As a quick back-of-the-envelope calculation, I find that uncached read bandwidth is about 10 GB/s on a Xeon. If your clock is 2.5 GHz, that's one 32-bit word per clock cycle. Your inner loop is basically just a multiply-add operation whose cycles you can count on one hand, plus a few cycles for the loop overhead. It doesn't surprise me that after 10 threads, you don't get any performance gain.
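To put rough numbers on it (a sketch only, assuming real is double and ignoring any reuse of sol from the caches): each inner-loop iteration streams one 8-byte coefficient and one 4-byte column index and touches one 8-byte entry of sol, i.e. roughly 20 bytes of traffic for one multiply and one subtract.

10 GB/s ÷ ~20 B per inner iteration ≈ 5·10^8 inner iterations per second
5·10^8 iterations/s × 2 flops ≈ 1 Gflop/s

A single core can issue far more arithmetic than that, so once the memory bus is saturated, extra cores only add contention.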

Mark Lakata
  • I am convincing the sysadmin to allow me to have r/w permission on `/dev/cpu/*/msr`... – lvella Oct 07 '14 at 14:13
  • This algorithm is actually well known to be memory bandwidth limited. – Vladimir F Героям слава Oct 07 '14 at 17:34
  • Not to mention that the potential cache miss on `sol[col]` can only make things worse. That probably doesn't really matter to the CPU if all the cores are already stalling on memory. But from the bandwidth perspective, such a cache miss will eat up a cacheline of bandwidth. – Mysticial Oct 07 '14 at 18:07
  • @VladimirF, I don't doubt that the OPs implementation of this algorithm is memory bandwidth limited but do you have a source for your statement that the algorithm is memory bound in general? There is some discussion of a parallel version at http://www.cs.berkeley.edu/~demmel/cs267/lecture24/lecture24.html#link_4. My first implementation of Cholesky decomposition did not scale well but after a lot of thinking I got it to scale well. – Z boson Oct 07 '14 at 18:30
  • Well, I know it only from very sparse matrices, but there is a lot of theory and literature about loop tiling and other tricks for G.-S. and SOR to improve the cache reuse. They are used because of the memory bandwidth limit. – Vladimir F Героям слава Oct 07 '14 at 18:34
  • I won't be able to run it on the same Xeon machine, but on my desktop machine the L3 cache miss rate is 32%, and the memory read rate is 15.41 GBytes / 3111 Mticks. How does Mticks translate to seconds? Are those CPU clock ticks? – lvella Oct 07 '14 at 21:28
  • I'm not sure, but I believe that is Mega clock ticks. That would give you 12 GB/s, which is about the limit of memory.... so you are definitely memory bandwidth limited. – Mark Lakata Oct 08 '14 at 01:49
  • @lvella, It might be interesting to show your results for a single socket machine. My own code, https://stackoverflow.com/questions/25055604/matrix-multiplication-poor-efficiency-on-a-4-socket-numa-system, on a NUMA system did not scale nearly as well as I hoped the first time I tried it. Apparently you have to tune your code for NUMA and can't assume that just because your code scales well on a single socket system that it will scale well on a multi-socket system. – Z boson Oct 08 '14 at 08:23
  • @MarkLakata, his system is a two socket system so the bandwidth should be double of that of a single socket system. – Z boson Oct 08 '14 at 09:18
  • I take this answer. The memory bandwidth is the limit. My attempt to improve shared cache reuse by making the threads work closer together made the problem worse, probably because of cache invalidation caused by the other threads' writes (other tests also show that a bigger `work_line` is much better for solver convergence). I suppose the code can be made better by coding for NUMA (splitting the jobs and memory accesses by processor socket), but I won't tackle this problem for now. – lvella Oct 08 '14 at 17:24
2

Your inner loop has an omp atomic read, and your middle loop has an omp atomic write to a location that could be the same one read by one of the reads. OpenMP is obligated to ensure that atomic writes and reads of the same location are serialized, so in fact it probably does need to introduce a lock, even though there isn't any explicit one.

It might even need to lock the whole sol array unless it can somehow figure out which reads might conflict with which writes, and really, OpenMP processors aren't necessarily all that smart.

No code scales absolutely linearly, but rest assured that there are many codes that do scale much closer to linearly than yours does.

John Bollinger
  • I don't think there is a real software lock there. I haven't looked at the assembly, but they are most likely atomic read/write available on instruction level. Anyway, I'll rerun a sparser version of case 3 without atomic read/write. For bigger `work_line`, it makes no difference (I ran a test on a different machine with 4 threads) and it makes sense because a clash is very unlikely. For smaller `work_line`, it may be relevant. See this: https://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html – lvella Oct 06 '14 at 19:48
  • `atomic read` and `atomic write` on x86 are implemented using the `lock` instruction prefix, i.e. no heavy-weight software locks. – Hristo Iliev Oct 08 '14 at 06:10
2

I suspect you are having caching issues. When one thread updates a value in the sol array, it invalidates the caches on other CPUs that are storing that same cache line. This forces the caches to be updated, which then leads to the CPUs stalling.

dohashi
1

Even if you don't have an explicit mutex lock in your code, you have one shared resource between your processes: the memory and its bus. You don't see this in your code because it is the hardware that takes care of handling all the different requests from the CPUs, but nevertheless, it is a shared resource.

So, whenever one of your processes writes to memory, that memory location will have to be reloaded from main memory by all other processes that use it, and they all have to use the same memory bus to do so. The memory bus saturates, and you have no more performance gain from additional CPU cores that only serve to worsen the situation.

cmaster - reinstate monica