8

I have the compressed sparse column (CSC) representation of an n x n lower-triangular matrix A with zeros on the main diagonal, and would like to solve for x in

(A + I)' * x = b

This is the routine I have for computing this:

void backsolve(const int*__restrict__ Lp,
               const int*__restrict__ Li,
               const double*__restrict__ Lx,
               const int n,
               double*__restrict__ x) {
  for (int i=n-1; i>=0; --i) {
      for (int j=Lp[i]; j<Lp[i+1]; ++j) {
          x[i] -= Lx[j] * x[Li[j]];
      }
  }
}

Thus, b is passed in via the argument x and is overwritten by the solution. Lp, Li, and Lx are respectively the column pointers, row indices, and nonzero values of the standard CSC representation of sparse matrices. This function is the top hotspot in the program; most of the time is spent on the line

x[i] -= Lx[j] * x[Li[j]];
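
For concreteness, here is a tiny toy example (made-up values, not my actual data) showing how a strictly lower-triangular matrix maps onto Lp, Li, Lx and how I call the routine:

// Example: n = 3, A = [0 0 0; 2 0 0; 3 4 0] (strictly lower triangular),
// so (A + I)' is the unit upper-triangular matrix [1 2 3; 0 1 4; 0 0 1].
#include <cstdio>

void backsolve(const int*, const int*, const double*, int, double*);  // as defined above

int main() {
    const int    Lp[] = {0, 2, 3, 3};    // column pointers
    const int    Li[] = {1, 2, 2};       // row indices
    const double Lx[] = {2.0, 3.0, 4.0}; // nonzero values
    double x[] = {20.0, 9.0, 2.0};       // b on entry, solution on exit
    backsolve(Lp, Li, Lx, 3, x);         // solves (A + I)' * x = b in place
    std::printf("%g %g %g\n", x[0], x[1], x[2]);  // prints: 12 1 2
}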

being the bulk of the time spent. Compiling with gcc-8.3 -O3 -mfma -mavx -mavx512f gives

backsolve(int const*, int const*, double const*, int, double*):
        lea     eax, [rcx-1]
        movsx   r11, eax
        lea     r9, [r8+r11*8]
        test    eax, eax
        js      .L9
.L5:
        movsx   rax, DWORD PTR [rdi+r11*4]
        mov     r10d, DWORD PTR [rdi+4+r11*4]
        cmp     eax, r10d
        jge     .L6
        vmovsd  xmm0, QWORD PTR [r9]
.L7:
        movsx   rcx, DWORD PTR [rsi+rax*4]
        vmovsd  xmm1, QWORD PTR [rdx+rax*8]
        add     rax, 1
        vfnmadd231sd    xmm0, xmm1, QWORD PTR [r8+rcx*8]
        vmovsd  QWORD PTR [r9], xmm0
        cmp     r10d, eax
        jg      .L7
.L6:
        sub     r11, 1
        sub     r9, 8
        test    r11d, r11d
        jns     .L5
        ret
.L9:
        ret

According to VTune,

vmovsd  QWORD PTR [r9], xmm0

is the slowest part. I have almost no experience with assembly, and am at a loss as to how to further diagnose or optimize this operation. I have tried compiling with different flags to enable/disable SSE, FMA, etc., but nothing has worked.

Processor: Xeon Skylake

Question: What can I do to optimize this function?

  • Can you make the assumption that `i >= Li[j]` for all `j` in the inner loop? – chqrlie Feb 14 '20 at 20:29
  • AVX512 includes scatter/gather instructions and conflict-detection instructions. You might do the following: gather-vectorize the loads, assuming all `Li[j]` are disjoint from `i`, check the assumption with conflict-detection instructions, check all `i`s are disjoint, compute, scatter-store the results. If any conflict is detected, fall back to the scalar implementation. – EOF Feb 14 '20 at 20:39 (a partial sketch of the gather part appears after these comments)
  • @chqrlie Unfortunately not. But we have `i < Li[j] < n`. Updated the question to mention the lower-triangular nature of A. – user2476408 Feb 14 '20 at 21:01
  • How sparse is the matrix? It might be counterproductive to use the extra indirection. – chqrlie Feb 14 '20 at 21:23
  • 0.1% nonzero elements – user2476408 Feb 14 '20 at 21:29
  • `vmovsd QWORD PTR [r9], xmm0` is an 8-byte store from a vector / FP register, using the pointer in integer register `r9`. If it's getting counts for cycles, it might actually be the FMA that produces `xmm0` being slow (because it has to wait for inputs). Perf counters in an OoO exec CPU have to pick some instruction to "blame", and often pick the instruction waiting for a result, not the one that's slow to produce it. There's nothing wrong with that asm, indirection just costs time. – Peter Cordes Feb 15 '20 at 06:33
  • @pcordes What does indirection mean in this context? – user2476408 Feb 15 '20 at 15:09
  • Think I understand now. Thanks to everyone for your answers! – user2476408 Feb 15 '20 at 15:57
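
Regarding the AVX-512 suggestion above: a rough, untested sketch of just the gather part, vectorizing the reduction within a single column (which needs no conflict detection, since i < Li[j] means none of the gathered x[Li[j]] alias x[i]). The harder cross-column vectorization with conflict detection is not shown, and with very sparse columns the 8-wide path may rarely run, so this is illustration only:

// Hypothetical helper, AVX-512F only; compile with -mavx512f.
#include <immintrin.h>

static double column_dot(const int* Li, const double* Lx,
                         const double* x, int begin, int end) {
    __m512d acc = _mm512_setzero_pd();
    int j = begin;
    for (; j + 8 <= end; j += 8) {
        __m256i idx = _mm256_loadu_si256((const __m256i*)&Li[j]); // 8 row indices
        __m512d xv  = _mm512_i32gather_pd(idx, x, 8);             // x[Li[j..j+7]]
        __m512d lv  = _mm512_loadu_pd(&Lx[j]);
        acc = _mm512_fmadd_pd(lv, xv, acc);                       // acc += Lx * x[Li]
    }
    double tmp[8];
    _mm512_storeu_pd(tmp, acc);
    double sum = tmp[0] + tmp[1] + tmp[2] + tmp[3]
               + tmp[4] + tmp[5] + tmp[6] + tmp[7];
    for (; j < end; ++j) sum += Lx[j] * x[Li[j]];                 // scalar remainder
    return sum;
}

// usage in the outer loop:  x[i] -= column_dot(Li, Lx, x, Lp[i], Lp[i+1]);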

2 Answers

3

This should depend quite a bit on the exact sparsity pattern of the matrix and the platform being used. I tested a few things with gcc 8.3.0 and compiler flags -O3 -march=native (which is -march=skylake on my CPU) on the lower triangle of this matrix of dimension 3006 with 19554 nonzero entries. Hopefully this is somewhat close to your setup, but in any case I hope these results give you an idea of where to start.

For timing I used google/benchmark with this source file. It defines benchBacksolveBaseline, which benchmarks the implementation given in the question, and benchBacksolveOptimized, which benchmarks the proposed "optimized" implementations. There is also benchFillRhs, which separately benchmarks the function used in both to generate some not completely trivial values for the right-hand side. To get the time of the "pure" backsolves, the time that benchFillRhs takes should be subtracted.
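
Just to illustrate the structure (this is not the linked source file; the tiny matrix and fillRhs here are placeholders for the real 3006x3006 test data):

#include <benchmark/benchmark.h>
#include <vector>

void backsolve(const int*, const int*, const double*, int, double*);  // routine from the question

static const int n = 3;
static const std::vector<int>    Lp = {0, 2, 3, 3};
static const std::vector<int>    Li = {1, 2, 2};
static const std::vector<double> Lx = {2.0, 3.0, 4.0};

static void fillRhs(std::vector<double>& x) {
    for (int i = 0; i < n; ++i) x[i] = 1.0 + 0.5 * i;  // placeholder right-hand side
}

static void benchFillRhs(benchmark::State& state) {
    std::vector<double> x(n);
    for (auto _ : state) {
        fillRhs(x);
        benchmark::DoNotOptimize(x.data());
    }
}
BENCHMARK(benchFillRhs);

static void benchBacksolveBaseline(benchmark::State& state) {
    std::vector<double> x(n);
    for (auto _ : state) {
        fillRhs(x);  // regenerate b every iteration, since backsolve overwrites it
        backsolve(Lp.data(), Li.data(), Lx.data(), n, x.data());
        benchmark::DoNotOptimize(x.data());
    }
}
BENCHMARK(benchBacksolveBaseline);

// benchBacksolveOptimized wraps the modified implementations below in the same way.

BENCHMARK_MAIN();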

1. Iterating strictly backwards

The outer loop in your implementation iterates through the columns backwards, while the inner loop iterates through the current column forwards. It seems more consistent to iterate through each column backwards as well:

for (int i=n-1; i>=0; --i) {
    for (int j=Lp[i+1]-1; j>=Lp[i]; --j) {
        x[i] -= Lx[j] * x[Li[j]];
    }
}

This barely changes the assembly (https://godbolt.org/z/CBZAT5), but the benchmark timings show a measurable improvement:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs                  2737 ns         2734 ns      5120000
benchBacksolveBaseline       17412 ns        17421 ns       829630
benchBacksolveOptimized      16046 ns        16040 ns       853333

I assume this is caused by a somewhat more predictable cache-access pattern, but I did not look into it much further.

2. Fewer loads/stores in the inner loop

As A is strictly lower triangular (zero diagonal), we have i < Li[j]. Therefore we know that x[Li[j]] will not change due to the changes to x[i] in the inner loop. We can put this knowledge into our implementation by using a temporary variable:

for (int i=n-1; i>=0; --i) {
    double xi_temp = x[i];
    for (int j=Lp[i+1]-1; j>=Lp[i]; --j) {
        xi_temp -= Lx[j] * x[Li[j]];
    }
    x[i] = xi_temp;
}

This makes gcc 8.3.0 move the store to memory from inside the inner loop to directly after its end (https://godbolt.org/z/vM4gPD). The benchmark for the test matrix on my system shows a small improvement:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs                  2737 ns         2740 ns      5120000
benchBacksolveBaseline       17410 ns        17418 ns       814545
benchBacksolveOptimized      15155 ns        15147 ns       887129

3. Unroll the loop

While clang already starts unrolling the loop after the first suggested code change, gcc 8.3.0 still does not. So let's give that a try by additionally passing -funroll-loops.

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs                  2733 ns         2734 ns      5120000
benchBacksolveBaseline       15079 ns        15081 ns       953191
benchBacksolveOptimized      14392 ns        14385 ns       963441

Note that the baseline also improves, as the loop in that implementation is also unrolled. Our optimized version benefits a bit from loop unrolling as well, but maybe not as much as we would have liked. Looking into the generated assembly (https://godbolt.org/z/_LJC5f), it seems like gcc might have gone a little far with an unroll factor of 8. For my setup, I can in fact do a little better with a simple manual two-way unroll. So drop the flag -funroll-loops again and implement the unrolling with something like this:

for (int i=n-1; i>=0; --i) {
    const int col_begin = Lp[i];
    const int col_end = Lp[i+1];
    const bool is_col_nnz_odd = (col_end - col_begin) & 1;
    double xi_temp = x[i];
    int j = col_end - 1;
    if (is_col_nnz_odd) {
        xi_temp -= Lx[j] * x[Li[j]];
        --j;
    }
    for (; j >= col_begin; j -= 2) {
        xi_temp -= Lx[j - 0] * x[Li[j - 0]] +
                   Lx[j - 1] * x[Li[j - 1]];
    }
    x[i] = xi_temp;
}

With that I measure:

------------------------------------------------------------------
Benchmark                        Time             CPU   Iterations
------------------------------------------------------------------
benchFillRhs                  2728 ns         2729 ns      5090909
benchBacksolveBaseline       17451 ns        17449 ns       822018
benchBacksolveOptimized      13440 ns        13443 ns      1018182
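
A further variation, not benchmarked here but in the spirit of the multiple-accumulator point raised in the comments below: splitting the reduction into two independent accumulators lets the two FMA chains overlap instead of forming a single serial dependency chain. Note that this reassociates the floating-point sum, so results may differ in the last bits:

for (int i = n - 1; i >= 0; --i) {
    const int col_begin = Lp[i];
    const int col_end = Lp[i + 1];
    double acc0 = 0.0, acc1 = 0.0;      // two independent dependency chains
    int j = col_end - 1;
    if ((col_end - col_begin) & 1) {    // peel one entry if the count is odd
        acc0 = Lx[j] * x[Li[j]];
        --j;
    }
    for (; j >= col_begin; j -= 2) {
        acc0 += Lx[j]     * x[Li[j]];
        acc1 += Lx[j - 1] * x[Li[j - 1]];
    }
    x[i] -= acc0 + acc1;
}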

Other algorithms

All of these versions still use the same simple implementation of the backward solve on the sparse matrix structure. Inherently, operating on sparse structures like this can suffer significantly from memory traffic. At least for matrix factorizations, there are more sophisticated methods that operate on dense submatrices assembled from the sparse structure; examples are supernodal and multifrontal methods. I am a bit fuzzy on this, but I think such methods also apply this idea to lower-triangular backward solves and use dense matrix operations there (for example for Cholesky-type factorizations). So it might be worth looking into those kinds of methods if you are not forced to stick with the simple approach that works on the sparse structure directly. See for example this survey by Davis.

mjacobse
  • Iterating backward: if this leads to a more sequential access pattern, hardware prefetch might become useful. Modern CPUs have pretty good memory *bandwidth* (especially for single-threaded code on a desktop/laptop), pretty terrible memory *latency*. So HW prefetch into L2 is huge, like 12 cycle latency vs. hundreds for DRAM. Most of Intel's HW prefetchers work forward or backward but at least one only works forward, so in general loop forwards over memory if either choice is equal otherwise. If not, backward is fine. – Peter Cordes Mar 28 '20 at 16:02
  • Unrolling: the other difference between GCC and clang loop unrolling is that (with `-ffast-math`) clang will use multiple accumulators. GCC will unroll but not bother to create multiple dependency chains to hide ALU latency, defeating most of the purpose for reduction loops like `xi_temp -=`. Although if improvement 2. compiled the way we'd expect, taking store/reload store-forwarding latency off the critical path, but sped up by much less than a factor of 2, it seems FP latency isn't a big bottleneck (instead memory / cache misses), or that dep chains are short enough for OoO exec to hide. – Peter Cordes Mar 28 '20 at 16:11
1

You might shave a few cycles by using unsigned instead of int for the index types, which must be >= 0 anyway:

void backsolve(const unsigned * __restrict__ Lp,
               const unsigned * __restrict__ Li,
               const double * __restrict__ Lx,
               const unsigned n,
               double * __restrict__ x) {
    for (unsigned i = n; i-- > 0; ) {
        for (unsigned j = Lp[i]; j < Lp[i + 1]; ++j) {
            x[i] -= Lx[j] * x[Li[j]];
        }
    }
}

Compiling with Godbolt's compiler explorer shows slightly different code for the inner loop, potentially making better use of the CPU pipeline. I cannot test, but you could try.

Here is the generated code for the inner loop:

.L8:
        mov     rax, rcx
.L5:
        mov     ecx, DWORD PTR [r10+rax*4]
        vmovsd  xmm1, QWORD PTR [r11+rax*8]
        vfnmadd231sd    xmm0, xmm1, QWORD PTR [r8+rcx*8]
        lea     rcx, [rax+1]
        vmovsd  QWORD PTR [r9], xmm0
        cmp     rdi, rax
        jne     .L8

chqrlie
  • Could you explain why this would be faster? For me, gcc-9.2.1 produces assembly that is mostly effectively equivalent, except exchanging sign-extending loads with register-width loads. The only timing-impact I foresee is worse cache impact. – EOF Feb 14 '20 at 20:55
  • @EOF: I came to the same conclusion. Using `unsigned` instead of `size_t` still avoids the sign extension without a cache impact, and the code is slightly different, potentially allowing for better pipeline usage. – chqrlie Feb 14 '20 at 21:01
  • I've also tried `unsigned`, but I'm not seeing *anything* that looks like better pipelining with it. To me, it looks slightly *worse* than the `int` or `size_t` code. Anyway, it also seems `gcc` is trying to waste memory by using `incq %rax` with `gcc-9.2.1 -march=skylake-avx512` for the `int` and `unsigned` cases, where `incl %rax` would save a rex-byte. I'm kinda unimpressed with gcc today. – EOF Feb 14 '20 at 21:09
  • @EOF: I write *might* and *a few cycles*... this idea is definitely not a game changer. clang seems to try and unroll the inner loop, handling 2 entries at a time, but without any parallel operations. – chqrlie Feb 14 '20 at 21:13
  • The icc-19 assembly looks quite different. I wonder if there is something we can learn from there. But I don't know enough assembly to figure it out. – user2476408 Feb 14 '20 at 21:13
  • [clang](https://godbolt.org/z/Nrjm5f) also avoids even *trying* to save a byte on `incq/incl` by just using high registers that require a rex-prefix *anyway* (except in the `unsigned` case, where it decides to use `incq %rax` *and* `decq %rbx`). I'm starting to see why software seems to grow continually... – EOF Feb 14 '20 at 21:16
  • @user2476408: both icc-19 and clang-9.00 unroll the loop, handling 2 items per iteration. – chqrlie Feb 14 '20 at 21:16
  • @user2476408 the icc-assembly is still completely scalar. I'm not seeing anything interesting here. – EOF Feb 14 '20 at 21:20
  • @EOF: yes, they unroll the loop x2 but don't vectorize the operations. – chqrlie Feb 14 '20 at 21:21