
This is a typical use case for FEM/FVM equation systems, so it is perhaps of broader interest. From a triangular mesh à la

[image: a triangular mesh]

I would like to create a scipy.sparse.csr_matrix. The matrix rows/columns represent values at the nodes of the mesh. The matrix has entries on the main diagonal and wherever two nodes are connected by an edge.

Here's an MWE that first builds a node->edge->cells relationship and then builds the matrix:

import numpy
import meshzoo
from scipy import sparse

nx = 1600
ny = 1000
verts, cells = meshzoo.rectangle(0.0, 1.61, 0.0, 1.0, nx, ny)

n = len(verts)

nds = cells.T
nodes_edge_cells = numpy.stack([nds[[1, 2]], nds[[2, 0]], nds[[0, 1]]], axis=1)

# assign values to each edge (per cell)
alpha = numpy.random.rand(3, len(cells))
vals = numpy.array([
    [alpha**2, -alpha],
    [-alpha, alpha**2],
    ])

# Build I, J, V entries for COO matrix
I = []
J = []
V = []
#
V.append(vals[0][0])
V.append(vals[0][1])
V.append(vals[1][0])
V.append(vals[1][1])
#
I.append(nodes_edge_cells[0])
I.append(nodes_edge_cells[0])
I.append(nodes_edge_cells[1])
I.append(nodes_edge_cells[1])
#
J.append(nodes_edge_cells[0])
J.append(nodes_edge_cells[1])
J.append(nodes_edge_cells[0])
J.append(nodes_edge_cells[1])
# Create suitable data for coo_matrix
I = numpy.concatenate(I).flat
J = numpy.concatenate(J).flat
V = numpy.concatenate(V).flat

matrix = sparse.coo_matrix((V, (I, J)), shape=(n, n))
matrix = matrix.tocsr()

With

python -m cProfile -o profile.prof main.py
snakeviz profile.prof

one can create and view a profile of the above:

[image: snakeviz profile of the script above]

The method tocsr() takes the lion's share of the runtime here, and this remains true when the construction of alpha is more complex. Consequently, I'm looking for ways to speed this up.

What I've already found:

  • Due to the structure of the data, the values on the diagonal of the matrix can be summed up in advance, i.e.,

    V.append(vals[0, 0, 0] + vals[1, 1, 2])
    I.append(nodes_edge_cells[0, 0])  # == nodes_edge_cells[1, 2]
    J.append(nodes_edge_cells[0, 0])  # == nodes_edge_cells[1, 2]
    

    This makes I, J, V shorter and thus speeds up tocsr.

  • Right now, edges are "per cell". I could identify equal edges with each other using numpy.unique, effectively saving about half of I, J, V (see the sketch right after this list). However, I found that this too takes some time. (Not surprising.)

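A rough sketch of that edge identification (using numpy.unique on the per-cell edges from the MWE above; not what I actually timed, just to illustrate the idea):

edges = nodes_edge_cells.reshape(2, -1).T    # all per-cell edges as (a, b) node pairs
edges = numpy.sort(edges, axis=1)            # orient edges so that (a, b) == (b, a)
# interior edges now appear twice; inv maps each per-cell edge to its unique edge
unique_edges, inv = numpy.unique(edges, axis=0, return_inverse=True)
# per-edge values could then be accumulated with numpy.add.at(acc, inv, ...)
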
One other thought I had was that I could replace the diagonal V, I, J by a simple numpy.add.at if there were a csr_matrix-like data structure where the main diagonal is kept separately. I know that this exists in some other software packages, but couldn't find it in scipy. Correct?
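
A minimal sketch of what I mean, with the diagonal kept in a plain 1-D array (names as in the MWE above):

diag = numpy.zeros(n)
numpy.add.at(diag, nodes_edge_cells[0], vals[0, 0])   # the I == J == nodes_edge_cells[0] entries
numpy.add.at(diag, nodes_edge_cells[1], vals[1, 1])   # the I == J == nodes_edge_cells[1] entries
# The off-diagonal contributions vals[0][1] and vals[1][0] would still go through
# the COO route; the diagonal would be added back at the end, e.g. via
# matrix + sparse.diags(diag).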

Perhaps there's a sensible way to construct CSR directly?

Nico Schlömer
  • I'd like to see some timings for the different steps. The `tocsr` uses compiled code. I would think that any massaging of the `coo` inputs beforehand would take just as long if not more. – hpaulj Feb 20 '17 at 16:34
  • Are you sure `tocsr` is taking a long time? I've done something very similar for a 10k by 10k matrix where I, J, V had lengths well into the millions, and it didn't take all that long. Maybe 5-10 seconds. – BallpointBen Feb 20 '17 at 17:23
  • Thanks for the comments. I've added a profile to the original post. – Nico Schlömer Feb 20 '17 at 17:25
  • Interesting. I thought the `sum_duplicates` was part of the compiled `csr` construction, but your profile shows it's a Python method in `coo.py`. It uses `lexsort`, `np.nonzero` and `reduceat`. So doing your own `sum_duplicates` has promise. If your `coo` `has_canonical_format` the `tocsr` should be much faster. – hpaulj Feb 20 '17 at 17:41
  • `csr` has its own `sum_duplicates` method. That uses a compiled `sparse._sparsetools.csr_sum_duplicates`. – hpaulj Feb 20 '17 at 18:10

2 Answers


I would try creating the csr structure directly, especially if you are resorting to np.unique since this gives you sorted keys, which is half the job done.

I'm assuming you are at the point where you have i, j sorted lexicographically and overlapping v summed using np.add.at on the optional inverse output of np.unique.

Then v and j are already in csr format. All that's left to do is to create the indptr, which you get with np.searchsorted(i, np.arange(M+1)), where M is the number of rows. You can pass these directly to the sparse.csr_matrix constructor.
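
A tiny toy illustration of those two steps, with made-up numbers:

import numpy as np
from scipy import sparse

I = np.array([0, 0, 0, 2])           # row indices; (0, 1) appears twice
J = np.array([1, 0, 1, 1])           # column indices
V = np.array([1.0, 2.0, 3.0, 4.0])
M = 3                                # matrix is M x M

keys = np.stack([I, J], axis=1)
uniq, inv = np.unique(keys, axis=0, return_inverse=True)  # lexicographically sorted
vals = np.zeros(len(uniq))
np.add.at(vals, inv.ravel(), V)                           # duplicates summed: [2., 4., 4.]

rows, cols = uniq.T
indptr = np.searchsorted(rows, np.arange(M + 1))          # -> [0, 2, 2, 3]
A = sparse.csr_matrix((vals, cols, indptr), shape=(M, M))
# A.toarray() == [[2., 4., 0.], [0., 0., 0.], [0., 4., 0.]]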

Ok, let code speak:

import numpy as np
from scipy import sparse
from timeit import timeit


def tocsr(I, J, E, N):
    # pack each (row, col) pair into one int64 key: J in the low 32 bits, I in
    # the high 32 bits (assumes little-endian layout and indices that fit int32)
    n = len(I)
    K = np.empty((n,), dtype=np.int64)
    K.view(np.int32).reshape(n, 2).T[...] = J, I
    S = np.argsort(K)                                # a single argsort instead of a lexsort
    KS = K[S]
    steps = np.flatnonzero(np.r_[1, np.diff(KS)])    # start of each run of equal keys
    ED = np.add.reduceat(E[S], steps)                # sum duplicate entries
    JD, ID = KS[steps].view(np.int32).reshape(-1, 2).T   # unpack unique keys into cols, rows
    ID = np.searchsorted(ID, np.arange(N+1))             # sorted row indices -> indptr
    return sparse.csr_matrix((ED, np.array(JD, dtype=int), ID), (N, N))

def viacoo(I, J, E, N):
    return sparse.coo_matrix((E, (I, J)), (N, N)).tocsr()


#testing and timing

# correctness
N = 1000
A = np.random.random((N, N)) < 0.001
I, J = np.where(A)

E = np.random.random((2, len(I)))
D = np.zeros((2,) + A.shape)
D[:, I, J] = E
D2 = tocsr(np.r_[I, I], np.r_[J, J], E.ravel(), N).A

print('correct:', np.allclose(D.sum(axis=0), D2))

# speed
N = 100000
K = 10

I, J = np.random.randint(0, N, (2, K*N))
E = np.random.random((2 * len(I),))
I, J, E = np.r_[I, I, J, J], np.r_[J, J, I, I], np.r_[E, E]

print('N:', N, ' --  nnz (with duplicates):', len(E))
print('direct: ', timeit('f(a,b,c,d)', number=10, globals={'f': tocsr, 'a': I, 'b': J, 'c': E, 'd': N}), 'secs for 10 iterations')
print('via coo:', timeit('f(a,b,c,d)', number=10, globals={'f': viacoo, 'a': I, 'b': J, 'c': E, 'd': N}), 'secs for 10 iterations')

Prints:

correct: True
N: 100000  --  nnz (with duplicates): 4000000
direct:  7.702431229001377 secs for 10 iterations
via coo: 41.813509466010146 secs for 10 iterations

Speedup: 5x

Paul Panzer
  • I'd actually like to _avoid_ using `unique` since it's not an O(n) operation (or so I think). Of course, if `to_csr` does it anyway, I might as well do it myself. All good hints. – Nico Schlömer Feb 20 '17 at 17:30
  • I seem to remember `unique` uses `argsort`, so it's O(n log n). On the other hand CSR has sorted indices, so unless your data are in some form presorted (are they?) or you have some graph-theoretical constraints (a fixed or bounded number of edges per node, for example), you can't avoid O(n log n). I don't suppose your data come in any way grouped by node or something like that? Is the mesh as regular as it looks? Because that sort of structure could be exploited for a faster sort. – Paul Panzer Feb 20 '17 at 18:06
  • I need to be general on the mesh, so the particular structure is not something I can exploit. – Nico Schlömer Feb 20 '17 at 18:13
  • Looks like the key to speeding this up is generating the elements in a way that minimizes the `lexsort` time. If the duplicates are next to each other, `add.reduceat` can quickly sum them. – hpaulj Feb 20 '17 at 18:24
  • If you can sort elements by row (but not column) you might be able to construct the `csr` directly (with indptr etc), and use its own `sum_duplicates`. But that row sort might not be much faster than the lexsort on both rows and columns. – hpaulj Feb 20 '17 at 18:35
  • @hpaulj if `I` and `J` are `int32` `argsort` can be used as a surrogate for `lexsort`: `np.argsort(np.c_[I, J].ravel().view(np.int64))` Funnily enough that seems to be several-fold faster than `np.lexsort((J, I))` – Paul Panzer Feb 20 '17 at 19:11
  • @NicoSchlömer I've implemented a direct tocsr and it seems quite a bit faster than going through coo. Quite possibly much of the speedup is due to my avoiding `lexsort` in favour of `argsort`. Could you check it out on your actual data? I'd be interested in knowing how well it works out in the real world. – Paul Panzer Feb 20 '17 at 21:03
  • @PaulPanzer I'm getting `TypeError: Array must be contiguous. A non-contiguous array was given` from a solver further downstream. Seems like the output isn't quite identical. – Nico Schlömer Feb 20 '17 at 22:06
  • @NicoSchlömer Ok I changed that to force a copy. The only change is in the return statement. Could you try again? – Paul Panzer Feb 20 '17 at 22:14
  • I see a speedup from about 11s to 4s on a mesh with 1.2M nodes. – Great! – Nico Schlömer Feb 20 '17 at 22:19
  • @hpaulj It seems lexsort uses mergesort while argsort uses quicksort by default. When I pass `kind=mergesort`, argsort produces the same output as lexsort, and runs more than twice as slow, but still more than twice as fast as lexsort. These discrepancies are really confusing. – Paul Panzer Feb 20 '17 at 22:47
  • @PaulPanzer Some variant of your suggestion should probably go into scipy. Perhaps we can shift the discussion to [here](https://github.com/scipy/scipy/issues). – Nico Schlömer Feb 21 '17 at 11:36
  • @NicoSchlömer ok, why not. It would need some work, still, to check for endianness, int32 range etc., but it should be doable. Another possible approach would be to give lexsort the choice of different sort algo's since a stable sort is not required here (we are adding up duplicates anyway). Would you be willing to instruct me how to go about this? I'm not familiar with the procedures. – Paul Panzer Feb 21 '17 at 14:27
  • I'd say you open an issue [here](https://github.com/scipy/scipy/issues) with the suggestion for a faster `tocsr` and hear what the gurus have to say. I can provide some performance figures. – Nico Schlömer Feb 21 '17 at 14:30
  • @NicoSchlömer it's [here](https://github.com/scipy/scipy/issues/7077). I promised them you'll post your timings there. – Paul Panzer Feb 21 '17 at 15:23

So, in the end this turned out to be the difference between COO's and CSR's sum_duplicates (just like @hpaulj suspected). Thanks to the efforts of everyone involved here (particularly @paul-panzer), a PR is underway to give tocsr a tremendous speedup.

SciPy's tocsr does a lexsort on (I, J), so it helps to organize the indices in such a way that (I, J) comes out fairly sorted already.

For nx=4, ny=2 in the above example, I and J are

[1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7]
[1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7 1 6 3 5 2 7 5 5 7 4 5 6 0 2 2 0 1 2 5 5 7 4 5 6 0 2 2 0 1 2 1 6 3 5 2 7]

Sorting each row of cells first, and then the rows by their first column, as in

cells = numpy.sort(cells, axis=1)
cells = cells[cells[:, 0].argsort()]

produces

[1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6]
[1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6 1 4 2 5 3 6 5 5 5 6 7 7 0 0 1 2 2 2 5 5 5 6 7 7 0 0 1 2 2 2 1 4 2 5 3 6]

For the numbers in the original post, sorting cuts down the runtime from about 40 seconds to 8 seconds.

Perhaps an even better ordering can be achieved if the nodes are numbered more appropriately in the first place. I'm thinking of Cuthill-McKee and friends.
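
A rough sketch of how such a renumbering could look with scipy.sparse.csgraph.reverse_cuthill_mckee (not timed here; the adjacency graph is assembled once per mesh with throwaway values, using the names from the MWE in the question):

from scipy.sparse.csgraph import reverse_cuthill_mckee

# structural adjacency of the mesh; the values are irrelevant
i = nodes_edge_cells[0].reshape(-1)
j = nodes_edge_cells[1].reshape(-1)
graph = sparse.coo_matrix((numpy.ones(len(i)), (i, j)), shape=(n, n)).tocsr()

perm = reverse_cuthill_mckee(graph)   # new-to-old node ordering
inv_perm = numpy.argsort(perm)        # old-to-new
cells_rcm = inv_perm[cells]           # renumbered mesh
verts_rcm = verts[perm]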

Nico Schlömer