While this isn't the fastest way of constructing a sparse matrix, it isn't horribly slow either, at least not the lil assignment step:
In [204]: N=100
In [205]: M=sparse.lil_matrix((N,N))
In [206]: for i in range(N):
...:     for j in range(N):
...:         M[i,j]=(i==j)
In [207]: M
Out[207]:
<100x100 sparse matrix of type '<class 'numpy.float64'>'
with 100 stored elements in Linked List format>
It saved just the nonzero values to M. I barely noticed any delay during the loop.
So my guess is that most of the time is spent in the pandas indexing expression:
np.sum(data[data['source_node']==i].destination_node.isin(data[data['source_node']==j].destination_node))
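That per-pair pandas lookup can usually be replaced by a single sparse matrix product: build an incidence matrix `A` with a 1 at `(source, destination)` for each edge, and `A @ A.T` then gives the number of shared destinations for every source pair at once. A sketch, using a made-up edge list in place of `data` (and assuming no duplicate edges):

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Hypothetical edge list standing in for `data`
data = pd.DataFrame({
    'source_node':      [0, 0, 1, 1, 2],
    'destination_node': [5, 6, 6, 7, 5],
})

# Incidence matrix: A[i, d] = 1 if source i has an edge to destination d
rows = data['source_node'].to_numpy()
cols = data['destination_node'].to_numpy()
vals = np.ones(len(data), dtype=np.int64)
A = sparse.coo_matrix((vals, (rows, cols))).tocsr()

# M[i, j] = number of destinations shared by sources i and j,
# computed for all (i, j) pairs in one sparse product
M = A @ A.T
print(M.toarray())
```

With duplicate edges the product counts multiplicities, unlike the `isin` version, so deduplicate first if that matters.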
Converting data, often textual, into sparse co-occurrence count matrices comes up often. They are used in machine-learning code, pattern searches, etc. scikit-learn is often used for this; so is tensorflow.
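For the textual case, scikit-learn's `CountVectorizer` does exactly this kind of conversion, returning a scipy.sparse count matrix directly. A minimal sketch with made-up documents:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "sparse matrices are useful",
    "sparse matrices in scikit-learn",
    "tensorflow also uses sparse tensors",
]

vec = CountVectorizer()
X = vec.fit_transform(docs)   # scipy.sparse CSR matrix of token counts

# one row per document, one column per distinct token
print(X.shape)
```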
For N=1000:
In [212]: %%timeit
...: M=sparse.lil_matrix((N,N))
...: for i in range(N):
...:     for j in range(N):
...:         M[i,j]=(i==j)
...:
1 loop, best of 3: 7.31 s per loop
Iteratively assigning these values to a dense array is faster, even if we include the conversion to sparse at the end.
In [213]: %%timeit
...: M=np.zeros((N,N))
...: for i in range(N):
...:     for j in range(N):
...:         M[i,j]=(i==j)
...:
1 loop, best of 3: 353 ms per loop
In [214]: %%timeit
...: M=np.zeros((N,N))
...: for i in range(N):
...:     for j in range(N):
...:         M[i,j]=(i==j)
...: M = sparse.lil_matrix(M)
...:
1 loop, best of 3: 353 ms per loop
But for the very large case, creating that intermediate dense array might hit memory problems.
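One way around that is to skip the dense intermediate entirely: collect `(row, col, value)` triplets and hand them to `coo_matrix` in one call, which is the construction pattern the scipy docs suggest for building sparse matrices incrementally. A sketch for the same diagonal example:

```python
import numpy as np
from scipy import sparse

N = 1000

# Collect the coordinates and values of the nonzero entries only,
# then build the sparse matrix in a single call.
rows, cols, vals = [], [], []
for i in range(N):
    rows.append(i)
    cols.append(i)
    vals.append(1.0)

M = sparse.coo_matrix((vals, (rows, cols)), shape=(N, N))
print(M.nnz)   # -> 1000
```

Memory here scales with the number of nonzeros rather than with N*N, and the COO matrix converts cheaply to CSR/CSC for arithmetic.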