
I need to do roughly 1 million * 1 million computations to fill a sparse matrix. But when I use loops to fill the matrix row by row, I find it takes 6 minutes to do just a 100*100 block of computations, so the full task will never finish this way. Is there some way to speed up the process?

import numpy as np
import pandas as pd
from scipy.sparse import lil_matrix

# read the edge list in chunks and concatenate into one DataFrame
tp = pd.read_csv('F:\\SogouDownload\\train.csv', iterator=True, chunksize=1000)
data = pd.concat(tp, ignore_index=True)

# node labels are 1-based, so node k goes to row/column k-1
matrix = lil_matrix((1862220, 1862220))
for i in range(1, 1862221):
    for j in range(1, 1862221):
        # count the destination nodes that sources i and j have in common
        matrix[i-1, j-1] = np.sum(data[data['source_node'] == i].destination_node.isin(
            data[data['source_node'] == j].destination_node))
martin
  • Showing us your code would be a good start. And 10^12 computations is unconditionally too high. – Jean-François Fabre Feb 09 '17 at 15:07
  • And just how sparse are your matrices? Usually an approach using a dict in which the keys are matrix coordinates works. – jsbueno Feb 09 '17 at 15:08
  • Is this Python 2 or 3? If it's 2, you should be using `xrange` instead of `range`. – Fred Larson Feb 09 '17 at 15:10
  • Well, you could start by calculating the expression `data[data['source_node']==i]` in the outer loop. – barak manos Feb 09 '17 at 15:11
  • The background problem is to study the common friends of two people in a social network, and there are 1862220 people in this network. – martin Feb 09 '17 at 15:18
  • Not sure if you plan linear algebra involving such matrices, but they often arise in mechanical engineering. There are special codes for processing them as lists (or other structures) of (i,j,value) tuples where the value for any (i,j) not present is zero, which also avoid running loops over all possible (i,j). You need to go in search of an appropriate library! – nigel222 Feb 09 '17 at 15:35
  • Ouch - iterative definition of every element of a very large sparse matrix. In the end, how many values are nonzero? I would suggest building the dense array first, but that could hit memory errors. – hpaulj Feb 09 '17 at 15:40
  • Give us a toy example that doesn't require loading a `csv`. – hpaulj Feb 09 '17 at 15:42
  • This looks like a job for a sparse matrix multiplication, *not* for individually calculating 3.5 trillion mostly-zero matrix entries. – user2357112 Feb 09 '17 at 18:08

2 Answers


While not the fastest way of constructing a sparse matrix, this isn't horribly slow either, at least not the lil assignment step:

In [204]: N=100
In [205]: M=sparse.lil_matrix((N,N))
In [206]: for i in range(N):
     ...:     for j in range(N):
     ...:         M[i,j]=(i==j)
In [207]: M
Out[207]: 
<100x100 sparse matrix of type '<class 'numpy.float64'>'
    with 100 stored elements in LInked List format>

It saved just the nonzero values to M. I barely saw the delay during the loop.

So my guess is that most of the time is spent in the pandas indexing expression:

np.sum(data[data['source_node']==i].destination_node.isin(data[data['source_node']==j].destination_node))
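
One way to avoid that cost is to group the destination nodes by source node once, so the inner loop becomes a cheap set intersection instead of two boolean scans of the whole DataFrame. A minimal sketch of my own (not part of the original answer), assuming 1-based integer node labels as in the question and using a toy edge list:

import numpy as np
import pandas as pd
from scipy.sparse import lil_matrix

# toy edge list standing in for train.csv
data = pd.DataFrame({'source_node':      [1, 1, 2, 2, 3],
                     'destination_node': [2, 3, 3, 4, 4]})

# one pass over the DataFrame: the destination set for each source node
dest_sets = data.groupby('source_node')['destination_node'].apply(set)
nodes = dest_sets.index.values

n = int(data[['source_node', 'destination_node']].values.max())
matrix = lil_matrix((n, n))
for i in nodes:
    si = dest_sets[i]
    for j in nodes:
        common = len(si & dest_sets[j])
        if common:                       # store only the nonzero counts
            matrix[i - 1, j - 1] = common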

Converting data, often textual, into co-occurrence count sparse matrices comes up often. They are used in machine learning code, pattern searches, etc. scikit-learn is often used for this, as is tensorflow.
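
As a rough illustration of the scikit-learn route (my addition, not from the original answer; it assumes scikit-learn is installed and uses a toy corpus), `CountVectorizer` builds a sparse document-term count matrix, and a single sparse product with its transpose gives term-by-term co-occurrence counts:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana", "apple cherry", "banana cherry cherry"]   # toy corpus
X = CountVectorizer().fit_transform(docs)   # sparse (documents x terms) counts
cooc = X.T.dot(X)                           # sparse (terms x terms) co-occurrence counts
print(cooc.toarray())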


For N=1000

In [212]: %%timeit
     ...: M=sparse.lil_matrix((N,N))
     ...: for i in range(N):
     ...:       for j in range(N):
     ...:           M[i,j]=(i==j)
     ...: 
1 loop, best of 3: 7.31 s per loop

Iteratively assigning these values to a dense array is faster, even if we include the conversion to sparse at the end.

In [213]: %%timeit
     ...: M=np.zeros((N,N))
     ...: for i in range(N):
     ...:       for j in range(N):
     ...:           M[i,j]=(i==j)
     ...: 
1 loop, best of 3: 353 ms per loop

In [214]: %%timeit
     ...: M=np.zeros((N,N))
     ...: for i in range(N):
     ...:       for j in range(N):
     ...:           M[i,j]=(i==j)
     ...: M = sparse.lil_matrix(M)
     ...: 
1 loop, best of 3: 353 ms per loop

But for the very large case, creating that intermediate dense array might hit memory problems.
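
An alternative that sidesteps the dense intermediate entirely (a sketch of mine, along the lines of the (i,j,value)-triplet suggestion in the comments, and not timed here): append only the nonzero entries to plain Python lists and build a `coo_matrix` in a single call at the end.

from scipy import sparse

N = 1000
rows, cols, vals = [], [], []
for i in range(N):
    for j in range(N):
        if i == j:                  # record only the nonzero entries
            rows.append(i)
            cols.append(j)
            vals.append(1.0)

# one construction call; convert to lil only if that format is really needed
M = sparse.coo_matrix((vals, (rows, cols)), shape=(N, N)).tolil()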

hpaulj
  • Thanks for your help. In fact I can't create such a large dense array. The practical way seems to be to reduce the dimensions. And I agree that it's the indexing expression that slows the process. – martin Feb 10 '17 at 08:15

The technique to use here is sparse matrix multiplication. But for that technique you first need a binary matrix mapping source nodes to destination nodes (the node labels will be the indices of the nonzero entries).

import numpy as np
from scipy.sparse import csr_matrix

# shift the 1-based node labels down to 0-based row/column indices
I = data['source_node'] - 1
J = data['destination_node'] - 1
values = np.ones(len(data), int)
shape = (np.max(I) + 1, np.max(J) + 1)
# binary matrix with a 1 at (source, destination) for every edge
mapping = csr_matrix((values, (I, J)), shape)

The technique itself is simply a matrix multiplication of this matrix with its transpose (see also this question).

cooccurrence = mapping.dot(mapping.T)

The only potential problem is that the resulting matrix may not be very sparse and may consume all your RAM.
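
If memory does become an issue, one workaround (a sketch of mine, not part of the original answer; the block size is an arbitrary choice) is to form the product in row blocks, so only a slice of the co-occurrence matrix exists at any one time and each slice can be filtered or written out before moving on:

# process the product in row blocks of the mapping matrix built above
block = 10000
n = mapping.shape[0]
for start in range(0, n, block):
    stop = min(start + block, n)
    chunk = mapping[start:stop].dot(mapping.T)   # sparse (block x n) slice
    # ... filter or store `chunk` here, e.g. keep only counts above a threshold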

user7138814