
I have several sparse vectors represented as lists of (index, value) tuples, e.g.

[[(22357, 0.6265631775164965),
  (31265, 0.3900572375543419),
  (44744, 0.4075397480094991),
  (47751, 0.5377595092643747)],
 [(22354, 0.6265631775164965),
  (31261, 0.3900572375543419),
  (42344, 0.4075397480094991),
  (47751, 0.5377595092643747)],
...
]

My goal is to compose a scipy.sparse.csr_matrix from several million vectors like this.

I would like to ask if there exists some simple, elegant solution for this kind of conversion, without trying to stuff everything into memory.

EDIT: Just a clarification: my goal is to build a 2d matrix, where each of my sparse vectors represents one row in the matrix.

  • So is the goal to create one long vector or a 2d matrix? Are the vectors all of the same length coincidentally? –  Aug 03 '15 at 15:42
  • The goal is to create a 2d matrix, where each of my sparse vectors will represent one row. – ziky90 Aug 03 '15 at 15:43

2 Answers


Collecting the indices and data into a structured array avoids the integer-to-double conversion issue. In limited testing it is also a bit faster than the vstack approach (with list data like this, np.array is faster than np.vstack).

import numpy as np
from scipy import sparse
indptr = np.cumsum([0]+[len(i) for i in vectors])      # row pointer into the flattened data
aa = np.array(vectors, dtype='i,f').flatten()          # structured: f0 = column index, f1 = value
A = sparse.csr_matrix((aa['f1'], aa['f0'], indptr))    # (data, indices, indptr) form

I substituted a list comprehension for map since I'm using Python 3.
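
Here is a minimal, runnable version of the above, using the two example rows from the question (the shape is inferred from the largest column index; pass shape= explicitly if the real matrix has more columns):

import numpy as np
from scipy import sparse

vectors = [[(22357, 0.6265631775164965),
            (31265, 0.3900572375543419),
            (44744, 0.4075397480094991),
            (47751, 0.5377595092643747)],
           [(22354, 0.6265631775164965),
            (31261, 0.3900572375543419),
            (42344, 0.4075397480094991),
            (47751, 0.5377595092643747)]]

indptr = np.cumsum([0] + [len(v) for v in vectors])    # [0, 4, 8]
aa = np.array(vectors, dtype='i,f').flatten()          # f0 = column index (int32), f1 = value (float32)
A = sparse.csr_matrix((aa['f1'], aa['f0'], indptr))

print(A.shape)        # (2, 47752) -- columns inferred from the largest index
print(A[0, 22357])    # ~0.62656, float32 precision because of the 'f' dtype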

Indices in the coo format, (data, (i, j)), might be more intuitive:

ii = [[i]*len(v) for i, v in enumerate(vectors)]       # row number repeated for each entry in the row
ii = np.array(ii).flatten()
aa = np.array(vectors, dtype='i,f').flatten()
A2 = sparse.coo_matrix((aa['f1'], (ii, aa['f0'])))
# A2.tocsr()

Here, ii from the first step is the row number for each sublist:

[[0, 0, 0, 0],
 [1, 1, 1, 1],
 [2, 2, 2, 2],
 [3, 3, 3, 3],
 ...]

This construction method is slower than the direct csr construction with indptr.
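
As a quick sanity check (a sketch, assuming A and A2 above were built from the same vectors and the last row is non-empty, so the inferred shapes match), the two constructions should agree:

# an empty difference means the matrices are identical
assert (A != A2.tocsr()).nnz == 0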


For a case where there are differing numbers of entries per row, this approach works, using itertools.chain (from itertools import chain) to flatten the lists:

A sample list (no empty rows for now):

In [779]: vectors=[[(1, .12),(3, .234),(6,1.23)],
                   [(2,.222)],
                   [(2,.23),(1,.34)]]

row indexes:

In [780]: ii=[[i]*len(v) for i,v in enumerate(vectors)]
In [781]: ii=list(chain(*ii))

column and data values pulled from the tuples and flattened:

In [782]: jj=[j for j,_ in chain(*vectors)]    
In [783]: data=[d for _,d in chain(*vectors)]

In [784]: ii
Out[784]: [0, 0, 0, 1, 2, 2]

In [785]: jj
Out[785]: [1, 3, 6, 2, 2, 1]

In [786]: data
Out[786]: [0.12, 0.234, 1.23, 0.222, 0.23, 0.34]

In [787]: A=sparse.csr_matrix((data,(ii,jj)))  # coo style input

In [788]: A.A
Out[788]: 
array([[ 0.   ,  0.12 ,  0.   ,  0.234,  0.   ,  0.   ,  1.23 ],
       [ 0.   ,  0.   ,  0.222,  0.   ,  0.   ,  0.   ,  0.   ],
       [ 0.   ,  0.34 ,  0.23 ,  0.   ,  0.   ,  0.   ,  0.   ]])
  • Actually these two approaches fail when the vectors are not equal length. In that case there is always some sort of Python loop necessary and all bets are off as to what is the most efficient approach (I hide this fact in my answer by using `map`). –  Aug 04 '15 at 05:54
  • In that case, `itertools.chain` is a useful tool for flattening the list of lists. – hpaulj Aug 04 '15 at 06:26
  • Second solution with chain is exactly what I wanted. I have just fixed a typo (replaced `*ll` with `*vectors`). Thank you – ziky90 Aug 04 '15 at 09:01

Consider the following:

import numpy as np
from scipy.sparse import csr_matrix

vectors = [[(22357, 0.6265631775164965),
            (31265, 0.3900572375543419),
            (44744, 0.4075397480094991),
            (47751, 0.5377595092643747)],
           [(22354, 0.6265631775164965),
            (31261, 0.3900572375543419),
            (42344, 0.4075397480094991),
            (47751, 0.5377595092643747)]]

indptr = np.cumsum([0] + list(map(len, vectors)))     # row pointer (list() needed on Python 3)
indices, data = np.vstack(vectors).T                  # stack all tuples, split into two columns
A = csr_matrix((data, indices.astype(int), indptr))

Unfortunately, this way the column indices are converted from integers to doubles and back. The round trip is exact for anything short of astronomically large indices (doubles represent integers exactly up to 2**53), so this works correctly even for very large matrices, but it is not ideal.
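
To see where the conversion happens (a small check, using the vectors list defined above): np.vstack upcasts the integer indices to the common float dtype, which is why the astype(int) is needed:

stacked = np.vstack(vectors)           # shape (8, 2); the int indices are upcast to float64
print(stacked.dtype)                   # float64
indices = stacked[:, 0].astype(int)    # cast the column indices back to integers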

  • Would a `coo` style input be more intuitive? – hpaulj Aug 03 '15 at 17:55
  • @hpaulj, normally I would say yes, but in this case the input is almost in csr format already and this was the requested format. –  Aug 04 '15 at 05:55