1

Is there a way I can allocate memory for scipy sparse matrix functions to process large datasets?

Specifically, I'm attempting to use Asymmetric Least Squares Smoothing (translated into python here and the original here) to perform a baseline correction on a large mass spec dataset (length of ~60,000).

The function (see below) uses the scipy.sparse matrix operations.

def baseline_als(y, lam, p, niter):
  L = len(y)
  D = sparse.csc_matrix(np.diff(np.eye(L), 2))
  w = np.ones(L)
  for i in xrange(niter):
    W = sparse.spdiags(w, 0, L, L)
    Z = W + lam * D.dot(D.transpose())
    z = spsolve(Z, w*y)
    w = p * (y > z) + (1-p) * (y < z)
  return z

I have no problem when I pass data sets that are 10,000 or less in length:

baseline_als(np.ones(10000),100,0.1,10)

But when passing larger data sets, e.g.

baseline_als(np.ones(50000), 100, 0.1, 10)

I get a MemoryError, for the line

  D = sparse.csc_matrix(np.diff(np.eye(L), 2))
Community
  • 1
  • 1
  • No you can't allocate memory for numpy arrays. I suspect you are getting the memory error in the `np.eye` or `np.diff` functions. Both will produced `(L,L)` shape arrays - very big ones. Try those without the `sparse` call. – hpaulj Sep 25 '15 at 02:15
  • @hpaulj You're right. The error arrises when the np.diff function is passed the large np.eye generated array: np.diff(np.eye(len(np.ones(50000)))) I'm new to python. Anyone know a work around? – Todd Duncombe Sep 25 '15 at 02:24
  • Usually when people make large sparse matrices, they try to construct them without first making the equivalent dense array. Making the dense one is convenient in small cases, but defeats many of the advantages of using sparse ones. – hpaulj Sep 25 '15 at 02:37
  • I've implemented ALS in python as well, and I found that using solve_banded sped things up significantly. I'd also be willing to share my implementation, if you want it. – perimosocordiae Sep 25 '15 at 16:51
  • @perimosocordiae I'd appreciate it if you could share the implementation. I plan on analyzing a reasonably large set data, any improvement in speed would be beneficial. – Todd Duncombe Sep 25 '15 at 17:14
  • Here's a gist with my implementation: https://gist.github.com/perimosocordiae/efabc30c4b2c9afd8a83 – perimosocordiae Sep 25 '15 at 18:04
  • @permosocordiae wow. That speeds it up exponentially. Thanks for sharing. – Todd Duncombe Sep 25 '15 at 18:29

1 Answers1

2

Try changing

D = sparse.csc_matrix(np.diff(np.eye(L), 2))

to

diag = np.ones(L - 2)
D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L-2)

D will be a sparse matrix in DIAgonal format. If it turns out that being in CSC format is important, convert it using the tocsc() method:

D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L-2).tocsc()

The following example shows that the old and new versions generate the same matrix:

In [67]: from scipy import sparse

In [68]: L = 8

Original:

In [69]: D = sparse.csc_matrix(np.diff(np.eye(L), 2))

In [70]: D.A
Out[70]: 
array([[ 1.,  0.,  0.,  0.,  0.,  0.],
       [-2.,  1.,  0.,  0.,  0.,  0.],
       [ 1., -2.,  1.,  0.,  0.,  0.],
       [ 0.,  1., -2.,  1.,  0.,  0.],
       [ 0.,  0.,  1., -2.,  1.,  0.],
       [ 0.,  0.,  0.,  1., -2.,  1.],
       [ 0.,  0.,  0.,  0.,  1., -2.],
       [ 0.,  0.,  0.,  0.,  0.,  1.]])

New version:

In [71]: diag = np.ones(L - 2)

In [72]: D = sparse.spdiags([diag, -2*diag, diag], [0, -1, -2], L, L-2)

In [73]: D.A
Out[73]: 
array([[ 1.,  0.,  0.,  0.,  0.,  0.],
       [-2.,  1.,  0.,  0.,  0.,  0.],
       [ 1., -2.,  1.,  0.,  0.,  0.],
       [ 0.,  1., -2.,  1.,  0.,  0.],
       [ 0.,  0.,  1., -2.,  1.,  0.],
       [ 0.,  0.,  0.,  1., -2.,  1.],
       [ 0.,  0.,  0.,  0.,  1., -2.],
       [ 0.,  0.,  0.,  0.,  0.,  1.]])
Warren Weckesser
  • 110,654
  • 19
  • 194
  • 214