
Currently I'm implementing this paper for my undergraduate thesis in Python, though I'm only using the Mahalanobis metric learning part (in case you're curious).

In short, I hit a problem when I need to learn a 67K×67K integer matrix, computed simply as numpy.dot(A.T, A), where A is a random vector of shape (1, 67K). Doing that throws a MemoryError, since my PC only has 8 GB of RAM and my rough calculation says the matrix needs 16 GB just to initialize. So I searched for an alternative and found dask.

So I moved on to dask and computed it with dask.array.dot(A.T, A), which worked. But then I need to apply a whitening transformation to that matrix, and in dask I can get there via the SVD. Every time I run that SVD, though, the IPython kernel dies (I assume due to lack of memory).

This is what I do so far, from initialization until the kernel dies:

import dask.array as da

fv_length = 512 * 2 * 66                                    # 67584 features
W = da.random.randint(10, 20, size=fv_length, chunks=1000)  # 1-D random integer vector
W = da.reshape(W, (1, fv_length))                           # row vector (1, fv_length)
W_T = W.T                                                   # column vector (fv_length, 1)
Wt = da.dot(W_T, W); del W, W_T                             # outer product, (fv_length, fv_length)
Wt = da.reshape(Wt, (fv_length * fv_length // 2, 2))        # integer division for the new shape
U, S, Vt = da.linalg.svd(Wt); del Wt                        # the kernel dies here

I never actually get U, S, and Vt; the kernel dies first.

Is my memory simply not enough for this kind of computation, even with dask? Or is it not a hardware problem but rather poor memory management on my part? Or something else?

At this point I'm desperate enough to try a machine with bigger specs, so I'm planning to rent a bare-metal server with 32 GB of RAM. Even if I do, will that be enough?

  • Do you need the full SVD, or are you only interested in the *N* largest singular values/vectors? – ali_m Jun 19 '16 at 20:46
  • I need the SVD because I then want to do a whitening transformation and PCA with the result. BTW, @mrocklin has convinced me that doing this on a bigger-spec machine is well worth it. Thanks anyway – yusufazishty Jun 19 '16 at 21:21
  • You can generate a rank *N* whitened matrix from the *N*-largest singular values and vectors. Depending on the size of *N*, this can be many orders of magnitude more efficient than computing the full SVD. – ali_m Jun 19 '16 at 21:25
  • Any reference or tutorial on how to do that? – yusufazishty Jun 19 '16 at 21:30
  • If `U, s, Vt = svd(X)` then the columns of `U[:, :n]` and the rows of `Vt[:n, :]` will contain orthogonal vectors. Assuming that you subtracted the mean before computing the SVD, then `U[:, :n].dot(Vt[:n])` will be a whitened version of `X`. At that point you've essentially already done PCA (see my previous answer [here](http://stackoverflow.com/a/12273032/1461210)). [`da.linalg.svd_compressed`](http://dask.pydata.org/en/latest/array-api.html?highlight=linalg#dask.array.linalg.svd_compressed) uses Halko et al's clever randomized algorithm to efficiently compute the partial SVD. – ali_m Jun 19 '16 at 21:49
  • Thanks a lot @ali_m, it helps me so much – yusufazishty Jun 20 '16 at 06:21
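
To make the suggestion in the comments above concrete, here is a minimal sketch of the rank-*N* whitening idea using [`da.linalg.svd_compressed`](http://dask.pydata.org/en/latest/array-api.html?highlight=linalg#dask.array.linalg.svd_compressed); the data shape, chunking, and number of components below are made-up placeholders, not values from the question:

```python
import dask.array as da

n_components = 50                                  # assumed rank N (placeholder)

# Hypothetical (n_samples, n_features) data matrix; shape and chunks are illustrative.
X = da.random.normal(size=(20000, 4096), chunks=(2000, 4096))
X = X - X.mean(axis=0)                             # centre the data before the SVD

# Randomized partial SVD (Halko et al.): only the n largest singular triplets.
U, s, Vt = da.linalg.svd_compressed(X, k=n_components)

# Rank-n whitened version of X, as described in the comments above.
X_white = U.dot(Vt).compute()
```

Because only *k* singular triplets are computed, the cost scales with the number of components kept rather than with the full size of the matrix.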

1 Answer


Generally speaking, dask.array does not guarantee out-of-core operation for every computation. A square matrix-matrix multiply (or any level-3 BLAS operation) is more or less impossible to do efficiently in limited memory.

You can ask Dask to use an on-disk cache for intermediate values; see the FAQ entry "My computation fills memory, how do I spill to disk?". However, this will be limited by disk-write speeds, which are generally fairly slow.
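
As a hedged illustration that is not from the original answer: another route to disk spilling is to run the computation on the distributed scheduler, whose local workers spill intermediate results to disk once they approach a configured memory limit. The worker count and memory limit below are arbitrary placeholder values:

```python
from dask.distributed import Client
import dask.array as da

# Local workers that spill excess intermediates to disk near their memory limit.
client = Client(n_workers=2, threads_per_worker=2, memory_limit="3GB")

fv_length = 512 * 2 * 66
W = da.random.randint(10, 20, size=(1, fv_length), chunks=(1, 1000))
Wt = da.dot(W.T, W)            # (67584, 67584) outer product, ~34 GiB if materialized
total = Wt.sum().compute()     # a streaming reduction is fine; the full SVD still is not
```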

A large-memory machine and plain NumPy is probably the simplest way to resolve this problem. Alternatively, you could try to find a different formulation of your problem.
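
For a rough sense of scale (back-of-the-envelope numbers, not part of the original answer), the dense 67584 × 67584 matrix alone needs about 17 GiB as int32 and about 34 GiB as int64/float64, so even a single in-memory copy strains a 32 GB machine before any SVD workspace is counted:

```python
fv_length = 512 * 2 * 66           # 67584, as in the question
n_elements = fv_length ** 2        # ~4.57e9 entries in the dense matrix

print(n_elements * 4 / 2**30)      # int32: ~17.0 GiB
print(n_elements * 8 / 2**30)      # int64 / float64: ~34.0 GiB
```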

MRocklin