
I'm running into a memory error issue with numpy. The following line of code seems to be the issue:

self.D_r = numpy.diag(1/numpy.sqrt(self.r))

Where self.r is a relatively small numpy array.

The interesting thing is that I monitored the memory usage, and the process took up at most 3% of the RAM on the machine. So I'm thinking something is killing the script before all the RAM is actually taken up, because it expects the process to exhaust it. If anybody has any ideas I would be very grateful.

Edit 1:

Here's the traceback:

Traceback (most recent call last):
File "/path_to_file/my_script.py", line 82, in <module>
mca_X = mca.mca(X)
File "/path_to_file/mca.py", line 54, in __init__
self.D_r = numpy.diag(1/numpy.sqrt(self.r.values))
File "/path_to_file/numpy/lib/twodim_base.py", line 302, in diag
res = zeros((n, n), v.dtype)
MemoryError

Running the script on KDD Cup 99 data (with one-hot-encoded nominal variables).

  • Could you add the exact error message? – egpbos Mar 22 '16 at 18:47
  • I think Python won't occupy the memory if it sees that you won't have enough, so the memory usage is probably not relevant. Could you give a minimal working example with data that this can be tested on? – armatita Mar 22 '16 at 18:48
  • How small is relatively small? 100 entries? 1 million entries? A trillion? I find numpy doesn't like it if you pass a 1D array of more than 100,000 elements into diag, as the result is too big. – zephyr Mar 22 '16 at 19:10
  • Break up the offending line into its individual computations, i.e. take the square root on one line, do the inversion on a second line, etc., and see which of those lines raises the error. – abcd Mar 22 '16 at 19:12
  • Is `self.r` a 1D or 2D array? – Mike Müller Mar 22 '16 at 19:13
  • Great question, @MikeMüller. That's the answer! It's a 1-d array, which means diag is making it into a 2-d array. The 1-d array was not large but the 2-d array is massive (~2TB). – Jan-Samuel Wagner Mar 22 '16 at 19:26
  • Is there a way to do this as a sparse matrix operation? – Jan-Samuel Wagner Mar 22 '16 at 19:43
  • Bear in mind that the kernel will [allocate memory lazily when you call `np.zeros`](http://stackoverflow.com/a/27582592/1461210). You probably won't see your memory usage increase until you actually start writing to elements in your empty array. – ali_m Mar 22 '16 at 21:36
  • *"Is there a way to do this as a sparse matrix operation?"* That depends on what operations you actually want to perform using `self.D_r`. You could use [`scipy.sparse.diags`](http://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.diags.html) to create a sparse diagonal matrix ([`dia_matrix`](http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.sparse.dia_matrix.html)), which supports basic arithmetic and linear algebra ops, but not direct indexing. You can read about other sparse matrix classes [here](http://docs.scipy.org/doc/scipy/reference/sparse.html). – ali_m Mar 22 '16 at 21:45

2 Answers


If the argument to np.diag() is a 1D array, it creates a 2D array, using the 1D array as the diagonal:

Signature: np.diag(v, k=0)

Parameters

 v : array_like
   If `v` is a 2-D array, return a copy of its `k`-th diagonal.
   If `v` is a 1-D array, return a 2-D array with `v` on the `k`-th
   diagonal.

This squares the number of elements: a 1D array of length n becomes an n x n matrix, so memory grows quadratically.
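For a rough sense of scale, a back-of-the-envelope check (assuming float64, i.e. 8 bytes per element; the length 500000 is a hypothetical value chosen to match the ~2TB figure from the comments):

n = 500000        # hypothetical length of the 1D array
n * 8             # 4000000 bytes: ~4 MB for the 1D array
n * n * 8         # 2000000000000 bytes: ~2 TB for the n x n result of np.diag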

Mike Müller

Even if self.r is a little 1D array, anything beyond about 50,000 elements can create a MemoryError:

In [85]: a = np.diag(np.arange(5e4))

In [86]: a.shape
Out[86]: (50000, 50000)

In [88]: a.size * a.itemsize
Out[88]: 20000000000  # 20 GB

In [87]: a = np.diag(np.arange(5.1e4))
---------------------------------------------------------------------------
MemoryError
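If D_r is only ever used to scale the rows of another matrix, broadcasting avoids allocating the n x n array entirely. A minimal sketch, where r and X are hypothetical stand-ins for the question's data:

import numpy as np

r = np.random.rand(50000)        # 1D, as in the question
X = np.random.rand(50000, 10)    # hypothetical (n, m) matrix to be scaled

d = 1 / np.sqrt(r)               # still 1D: n elements, not n*n
scaled = d[:, np.newaxis] * X    # multiplies row i of X by d[i]; same result as
                                 # np.diag(d).dot(X), without the n x n allocation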
B. M.