
I am working with binary (only 0s and 1s) matrices whose row and column counts are on the order of a few thousand. For example, the number of rows is between 2000 and 7000 and the number of columns is between 4000 and 15000. My computer has more than 100 GB of RAM.

I'm surprised that even with these sizes I am getting a MemoryError with the following code. For reproducibility, I'm including an example with a smaller matrix (10 × 20). Note that both of the following raise this error:

   import numpy as np

   my_matrix = np.random.randint(2, size=(10, 20))       # binary matrix (0s and 1s)
   tr, tc = np.triu_indices(my_matrix.shape[0], 1)       # indices of all row pairs (i < j)
   ut_sums = np.sum(my_matrix[tr] * my_matrix[tc], 1)    # dot product of each row pair

   denominator = 100
   value = 1 - ut_sums.astype(float) / denominator
   np.einsum('i->', value)                               # sum over all pair values

I tried replacing the element-wise multiplication in the above code with einsum, as below, but it generates the same MemoryError:

   import numpy as np

   my_matrix = np.random.randint(2, size=(10, 20))
   tr, tc = np.triu_indices(my_matrix.shape[0], 1)
   ut_sums = np.einsum('ij,ij->i', my_matrix[tr], my_matrix[tc])   # row-pair dot products via einsum

   denominator = 100
   value = 1 - ut_sums.astype(float) / denominator
   np.einsum('i->', value)

In both cases, the printed traceback points to the line where ut_sums is calculated.

Please note that my code performs other operations too, and other statistics are calculated on matrices of similar sizes, but with more than 100 GB of RAM I thought this should not be a problem.

vpk

1 Answer


Just because your computer has 100 GB of physical memory does not mean that your operating system is willing or able to allocate such large amounts of contiguous memory. And it does have to be contiguous, because that's how NumPy arrays usually are.

You should figure out how large your output matrix is meant to be, and then try creating a similar one by itself:

arr = np.zeros((10000, 10000))

See if you're able to allocate a single array as large as you want.
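
In this code the problematic allocation is not the final result but the intermediates: my_matrix[tr] and my_matrix[tc] are dense copies of shape (len(tr), n_cols). A rough estimate for the upper end of the sizes in the question (a back-of-the-envelope sketch, assuming the default 8-byte integer dtype):

   import numpy as np

   rows, cols = 7000, 15000             # upper end of the sizes in the question
   n_pairs = rows * (rows - 1) // 2     # len(tr) from np.triu_indices(rows, 1)
   bytes_per_copy = n_pairs * cols * 8  # my_matrix[tr] is a dense copy of 8-byte ints
   print(n_pairs)                       # 24496500 row pairs
   print(bytes_per_copy / 1e12)         # ~2.9 TB for each of my_matrix[tr] and my_matrix[tc]

Each of those two fancy-indexed copies alone is far larger than 100 GB, which is why the traceback points at the line that computes ut_sums.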

John Zwinck
  • I was able to create a matrix with np.zeros((100000, 100000)) too, but a matrix of size 1000000 × 1000000 cannot be created because of a MemoryError. However, my matrix is much smaller than that. Is there something I can do to solve this? The matrix is binary (ones and zeros only) and sparse. Could I use a sparse matrix from scipy? Would the element-wise multiplication or einsum operations work well with sparse matrices? – vpk May 11 '16 at 11:11
  • Yes, if your data are sparse you should take advantage of the sparse matrix features in NumPy and friends. And if your matrix is only zeros and ones: are you using `dtype=bool`? – John Zwinck May 11 '16 at 11:41
  • No, I'm not using dtype=bool. Now that I think of it, I could use it when creating the matrix, but I haven't thought about how to calculate ut_sums with a bool matrix. Also, I just tried scipy.sparse's lil_matrix, doing element-wise multiplication as in the first code block above. But on a relatively powerful machine it ran for about 15 minutes before failing with a MemoryError. The stack trace points to the line where I create a sparse matrix from the original as my_matrix = scipy.sparse.lil_matrix(np.random.randint(2, size=(100000, 100000))). This seems a bit unreasonable to me. – vpk May 11 '16 at 11:52
  • Well if your values are 0 and 1, you really need to figure out how to switch to dtype=bool. It will save a lot of memory. Or at least switch to np.uint8. – John Zwinck May 11 '16 at 12:10
  • This is not exactly a duplicate, as this problem cannot be solved in the way the other thread suggests. After further inspection, I found that the arrays returned by np.triu_indices were very large, which could have caused the error. I had used the above method to avoid looping through rows of the original matrix for a pairwise calculation. Now I use loops again (although in a different way: it avoids triu_indices but achieves the same goal), and it's much faster and throws no error. Is this a duplicate, or should I post this solution by editing my own question? – vpk May 12 '16 at 08:57
  • @vpk: I've reopened the question so that you can post an answer to your own question. Don't edit the answer into the question--just post an answer below (which you can later accept). – John Zwinck May 12 '16 at 12:21
  • My two cents: use `ulimit` to increase the amount of memory a process can request from the kernel, cf. https://stackoverflow.com/questions/12718148/how-to-allocate-more-memory-to-a-process-in-linux. Second, @JohnZwinck, `np.zeros` may not be a good test of the maximum memory you can get from the kernel; see the answer at https://stackoverflow.com/questions/27574881/why-does-numpy-zeros-takes-up-little-space – gokul_uf Apr 14 '18 at 23:24
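
As discussed in the comments, the huge intermediates can be avoided entirely: the pairwise row dot products are exactly the off-diagonal entries of the Gram matrix my_matrix @ my_matrix.T, so only an (n_rows, n_rows) array ever needs to be materialized. Below is a minimal sketch of that idea (the sizes are illustrative, and this is not necessarily the exact rework the asker settled on):

   import numpy as np

   my_matrix = np.random.randint(2, size=(7000, 15000)).astype(np.uint8)  # example binary matrix

   # Row-pair dot products are the entries of the Gram matrix M @ M.T.
   # float32 represents integer counts up to 2**24 exactly (far above 15000 columns)
   # and lets BLAS do the work; the result is only (7000, 7000), about 190 MB.
   m = my_matrix.astype(np.float32)
   gram = m @ m.T

   tr, tc = np.triu_indices(my_matrix.shape[0], 1)
   ut_sums = gram[tr, tc]            # same values as np.sum(my_matrix[tr] * my_matrix[tc], 1)

   denominator = 100
   value = 1 - ut_sums.astype(float) / denominator
   total = np.einsum('i->', value)

If the data are stored sparse, the same Gram matrix should be computable with scipy.sparse, e.g. S = scipy.sparse.csr_matrix(my_matrix, dtype=np.int32) followed by (S @ S.T).toarray(), which likewise avoids the dense (n_pairs, n_cols) intermediates.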