
I'm attempting to write a collapsed Gibbs sampler in Python and am running into memory issues when creating initial values for one of my matrices. I am rather new to Python, so below is an outline of what I am doing, with explanation. At step 4 I receive my MemoryError.

My goal is to:

  1. Create a T×M matrix of zeros (plus an alpha value), where T is some small number such as 2 to 6 and M can be very large:

    import numpy as np

    M = 500
    N = 10000
    T = 6
    alpha = .3
    # T x M count matrix, initialised at the smoothing value alpha
    NZM = np.zeros((T, M), dtype=np.float64) + alpha
    
  2. Create an M×N matrix of draws from a multinomial distribution over T topics, which would look like the following:

    Z = np.where(np.random.multinomial(1, [1. / T] * T, size=M * N) == 1)[1].reshape(M, N)
    Z

    array([[1, 3, 0, ..., 5, 3, 1],
           [3, 5, 0, ..., 5, 1, 2],
           [4, 5, 4, ..., 1, 3, 5],
           ...,
           [1, 2, 1, ..., 0, 3, 4],
           [0, 5, 2, ..., 2, 5, 0],
           [2, 3, 2, ..., 4, 1, 5]])
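
    # A lighter, equivalent draw (a sketch, assuming each entry is just a
    # uniform topic id in 0..T-1): np.random.choice avoids building the
    # large one-hot multinomial intermediate.
    Z = np.random.choice(T, size=(M, N))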
    
  3. Create a flat index out of this matrix using .reshape(M*N):

    Z_index = Z.reshape(M * N)
    Z_index
    
    array([1, 3, 0, ..., 4, 1, 5])
    
  4. This step is where I receive my error. I use Z_index to add one to a row of NZM each time that row's index shows up in Z_index. However, option 1 below is very slow, while option 2 raises the MemoryError:

    # Option 1: one column at a time -- runs, but is very slow
    for m in xrange(M):
        NZM[Z_index, m] += 1

    # Option 2: all columns at once -- raises a MemoryError
    NZM[Z_index, :] += 1

    ---------------------------------------------------------------------------
    MemoryError                               Traceback (most recent call last)
    <ipython-input-88-087ab1ede05d> in <module>()
          2 # a memory error
          3 
    ----> 4 NZM[Z_index,:] += 1
    
    
    MemoryError: 
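
A back-of-the-envelope check suggests why option 2 fails (a sketch, assuming NumPy's fancy indexing first materializes the selected rows as a temporary copy before the addition): Z_index has M*N entries, and each one selects a full row of M values.

    # Approximate size of the temporary built by NZM[Z_index, :]:
    # (M*N selected rows) x (M columns) x 8 bytes per float64
    tmp_bytes = (M * N) * M * 8
    print(tmp_bytes / float(2 ** 30))  # roughly 18.6 GiB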
    

I want to add one to a row of this array each time it shows up in the Z_index. Is there a way to do this quickly and efficiently that I am unaware of? Thank you for taking the time to read this.

Steve Bronder
  • If you have a 64 bit OS, you can increase the virtual memory. Additionally, you can look at using HDF5 with numpy, as it will operate on datasets larger than fit in memory by consecutive reads and writes to disk. – Chris Apr 22 '16 at 06:29
  • @Chris Does Python not automatically use all memory available? I have 32 GB of memory (personal Christmas splurge) available; is there something specific I need to tell Python so it uses more than some pre-allocated amount? – Steve Bronder Apr 22 '16 at 06:33
  • IIRC, if you're on a 32 bit OS, Python can only use 4 GB of memory – Greg Apr 22 '16 at 06:42
  • https://stackoverflow.com/questions/18282867/python-32-bit-memory-limits-on-64bit-windows – Greg Apr 22 '16 at 06:43
  • @Greg I believe that is correct; however, I am on a 64 bit OS, and `python -c 'import struct; print struct.calcsize("P") * 8'` reports that I am using a 64 bit Python installation – Steve Bronder Apr 22 '16 at 06:50
  • Also, you appear to have something strange with your Z_index. You pass an M×N (~5M) index to a T×M matrix (~6 rows). I don't know, but if you want to add 1 several times to these indices (hence the duplication), I would do an aggregation so you do NZM[Z_index,:] += aggregated_counts. Accessing a T×M matrix 5 million times is going to be very slow. – Chris Apr 22 '16 at 06:51
  • @Chris I'm working under the assumption that passing an array as an index like [0,4,4,2,1,1] will 'sample' from the original matrix and add one to each of those rows. I guess I was trying to abuse in-place operations? I'm rather new to Python. Option 1 will loop through each column, taking M*N rows and adding one to those rows by column. Option 2 takes M*N rows and adds one to those rows. Are these not equivalent? It's the same index, and the same thing will happen at each column, but one operates one column at a time while the other does them all at the same time. – Steve Bronder Apr 22 '16 at 07:04

1 Answer


My question is a duplicate of the question here; however, it arose from an inquiry which I think is unique and will be found more easily by people searching for an error caused by large duplicate indices.

So a simple sanity check shows that this is not doing what I thought it was doing. I assumed that, given an index containing multiple instances of the same row, += would add one to those rows for each time the row was present in the index.

import numpy as np

# 5 x 5 matrix of ones
NWZ = np.zeros((5, 5), dtype=np.float64) + 1

index = np.repeat([0, 3], [1, 3], axis=0)

index

array([0, 3, 3, 3])

NWZ[index,:] += 1

NWZ

array([[ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.]])

We can see this is not the case: giving += multiple instances of the same row only leads to the original row having one added to it. Because += performs 'in place' operations, I assumed this operation would instead return

array([[ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 4.,  4.,  4.,  4.,  4.],
       [ 1.,  1.,  1.,  1.,  1.]])

However, by using .__iadd__(1) explicitly (on a fresh matrix of ones), we see that the addition is not performed cumulatively as it iterates through the index.

NWZ = np.zeros((5, 5), dtype=np.float64) + 1  # reset to all ones
NWZ[index, :].__iadd__(1)

array([[ 2.,  2.,  2.,  2.,  2.],
       [ 2.,  2.,  2.,  2.,  2.],
       [ 2.,  2.,  2.,  2.,  2.],
       [ 2.,  2.,  2.,  2.,  2.]])

You can go here for an intuitive explanation of why this doesn't (and, the user asserts, shouldn't) happen.
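
That said, NumPy does ship an unbuffered alternative: np.add.at (available since NumPy 1.8) applies the operation once per occurrence of an index, duplicates included. A minimal sketch on the same example:

NWZ = np.zeros((5, 5), dtype=np.float64) + 1
np.add.at(NWZ, index, 1)  # row 0 gets +1 once, row 3 gets +1 three times
NWZ

array([[ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 4.,  4.,  4.,  4.,  4.],
       [ 1.,  1.,  1.,  1.,  1.]])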

An alternative solution to my problem is to first build a frequency table of the number of times each row index shows up in my duplicate index. Then, since I'm only doing addition, I add those frequencies to their corresponding rows.

from scipy.stats import itemfreq

# column 0: the distinct row indices; column 1: how often each occurs
index_counts = itemfreq(index)

N = len(index_counts[:, 1])
NWZ[index_counts[:, 0].astype(int), :] += index_counts[:, 1].reshape(N, 1)
NWZ

array([[ 2.,  2.,  2.,  2.,  2.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 1.,  1.,  1.,  1.,  1.],
       [ 4.,  4.,  4.,  4.,  4.],
       [ 1.,  1.,  1.,  1.,  1.]])
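
For reference, the same frequency table can be built with NumPy alone (a sketch using np.unique with return_counts=True, available since NumPy 1.9):

vals, counts = np.unique(index, return_counts=True)
NWZ[vals, :] += counts[:, np.newaxis]  # broadcast each count across its row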
Steve Bronder