Dot product of huge arrays in numpy

Question

I have a huge array and want to compute its dot product with a small array, but I get an 'array is too big' error. Is there a workaround?

import numpy as np

eMatrix = np.random.random_integers(low=0,high=100,size=(20000000,50))
pMatrix = np.random.random_integers(low=0,high=10,size=(50,50))

a = np.dot(eMatrix,pMatrix)

Error:
/Library/Python/2.7/site-packages/numpy/random/mtrand.so in mtrand.RandomState.random_integers (numpy/random/mtrand/mtrand.c:9385)()

/Library/Python/2.7/site-packages/numpy/random/mtrand.so in mtrand.RandomState.randint (numpy/random/mtrand/mtrand.c:7051)()

ValueError: array is too big.

This happens already at eMatrix =, no? You are asking for 10^9 integers - one GB times the number of bytes per integer. So at the very least you should put them into an array of dtype int8 rather than the default int64. — mdurant, Sep 05 '14 at 14:33
So 8GB for the first ePrime, at least the same again for a, and perhaps some unseen intermediate ones too. — mdurant, Sep 05 '14 at 14:53
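To put numbers on the two comments above, here is a rough sketch of the arithmetic. The helper name dense_nbytes is made up for illustration; it only multiplies the element count by the item size of the dtype and does not allocate anything.

```python
import numpy as np

def dense_nbytes(shape, dtype):
    """Bytes needed to hold a dense array of this shape and dtype."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim
    return n_elements * np.dtype(dtype).itemsize

shape = (20000000, 50)  # the eMatrix shape from the question
for dtype in (np.int8, np.int32, np.int64):
    # Report the footprint in GB (10**9 bytes) for each candidate dtype.
    print(np.dtype(dtype).name, dense_nbytes(shape, dtype) / 1e9, "GB")
```

That works out to 1 GB for int8, 4 GB for int32 and 8 GB for int64. Note that while the inputs (0 to 100) fit in int8, the products do not: a single output entry can reach 50 * 100 * 10 = 50000, so the result needs at least a 32-bit type.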

score 3 · Answer 1 · edited May 23 '17 at 11:50

That error is raised when computing the total size of the array overflows the native int type; see here for the exact source code line.

For this to happen, even on a 64-bit machine, you are almost certainly running 32-bit versions of Python (and NumPy). You can check whether that is the case by doing:

>>> import sys
>>> sys.maxsize
2147483647 # <--- 2**31 - 1, on a 64 bit version you would get 2**63 - 1

Then again, your array is "only" 20000000 * 50 = 1000000000 elements, which is just under 2**30. If I try to reproduce your results on a 32-bit numpy, I get a MemoryError:

>>> np.random.random_integers(low=0,high=100,size=(20000000,50))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mtrand.pyx", line 1420, in mtrand.RandomState.random_integers (numpy\random\mtrand\mtrand.c:12943)
  File "mtrand.pyx", line 938, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10338)
MemoryError

unless I increase the size beyond the magic 2**31 - 1 threshold:

>>> np.random.random_integers(low=0,high=100,size=(2**30, 2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mtrand.pyx", line 1420, in mtrand.RandomState.random_integers (numpy\random\mtrand\mtrand.c:12943)
  File "mtrand.pyx", line 938, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10338)
ValueError: array is too big.
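As a quick sanity check of the two shapes tried above, the following sketch compares their element counts against 2**31 - 1. It only mirrors the arithmetic in this answer; it makes no claim about exactly which quantity NumPy checks internally, and element_count is a hypothetical helper.

```python
import math

INT32_MAX = 2**31 - 1  # the value of sys.maxsize on a 32-bit Python build

def element_count(shape):
    """Total number of elements in an array of the given shape."""
    return math.prod(shape)

for shape in [(20000000, 50), (2**30, 2)]:
    n = element_count(shape)
    # True means the count no longer fits in a signed 32-bit integer.
    print(shape, n, n > INT32_MAX)
```

The first shape stays below the threshold (10**9 elements), while the second is exactly 2**31, one past it.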

Given the difference in the line numbers in your traceback and mine, I suspect you are using an older version of NumPy. What does this output on your system:

>>> np.__version__
'1.10.0.dev-9c50f98'

Thanks for the insight! I am using numpy version 1.8.2 — Lanc, Sep 07 '14 at 10:31

score 0 · Answer 2 · answered Sep 05 '14 at 14:51

I think the only "simple" answer is to get more RAM.

It took 15GB, but I was able to do this on my MacBook.

In [1]: import numpy
In [2]: e = numpy.random.random_integers(low=0, high=100, size=(20000000, 50))
In [3]: p = numpy.random.random_integers(low=0, high=10, size=(50, 50))
In [4]: a = numpy.dot(e, p)
In [5]: a[0]
Out[5]:
array([14753, 12720, 15324, 13588, 16667, 16055, 14144, 15239, 15166,
       14293, 16786, 12358, 14880, 13846, 11950, 13836, 13393, 14679,
       15292, 15472, 15734, 12095, 14264, 12242, 12684, 11596, 15987,
       15275, 13572, 14534, 16472, 14818, 13374, 14115, 13171, 11927,
       14226, 13312, 16070, 13524, 16591, 16533, 15466, 15440, 15595,
       13164, 14278, 13692, 12415, 13314])

A possible solution might be using a sparse matrix and the sparse matrix dot operator.

For example, on my machine, constructing just e as a dense matrix used 8GB of RAM. Constructing a similar sparse matrix eprime:

In [1]: from scipy.sparse import rand
In [2]: eprime = rand(20000000, 50)

has negligible cost in terms of memory.

I believe once you do a calculation like dot, you'll have a dense matrix again. — mdurant, Sep 05 '14 at 14:52
Hey @stderr, as I mentioned above, I also tried it on a Mac with 16GB of RAM, but it fails. — Lanc, Sep 05 '14 at 14:54
Also I don't want a sparse matrix, my matrix needs to be dense — Lanc, Sep 05 '14 at 14:55
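If the matrix has to stay dense, one possible way to keep peak memory down (not something proposed in this thread, just a sketch) is to store the inputs in a small dtype and compute the product a block of rows at a time. The function name chunked_dot and the chunk_rows parameter are invented for this example, and the demo uses a much smaller array than the one in the question.

```python
import numpy as np

def chunked_dot(left, right, out, chunk_rows=1000000):
    """Compute np.dot(left, right) block by block, writing into out.

    Each block of rows is widened to out.dtype before multiplying, so
    small input dtypes such as int8 do not overflow in the product.
    """
    right_wide = right.astype(out.dtype)  # small matrix, converted once
    n_rows = left.shape[0]
    for start in range(0, n_rows, chunk_rows):
        stop = min(start + chunk_rows, n_rows)
        block = left[start:stop].astype(out.dtype)  # temporary, one block only
        out[start:stop] = np.dot(block, right_wide)
    return out

# Small demo with the same value ranges as the question.
rng = np.random.RandomState(0)
e = rng.randint(0, 101, size=(2000, 50)).astype(np.int8)
p = rng.randint(0, 11, size=(50, 50)).astype(np.int8)
a = np.empty((e.shape[0], p.shape[1]), dtype=np.int32)
chunked_dot(e, p, a, chunk_rows=300)
```

With int8 inputs and an int32 result, the full-size problem would need roughly 1 GB for e and 4 GB for a, plus one temporary block. If even the result does not fit in RAM, out could in principle be a disk-backed array such as one created with np.memmap, though that is untested here.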

score 0 · Answer 3 · answered Sep 05 '14 at 15:13

I believe the answer is that you do not have enough RAM, and possibly that you are running a 32-bit version of Python. It would help to clarify which OS you are running; many OSes will run both 32-bit and 64-bit programs.

beiller

How do I check if I am running 32 bit version of python? – Lanc Sep 05 '14 at 15:34
As stated above see here on how to determine if you are running 64 bit or 32 bit python executable: http://stackoverflow.com/questions/1405913/how-do-i-determine-if-my-python-shell-is-executing-in-32bit-or-64bit-mode-on-os – beiller Sep 05 '14 at 17:44
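For reference, a minimal sketch of two common ways to check this from inside the interpreter, using only the standard library. The function name python_bits is made up for this example.

```python
import struct
import sys

def python_bits():
    """Pointer width of the running interpreter, in bits."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8

# sys.maxsize exceeds 2**32 only on a 64-bit build.
print(python_bits(), sys.maxsize > 2**32)
```

Both checks describe the Python executable itself, not the operating system or the hardware, so a 32-bit Python on a 64-bit OS still reports 32.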
