6

X is an n x p matrix where p is much larger than n. Let's say n = 1000 and p = 500000. When I run:

import numpy as np

X = np.random.randn(1000, 500000)
S = X.dot(X.T)

This operation uses a great deal of memory while it runs, even though the result is only 1000 x 1000; memory use drops back down once the operation finishes. Is there any way around this?

bluecat
  • Interesting. Aside from the space for `X` and `S` this should be doable in very little, constant space: take a transposed view of `X`, calculate and allocate the space for `S`, and calculate each element of `S` directly. And from the little I know about NumPy, something similar should fall out automatically from this composition of operations... – Jan 18 '14 at 18:56
  • I am not sure of the details, but whenever possible numpy tries to use BLAS routines, which I think would be [`sgemm`](http://www.math.utah.edu/software/lapack/lapack-blas/sgemm.html) for matrix multiplication. My bet is that these highly optimized routines require contiguous data (probably in Fortran order), and thus a copy has to be made in your case. – Jaime Jan 18 '14 at 23:16
  • Use numpy >= 1.8; that will help with many such cases. – seberg Jan 18 '14 at 23:59

1 Answer

7

The issue is not that X and X.T are views of the same memory space per se, but rather that X.T is F-contiguous, not C-contiguous. Of course, this must necessarily be true for at least one of the input arrays whenever you multiply an array with a view of its transpose.
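You can see this directly from the arrays' `flags`: transposing a 2D array only swaps its strides, so the transpose of a C-contiguous array is an F-contiguous view of the same data. A small illustration:

import numpy as np

A = np.random.randn(3, 5)
print(A.flags['C_CONTIGUOUS'], A.flags['F_CONTIGUOUS'])
# True False
print(A.T.flags['C_CONTIGUOUS'], A.T.flags['F_CONTIGUOUS'])
# False True
print(A.T.base is A)  # A.T is a view sharing A's data, not a copy
# True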

In numpy < 1.8, np.dot will create a C-ordered copy of any F-ordered input array, not just of arrays that happen to be views onto the same block of memory.

For example:

import numpy as np

# %memit is provided by the memory_profiler IPython extension
# (load it with %load_ext memory_profiler)
X = np.random.randn(1000, 50000)
Y = np.random.randn(50000, 100)

# X and Y are both C-order, no copy
%memit np.dot(X, Y)
# maximum of 1: 485.554688 MB per loop

# make X Fortran order and Y C-order, now the larger array (X) gets
# copied
X = np.asfortranarray(X)
%memit np.dot(X, Y)
# maximum of 1: 867.070312 MB per loop

# make X C-order and Y Fortran order, now the smaller array (Y) gets
# copied
X = np.ascontiguousarray(X)
Y = np.asfortranarray(Y)
%memit np.dot(X, Y)
# maximum of 1: 523.792969 MB per loop

# make both of them F-ordered, both get copied!
X = np.asfortranarray(X)
%memit np.dot(X, Y)
# maximum of 1: 905.093750 MB per loop

If copying is a problem (e.g. when X is very large), what can you do about it?

The best option would probably be to upgrade to a newer version of numpy: as @perimosocordiae points out, this performance issue was addressed in this pull request (numpy/numpy#2730).
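A quick way to check which version you are running (this check is mine, not part of the original answer):

import numpy as np
print(np.__version__)
# >= 1.8: np.dot no longer copies F-ordered (transposed-view) inputs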

If for whatever reason you can't upgrade numpy, there is also a trick that allows you to perform fast, BLAS-based dot products without forcing a copy by calling the relevant BLAS function directly through scipy.linalg.blas (shamelessly stolen from this answer):

import numpy as np
from scipy.linalg import blas

X = np.random.randn(1000, 50000)

%memit res1 = np.dot(X, X.T)
# maximum of 1: 845.367188 MB per loop

# dgemm with trans_a=True computes (X.T).T.dot(X.T) = X.dot(X.T); both
# inputs are F-contiguous views, so no copy is forced
%memit res2 = blas.dgemm(alpha=1., a=X.T, b=X.T, trans_a=True)
# maximum of 1: 471.656250 MB per loop

print(np.all(res1 == res2))
# True
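As an aside that goes beyond the original answer: since X.dot(X.T) is symmetric, the BLAS syrk (symmetric rank-k update) routine only needs to compute one triangle of the result. A minimal sketch, assuming scipy.linalg.blas exposes dsyrk with the standard BLAS semantics; here trans=1 computes a.T.dot(a), mirroring the dgemm trick above so the F-contiguous view X.T is passed straight through:

import numpy as np
from scipy.linalg import blas

X = np.random.randn(1000, 50000)

# dsyrk writes only the upper triangle of X.dot(X.T) (lower=0 is the default)
C = blas.dsyrk(alpha=1., a=X.T, trans=1)
# rebuild the full symmetric matrix from the upper triangle
S = np.triu(C) + np.triu(C, k=1).T

print(np.allclose(S, X.dot(X.T)))
# True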
ali_m
  • It's mentioned above, but numpy >=1.8 solves this problem for you: https://github.com/numpy/numpy/pull/2730 – perimosocordiae Jan 19 '14 at 09:21
  • @perimosocordiae Yep, but some people might find it awkward to upgrade for whatever reason (broken dependencies etc.), so I thought I'd give a solution using 1.7.1. I'll edit my answer to make that clearer. – ali_m Jan 19 '14 at 10:17