import numpy as np
v = np.zeros((3,10000), dtype=np.float32)
mat = np.zeros((10000,10000000), dtype=np.int8)
w = np.matmul(v, mat)
yields
Traceback (most recent call last):
File "int_mul_test.py", line 6, in <module>
w = np.matmul(v, mat)
numpy.core._exceptions.MemoryError: Unable to allocate 373. GiB
for an array with shape (10000, 10000000) and data type float32
Apparently, numpy is trying to convert my 10k x 10m int8 matrix to dtype float32. Why does it need to do this? It seems extremely wasteful, and if matrix multiplication must work with float numbers in memory, it could convert say 1m columns at a time (which shouldn't sacrifice speed too much), instead of converting all 10m columns all at once.
My current solution is to use a loop to break the matrix into 10 pieces and reduce temporary memory allocation to 1/10 of the 373 GiB:
w = np.empty((v.shape[0],mat.shape[1]),dtype=np.float32)
start = 0
block = 1000000
for i in range(mat.shape[1]//block):
end = start + block
w[:,start:end] = np.matmul(v, mat[:,start:end])
start = end
w[:,start:] = np.matmul(v, mat[:,start:])
# runs in 396 seconds
Is there a numpy-idiomatic way to multiply "piece by piece" without manually coding a loop?