import numpy as np

v = np.zeros((3, 10000), dtype=np.float32)
mat = np.zeros((10000, 10000000), dtype=np.int8)

w = np.matmul(v, mat)

yields

Traceback (most recent call last):
  File "int_mul_test.py", line 6, in <module>
    w = np.matmul(v, mat)
numpy.core._exceptions.MemoryError: Unable to allocate 373. GiB
for an array with shape (10000, 10000000) and data type float32

Apparently, numpy is trying to convert my 10k x 10m int8 matrix to dtype float32. Why does it need to do this? It seems extremely wasteful: if matrix multiplication must work with float numbers in memory, it could convert, say, 1m columns at a time (which shouldn't sacrifice much speed) instead of converting all 10m columns at once.
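
For reference, the figure in the error message is exactly the size of a float32 copy of mat:

>>> 10000 * 10000000 * 4 / 2**30   # 10k x 10m elements, 4 bytes each, in GiB
372.5290298461914                  # rounded to "373. GiB" in the message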

My current solution is to use a loop to break the matrix into 10 pieces and reduce temporary memory allocation to 1/10 of the 373 GiB:

# Allocate the full float32 result once, then fill it block by block,
# so the temporary float32 copy of mat is only ever `block` columns wide.
w = np.empty((v.shape[0], mat.shape[1]), dtype=np.float32)
start = 0
block = 1000000
for i in range(mat.shape[1] // block):
    end = start + block
    w[:, start:end] = np.matmul(v, mat[:, start:end])
    start = end
w[:, start:] = np.matmul(v, mat[:, start:])  # remaining columns, if any
# runs in 396 seconds

Is there a numpy-idiomatic way to multiply "piece by piece" without manually coding a loop?

Junyan Xu
  • For speed, `matmul` uses compiled libraries like BLAS, which work with just a few dtypes (mainly float). It has a limited ability to work with other dtypes, such as `object`, but at a severe compromise in speed. What kind of results do you expect from such arrays (apart from this all-zeros example)? Even if it stuck with `int8`, the result would likely overflow. – hpaulj Feb 15 '22 at 20:40
  • The result is only 3 x 10m, so it can be float32 no problem. The problem is numpy creating an intermediate, temporary 10k x 10m float32 matrix that occupies 4x the space of the original. – Junyan Xu Feb 15 '22 at 21:46

1 Answer


The semantics of Numpy operations force the inputs of a binary operation to be cast when the left/right types are different. In fact, this is the case in almost all statically typed languages, including C, C++, Java and Rust, but also in many dynamically typed languages (where the rules are applied at runtime). Python also (partially) applies such well-defined semantic rules. For example, when you evaluate the expression True * 1.7, the interpreter looks at the types of both operands (bool and float here) and applies conversion rules until both operands have the same type before performing the actual multiplication. In this case, True of type bool is cast to 1 of type int, which is then cast to 1.0 of type float. Such rules are generally defined so that they are both relatively unambiguous and safe: for example, you do not expect 2 * 1.7 to be equal to 3.

Numpy uses semantic rules similar to those of the C language because it is written in C and provides native types. The rules are defined independently of any given implementation, although performance and ease of use matter a lot when designing them. Unfortunately, in your case, this means a huge array has to be allocated.
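
You can check the promotion rule that applies in this very case with NumPy's public promotion API (a small illustration):

import numpy as np

# int8 combined with float32 promotes to float32, so matmul must cast
# the whole int8 operand before calling the float32 BLAS routine.
print(np.result_type(np.int8, np.float32))  # float32
a = np.ones((2, 2), dtype=np.float32)
b = np.ones((2, 2), dtype=np.int8)
print(np.matmul(a, b).dtype)                # float32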

Note that Numpy could theoretically bypass the casting and implement the N * N possible versions of each binary operation for the N different types in order to make them faster (like the "as-if" rule of the C language permits). However, this would be insane for developers to implement, and it would result in more bug-prone code (i.e. less stable and slower development) and huge code bloat (bigger binaries). This is especially true since other parameters have to be taken into account, like the shape of the arrays and the memory layout (e.g. alignment), or even the target architecture. The current main casting generative function of Numpy is already quite complex and already results in 4 x 18 x 18 = 1296 different C functions being compiled into the Numpy binaries!

In your case, you can use Cython or Numba to generate a memory-efficient (and possibly faster) implementation dedicated to your specific needs. Be careful about possible overflows though.
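
For instance, here is a minimal Numba sketch of such a dedicated kernel (the function name and loop structure are my own illustration, not a tuned implementation): it promotes each int8 element to float32 on the fly, so no float32 copy of mat is ever materialized.

import numba as nb
import numpy as np

@nb.njit(parallel=True)
def matmul_f32_int8(v, mat):
    # v: (m, k) float32, mat: (k, n) int8 -> result: (m, n) float32
    m, k = v.shape
    n = mat.shape[1]
    w = np.zeros((m, n), dtype=np.float32)
    for j in nb.prange(n):               # output columns are independent,
        for kk in range(k):              # so parallelizing over j is safe
            x = np.float32(mat[kk, j])   # promote a single int8 on the fly
            for i in range(m):
                w[i, j] += v[i, kk] * x
    return w

You would call it as w = matmul_f32_int8(v, mat). Whether it beats the blocked np.matmul version depends on your machine, since both are memory-bound, but its extra memory use is only the (3, 10000000) float32 output itself.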

Jérôme Richard