I want to multiply two huge matrices, each with more than 100,000 rows and columns. I run the task on a server with several GPUs, say 8 RTX 3090s with 24 GB of RAM each. Clearly a full matrix cannot fit in a single GPU's memory, so I cannot use cupy.array directly. Here is my idea:
- store the two matrices in main memory as numpy.array
- cut them into blocks, say 4 or 9 blocks each
- send the blocks to the GPUs and compute the block products there
- copy the resulting blocks back to main memory and reassemble them
Here are my questions:
- Is there a Python library that implements this idea automatically? (See the dask sketch right after this list.)
- I want to use the GPUs in parallel. I think the bottleneck is the data transfer between main memory and GPU memory, i.e. numpy.array -> cupy.array. Can I move data in parallel using the multiprocessing library, and what about the PCIe bus? (See the threaded sketch after the code at the end.)
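On question 1: dask.array implements exactly this blocked scheme (it calls the blocks "chunks"), and the RAPIDS dask-cuda package starts one worker per GPU. A minimal sketch, assuming dask, dask-cuda and cupy are installed (the chunk sizes are illustrative, not tuned):

import cupy as cp
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster   # RAPIDS: pip install dask-cuda

client = Client(LocalCUDACluster())      # one dask worker per visible GPU

# lazy, chunked arrays in main memory; pick chunks so a handful of
# blocks fit in one GPU's 24 GB at a time
a = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))
b = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))

# move chunks to the GPUs, block-multiply there, bring results back
c = (a.map_blocks(cp.asarray) @ b.map_blocks(cp.asarray)).map_blocks(cp.asnumpy)

result = c.compute()   # dask schedules the transfers and per-block matmuls

dask decides the block schedule itself and overlaps transfers with compute, which covers both bullet points above.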
NOTE:
- assume the matrices are not sparse.
For a 2x2 block split, the product decomposes as:

[[a1, b1],     [[a2, b2],     [[a1a2 + b1c2,  a1b2 + b1d2],
 [c1, d1]]  *   [c2, d2]]  =   [c1a2 + d1c2,  c1b2 + d1d2]]
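This identity can be sanity-checked with small numpy arrays (the sizes here are just for the demo):

import numpy as np

n = 4
A = np.random.random((2 * n, 2 * n))
B = np.random.random((2 * n, 2 * n))
# cut each operand into four n x n blocks
a1, b1, c1, d1 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
a2, b2, c2, d2 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
# top-left block of the product, assembled from block matmuls
assert np.allclose(a1 @ a2 + b1 @ c2, (A @ B)[:n, :n])

Here is my attempt at computing that top-left block on one GPU: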
import cupy as cp
import numpy as np
N = 27000
P = 27000
# build two (2N x 2P) source matrices in main memory
source1 = np.random.random((N * 2, P * 2))
source2 = np.random.random((N * 2, P * 2))
# cut each matrix into four (N x P) blocks
a1 = source1[:N, :P]
b1 = source1[:N, P:]
c1 = source1[N:, :P]
d1 = source1[N:, P:]
a2 = source2[:N, :P]
b2 = source2[:N, P:]
c2 = source2[N:, :P]
d2 = source2[N:, P:]
# move a1 and a2 to one GPU and compute the first partial product
m1 = cp.array(a1)
m2 = cp.array(a2)
r1 = m1 @ m2  # matrix product of the blocks, not element-wise *
# free the inputs so that b1 and c2 can fit in the GPU's RAM
del m1
del m2
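# note: del only returns the memory to CuPy's memory pool, which reuses
# it for the next allocations on the same device; to hand it back to the
# CUDA driver explicitly, call cp.get_default_memory_pool().free_all_blocks()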
# move b1 and c2 to the same GPU and compute the second partial product
m3 = cp.array(b1)
m4 = cp.array(c2)
r2 = m3 @ m4
del m3
del m4
r1 += r2  # r1 is now the top-left block of the full product
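On question 2: threads are usually a better fit than multiprocessing here, because every process would create its own CUDA context, and CuPy releases the GIL during transfers and kernel launches. Each GPU sits on its own PCIe link, so host-to-device copies to different cards can overlap; what makes a single copy fast is pinned (page-locked) host memory. Below is a sketch of the full 2x2 block product spread over the available GPUs. pin and block_product are my own helper names, not library APIs, and the block size must be small enough that roughly four float64 blocks fit in 24 GB:

import threading
import numpy as np
import cupy as cp

def pin(array):
    # copy a block into page-locked host memory; transfers from pinned
    # memory skip an internal staging copy and run at full PCIe speed
    mem = cp.cuda.alloc_pinned_memory(array.nbytes)
    ret = np.frombuffer(mem, array.dtype, array.size).reshape(array.shape)
    ret[...] = array
    return ret

def block_product(dev_id, left_blocks, right_blocks, out):
    # out[...] = sum over k of left_blocks[k] @ right_blocks[k] on GPU dev_id
    with cp.cuda.Device(dev_id):
        acc = None
        for lb, rb in zip(left_blocks, right_blocks):
            l_d = cp.asarray(lb)          # host -> device copy
            r_d = cp.asarray(rb)
            if acc is None:
                acc = l_d @ r_d
            else:
                acc += l_d @ r_d          # in-place add keeps peak usage low
            del l_d, r_d                  # memory goes back to CuPy's pool
        out[...] = cp.asnumpy(acc)        # device -> host copy of the result

N = 27000                                 # ~5.8 GB per float64 block
A = np.random.random((2 * N, 2 * N))
B = np.random.random((2 * N, 2 * N))
C = np.empty_like(A)

n_gpus = cp.cuda.runtime.getDeviceCount()
threads = []
for i in range(2):                        # block row of the result
    for j in range(2):                    # block column of the result
        left = [pin(A[i*N:(i+1)*N, k*N:(k+1)*N]) for k in range(2)]
        right = [pin(B[k*N:(k+1)*N, j*N:(j+1)*N]) for k in range(2)]
        out = C[i*N:(i+1)*N, j*N:(j+1)*N]
        t = threading.Thread(target=block_product,
                             args=((2 * i + j) % n_gpus, left, right, out))
        t.start()
        threads.append(t)
for t in threads:
    t.join()
# sanity check (only feasible for small N): assert np.allclose(C, A @ B)

Note that pin() duplicates each block in host RAM; for very large matrices you may prefer to pin a small set of staging buffers and stream blocks through them instead of pinning everything up front.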