I want to multiply two huge matrices, each with more than 100,000 rows and columns. I run the task on a server with several GPUs, say 8 RTX 3090s with 24 GB of RAM each. Clearly a full matrix cannot fit in a single GPU's memory, so I cannot use cupy.array directly. Here is my idea:
- store the two matrices in main memory as numpy.array
- cut them into blocks, say 4 or 9 blocks each
- send the blocks to the GPUs and compute the block products there
- copy the resulting blocks back to main memory and reassemble them
Here are my questions:
- Is there a Python library that implements this idea automatically? (See the dask sketch right after this list.)
- I want to use the GPUs in parallel. I think the bottleneck is the data transfer between main memory and GPU memory, i.e. numpy.array -> cupy.array. Can I move data in parallel using the multiprocessing library, and what about the PCIe bus? (See the threaded sketch after the code at the end.)
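On question 1: dask.array implements exactly this blocked scheme (it calls the blocks "chunks"), and the RAPIDS dask-cuda package starts one worker per GPU. A minimal sketch, assuming dask, dask-cuda and cupy are installed (the chunk sizes are illustrative, not tuned):

import cupy as cp
import dask.array as da
from dask.distributed import Client
from dask_cuda import LocalCUDACluster   # RAPIDS: pip install dask-cuda

client = Client(LocalCUDACluster())      # one dask worker per visible GPU

# lazy, chunked arrays in main memory; pick chunks so a handful of
# blocks fit in one GPU's 24 GB at a time
a = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))
b = da.random.random((100_000, 100_000), chunks=(10_000, 10_000))

# move chunks to the GPUs, block-multiply there, bring results back
c = (a.map_blocks(cp.asarray) @ b.map_blocks(cp.asarray)).map_blocks(cp.asnumpy)

result = c.compute()   # dask schedules the transfers and per-block matmuls

dask decides the block schedule itself and overlaps transfers with compute, which covers both bullet points above.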
NOTE:
- assume the matrices are not sparse.
For a 2x2 block split, the product decomposes as:

[[a1, b1],     [[a2, b2],     [[a1a2 + b1c2,  a1b2 + b1d2],
 [c1, d1]]  *   [c2, d2]]  =   [c1a2 + d1c2,  c1b2 + d1d2]]
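This identity can be sanity-checked with small numpy arrays (the sizes here are just for the demo):

import numpy as np

n = 4
A = np.random.random((2 * n, 2 * n))
B = np.random.random((2 * n, 2 * n))
# cut each operand into four n x n blocks
a1, b1, c1, d1 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
a2, b2, c2, d2 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]
# top-left block of the product, assembled from block matmuls
assert np.allclose(a1 @ a2 + b1 @ c2, (A @ B)[:n, :n])

Here is my attempt at computing that top-left block on one GPU: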
import cupy as cp
import numpy as np
N = 27000
P = 27000
# build two (2N x 2P) source matrices in main memory
source1 = np.random.random((N * 2, P * 2))
source2 = np.random.random((N * 2, P * 2))
# cut each matrix into four (N x P) blocks
a1 = source1[:N, :P]
b1 = source1[:N, P:]
c1 = source1[N:, :P]
d1 = source1[N:, P:]
a2 = source2[:N, :P]
b2 = source2[:N, P:]
c2 = source2[N:, :P]
d2 = source2[N:, P:]
# move a1 and a2 to one GPU and compute the first partial product
m1 = cp.array(a1)
m2 = cp.array(a2)
r1 = m1 @ m2  # matrix product of the blocks, not element-wise *
# free the inputs so that b1 and c2 can fit in the GPU's RAM
del m1
del m2
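# note: del only returns the memory to CuPy's memory pool, which reuses
# it for the next allocations on the same device; to hand it back to the
# CUDA driver explicitly, call cp.get_default_memory_pool().free_all_blocks()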
# move b1 and c2 to the same GPU and compute the second partial product
m3 = cp.array(b1)
m4 = cp.array(c2)
r2 = m3 @ m4
del m3
del m4
r1 += r2  # r1 is now the top-left block of the full product
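On question 2: threads are usually a better fit than multiprocessing here, because every process would create its own CUDA context, and CuPy releases the GIL during transfers and kernel launches. Each GPU sits on its own PCIe link, so host-to-device copies to different cards can overlap; what makes a single copy fast is pinned (page-locked) host memory. Below is a sketch of the full 2x2 block product spread over the available GPUs. pin and block_product are my own helper names, not library APIs, and the block size must be small enough that roughly four float64 blocks fit in 24 GB:

import threading
import numpy as np
import cupy as cp

def pin(array):
    # copy a block into page-locked host memory; transfers from pinned
    # memory skip an internal staging copy and run at full PCIe speed
    mem = cp.cuda.alloc_pinned_memory(array.nbytes)
    ret = np.frombuffer(mem, array.dtype, array.size).reshape(array.shape)
    ret[...] = array
    return ret

def block_product(dev_id, left_blocks, right_blocks, out):
    # out[...] = sum over k of left_blocks[k] @ right_blocks[k] on GPU dev_id
    with cp.cuda.Device(dev_id):
        acc = None
        for lb, rb in zip(left_blocks, right_blocks):
            l_d = cp.asarray(lb)          # host -> device copy
            r_d = cp.asarray(rb)
            if acc is None:
                acc = l_d @ r_d
            else:
                acc += l_d @ r_d          # in-place add keeps peak usage low
            del l_d, r_d                  # memory goes back to CuPy's pool
        out[...] = cp.asnumpy(acc)        # device -> host copy of the result

N = 27000                                 # ~5.8 GB per float64 block
A = np.random.random((2 * N, 2 * N))
B = np.random.random((2 * N, 2 * N))
C = np.empty_like(A)

n_gpus = cp.cuda.runtime.getDeviceCount()
threads = []
for i in range(2):                        # block row of the result
    for j in range(2):                    # block column of the result
        left = [pin(A[i*N:(i+1)*N, k*N:(k+1)*N]) for k in range(2)]
        right = [pin(B[k*N:(k+1)*N, j*N:(j+1)*N]) for k in range(2)]
        out = C[i*N:(i+1)*N, j*N:(j+1)*N]
        t = threading.Thread(target=block_product,
                             args=((2 * i + j) % n_gpus, left, right, out))
        t.start()
        threads.append(t)
for t in threads:
    t.join()
# sanity check (only feasible for small N): assert np.allclose(C, A @ B)

Note that pin() duplicates each block in host RAM; for very large matrices you may prefer to pin a small set of staging buffers and stream blocks through them instead of pinning everything up front.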