
I want to multiply two huge matrices, each with more than 100,000 rows and columns. I run the task on a server with several GPUs, say 8 RTX 3090s, each with 24 GB of RAM. A matrix that size obviously cannot fit in GPU memory, so I cannot use cupy.array directly. Here is my idea:

  1. store the two matrices in main memory as numpy.array
  2. cut them into blocks, maybe 4 or 9 blocks
  3. send the blocks to the GPUs and compute them there
  4. retrieve the resulting blocks to main memory and reassemble them

Here are my questions:

  1. Is there any library in Python that implements this idea automatically?
  2. I want to use the GPUs in parallel. I think the bottleneck is the data transfer between main memory and GPU memory, i.e. numpy.array -> cupy.array. Can I move data in parallel using the multiprocessing library, and how does the PCIe bus affect this?

NOTE:

  1. assume the matrices are not sparse.
[[a1,b1],   *   [[a2,b2],   =   [[a1a2+b1c2, a1b2+b1d2],
 [c1,d1]]        [c2,d2]]        [c1a2+d1c2, c1b2+d1d2]]
import cupy as cp
import numpy as np

N = 27000
P = 27000

# init two matrices
source1 = np.random.random((N * 2, P * 2))
source2 = np.random.random((N * 2, P * 2))

# cut them in blocks
a1 = source1[:N, :P]
b1 = source1[:N, P:]
c1 = source1[N:, :P]
d1 = source1[N:, P:]

a2 = source2[:N, :P]
b2 = source2[:N, P:]
c2 = source2[N:, :P]
d2 = source2[N:, P:]

# move a1 and a2 to one GPU
m1 = cp.array(a1)
m2 = cp.array(a2)
r1 = m1 @ m2  # matrix product (not elementwise *), per the block identity above
# free memory so that m3 and m4 can fit in the GPU's RAM
del m1
del m2

# move b1 and c2 to the same GPU
m3 = cp.array(b1)
m4 = cp.array(c2)
r2 = m3 @ m4
del m3
del m4
r1 += r2  # r1 now holds the top-left block a1a2 + b1c2
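For reference, the block identity can be checked on the CPU with plain NumPy before bringing CuPy into it (a minimal sketch with small matrices; note the @ operator — the identity needs matrix products, not the elementwise * product):

```python
import numpy as np

# Small sizes for a quick check; the real N and P would be 27000.
n = 100
A = np.random.random((2 * n, 2 * n))
B = np.random.random((2 * n, 2 * n))

# Cut both matrices into four blocks, as in the question.
a1, b1, c1, d1 = A[:n, :n], A[:n, n:], A[n:, :n], A[n:, n:]
a2, b2, c2, d2 = B[:n, :n], B[:n, n:], B[n:, :n], B[n:, n:]

# Top-left block of the product: a1a2 + b1c2.
top_left = a1 @ a2 + b1 @ c2
```

With CuPy, each slice would be wrapped in cp.array before the @, exactly as in the snippet above.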
吴慈霆
  • do you require it to be done on GPUs? – anon01 Dec 29 '21 at 08:09
  • Yes, since the matrices are really large, multiplying them on CPUs may take hours. Based on my experiments, it takes only minutes using one GPU. – 吴慈霆 Dec 29 '21 at 08:11
  • Consider pytorch (or maybe tensorflow). It is well supported and integrates closely with numpy. I've had mixed results with pyopencl and numba. – anon01 Dec 29 '21 at 10:13

3 Answers


Dask supports array operations (including matrix multiplication) on GPUs via CuPy backed arrays. You can use a multi-node, multi-GPU cluster with Dask-CUDA.
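A minimal CPU-only sketch of the idea with Dask arrays (NumPy-backed chunks here, assuming dask is installed; on a multi-GPU machine the chunks would be CuPy arrays scheduled by Dask-CUDA, which is not shown):

```python
import numpy as np
import dask.array as da

# Chunk each matrix into 1000x1000 blocks; each block becomes a task
# that Dask can schedule on a different worker (or GPU, with Dask-CUDA).
x = da.from_array(np.random.random((2000, 2000)), chunks=(1000, 1000))
y = da.from_array(np.random.random((2000, 2000)), chunks=(1000, 1000))

# Blocked matrix multiplication: Dask multiplies and sums the blocks,
# without ever needing the whole product in one worker's memory at once.
z = (x @ y).compute()
```

The chunk size is the knob that corresponds to "4 blocks or 9 blocks" in the question: pick it so a few chunks fit in one GPU's 24 GB.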

Nick Becker

Look into the "cuBLAS Multi-GPU Extension": https://developer.nvidia.com/cublas

You'll have to apply for the early access program. Existing Python libraries probably won't take advantage of this extension, but you may be able to enable it simply by updating your CUDA libraries. You'd have to read the documentation once you have access.

Jeff

Python has a dedicated library for this, PyCUDA: https://documen.tician.de/pycuda/

Simple example:

import pycuda.autoinit
import pycuda.driver as drv
import numpy

from pycuda.compiler import SourceModule
mod = SourceModule("""
__global__ void multiply_them(float *dest, float *a, float *b)
{
  const int i = threadIdx.x;
  dest[i] = a[i] * b[i];
}
""")

multiply_them = mod.get_function("multiply_them")

a = numpy.random.randn(400).astype(numpy.float32)
b = numpy.random.randn(400).astype(numpy.float32)

dest = numpy.zeros_like(a)
multiply_them(
        drv.Out(dest), drv.In(a), drv.In(b),
        block=(400,1,1), grid=(1,1))

print(dest - a * b)
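Note that this kernel computes an elementwise product, not the matrix multiplication the question asks about; its NumPy equivalent is simply:

```python
import numpy as np

a = np.random.randn(400).astype(np.float32)
b = np.random.randn(400).astype(np.float32)

# One output per index, same as the multiply_them kernel above;
# a matrix product would instead need a @ b on 2-D arrays.
dest = a * b
```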
lazy