
I need to multiply a matrix with its transpose, and I am running out of memory on my GPU with the error message numba.cuda.cudadrv.driver.CudaAPIError: [2] Call to cuMemAlloc results in CUDA_ERROR_OUT_OF_MEMORY

I am expecting the size of my matrix to be around 10k rows and 100k columns, so multiplying it with its transpose will give a square matrix of 10k rows and 10k columns. The matrix contains only 0s and 1s.

This is the script that I am running.

from numba import cuda, uint16
import numba
import numpy
import math
import time

TPB = 16

@cuda.jit()
def matmul_shared_mem(A, B, C):
    sA = cuda.shared.array((TPB, TPB), dtype=uint16)
    sB = cuda.shared.array((TPB, TPB), dtype=uint16)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    if x >= C.shape[0] and y >= C.shape[1]:
        return
    tmp = 0.
    for i in range(int(A.shape[1] / TPB)):
        sA[tx, ty] = A[x, ty + i * TPB]
        sB[tx, ty] = B[tx + i * TPB, y]
        cuda.syncthreads()
        for j in range(TPB):
            tmp += sA[tx, j] * sB[j, ty]
        cuda.syncthreads()
    C[x, y] = tmp

A = numpy.random.randint(2, size=(TPB * 625, 50000))

B = A.transpose()

C_shared_mem = cuda.device_array((A.shape[0], B.shape[1]))

threads_per_block = (TPB, TPB)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(B.shape[1] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)

start_gpu_shared_memory = time.time()
matmul_shared_mem[blocks_per_grid, threads_per_block](A, B, C_shared_mem)
cuda.synchronize()
end_gpu_shared_memory = time.time()

time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))

Update 1:

Based on your suggestions, I made some changes, but I am still running out of memory.

import numpy as np
import numba as nb
colm = int(200000/8)
rows = 100000
cols = int(colm*8)
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
A = np.empty((rows,colm), dtype=np.uint8)

@nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def compute(A, AU):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = AU[i,offset] << 7
            res |= AU[i,offset+1] << 6
            res |= AU[i,offset+2] << 5
            res |= AU[i,offset+3] << 4
            res |= AU[i,offset+4] << 3
            res |= AU[i,offset+5] << 2
            res |= AU[i,offset+6] << 1
            res |= AU[i,offset+7]
            A[i,j] = res

compute(A, AU)

from numba import cuda, uint8, int32
import numba
import numpy as np
import math
import time

TPB = 8
TPB1 = 9

@cuda.jit()
def bit_A_AT(A, C):
    sA = cuda.shared.array((TPB, TPB), dtype=uint8)
    sB = cuda.shared.array((TPB, TPB1), dtype=uint8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    if bx >= by:
        tmp = int32(0)
        for i in range((A.shape[1]+TPB-1) // TPB):
            if y < A.shape[0] and (i*TPB+tx) < A.shape[1]:
                sA[ty, tx] = A[y, i*TPB+tx]
            else:
                sA[ty, tx] = 0
            if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
                sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
            else:
                sB[ty, tx] = 0
            cuda.syncthreads()
            for j in range(TPB):
                tmp1 = sA[ty,j] & sB[tx, j]
                test = uint8(1)
                for k in range(8):
                    if (tmp1 & test) > 0:
                        tmp += 1
                    test <<= 1
            cuda.syncthreads()
        if y < C.shape[0] and x < C.shape[1]:
            C[y, x] = tmp

C = np.empty((A.shape[0], A.shape[0]), dtype=np.int32)
threads_per_block = (TPB, TPB)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(A.shape[0] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)

start_gpu_shared_memory = time.time()
bit_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()
end_gpu_shared_memory = time.time()

time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))

Any idea how I can fix this?

secretive
    1. You are using unnecessarily large types. Some of your types are 64-bit, and you are mixing types, which is bad. Use a consistent 32-bit `dtype` throughout. That will cut your memory usage in half. Either `int32` or `float32` should be OK. 2. To cut your memory usage in half again, use the method [here](https://stackoverflow.com/questions/59913917/access-an-matrix-as-its-tranpose-in-tiled-matrix-mutliplication-in-cuda/59942630#59942630). Since the matrix and its transpose can be extracted from the same data, there is no need to transpose on the host and copy both to the device. – Robert Crovella Apr 13 '21 at 15:23
  • 3. You have other issues, I suggest reviewing [this](https://stackoverflow.com/questions/64197780/how-to-generalize-fast-matrix-multiplication-on-gpu-using-numba/64198479#64198479). – Robert Crovella Apr 13 '21 at 15:24
  • I changed the data type to `uint16` to save space, but that still does not solve my memory problem. I am new to this, so I'm not sure, but I did not send my two matrices A and B to GPU memory in the code, so they should still be in host memory, right? – secretive Apr 13 '21 at 15:32
  • Also, I don't know the C language that well. – secretive Apr 13 '21 at 15:33
  • when you do this: `matmul_shared_mem[blocks_per_grid, threads_per_block](A, B, C_shared_mem)` you are sending A and B to the device. numba does that automatically. For matrices of 10kx100k, with a 10kx10k result, the theoretical maximum value if the matrix elements are only 0, 1 is 100,000, which would overflow a 16-bit integer. It might be helpful if you indicate the GPU you are running on, and how much memory is on that GPU. For kernel code, the translation from C to python is more-or-less mechanical. You don't have to know much about C, just what the numba equivalents are. – Robert Crovella Apr 13 '21 at 15:43
  • and the arrays that are taking up space here are the A, B, C arrays, not your shared arrays. Changing the type of the shared array (only) is wrong, and will do nothing to solve the memory problem. – Robert Crovella Apr 13 '21 at 15:45
  • The GPU is Tesla M60 with 8 gigs of memory. – secretive Apr 13 '21 at 15:49
  • properly decorating your `A` and `C_shared_mem ` declarations with `dtype=numpy.uint16` should get you past the out of memory error. – Robert Crovella Apr 13 '21 at 16:15
  • Yeah, but as you said, and I agree, 16 bits won't be enough. I will have to look into 32 bits for sure. I am checking the C code to see if I can somehow write it in Python, but that is going to be a challenge. – secretive Apr 13 '21 at 16:21
  • If you change all references of uint16 to uint32 you can still fit it in 8GB. That will use about 4.8GB of memory. – Robert Crovella Apr 13 '21 at 16:23
  • But I would still need to solve the part where I do not use B matrix and just use the A matrix. I will have to change the `sB` part and the calculation to reflect that. Is there any other way to scale this... say if I have 30k rows and 200k columns of all 0 and 1. – secretive Apr 13 '21 at 16:28
  • Not sure why so many negative votes. I am guessing to remove B matrix, I have to change `sB[tx, ty] = B[tx + i * TPB, y]` to basically access matrix A in some fashion but I am not sure how to do that. Any advice? – secretive Apr 15 '21 at 20:19
  • I have been reading about it, but if I am not able to understand it and cannot find documentation for it, or the documentation contains a broken example, then Stack Overflow is the only place I have. And I don't know C, so that goes out the window for me. – secretive Apr 16 '21 at 04:23

1 Answer


The following method should reduce the amount of device memory required for the calculation of A x AT. We'll use the following ideas:

  • since the input array (A) only takes on values of 0,1, we'll reduce the storage for that array down to the minimum convenient size, int8, i.e. one byte per element
  • since the B array is just the transpose of the A array, there is no need to handle it explicitly. We can derive it from the A array, somewhat similar to here, although that is performing AT x A
  • the matrix multiplication of A x AT involves taking the dot-product of the rows of matrix A, as indicated here (a short NumPy check of this appears right after this list)
  • we will provide the A transposed version in the sB array using adjusted indexing
  • there are a range of other changes to your code, to address various errors and also to improve load/store efficiency, such as a general reversal of your usage of x,y indices
  • I've also fixed your usage of syncthreads and modified the code to allow arbitrary values for row and column dimensions
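
To make the transpose-free idea concrete, here is a tiny NumPy check (illustrative only, not part of the kernels below): every element of A x AT is the dot product of two rows of A, so AT never has to be materialized.

import numpy as np
A = np.random.randint(2, size=(5, 7), dtype=np.int8)
C = A.astype(np.int32) @ A.T.astype(np.int32)    # reference result for A x AT
i, j = 1, 3
assert C[i, j] == int(np.dot(A[i], A[j]))        # dot product of rows i and j of A
# with 0/1 entries this is just the count of positions where both rows are 1
assert C[i, j] == int(np.sum(A[i] & A[j]))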

Here is a worked example:

$ cat t62.py
from numba import cuda, int32, int8
import numba
import numpy as np
import math
import time

TPB = 32
TPB1 = TPB+1
@cuda.jit()
def byte_A_AT(A, C):
    sA = cuda.shared.array((TPB, TPB),  dtype=int8)
    sB = cuda.shared.array((TPB, TPB1), dtype=int8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
# uncomment and indent remainder of kernel to only do the "symmetric half" of calculation
#    if bx >= by:
    tmp = int32(0)
    for i in range((A.shape[1]+TPB-1)// TPB):
        if y < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sA[ty, tx] = A[y, i*TPB+tx]
        else:
            sA[ty, tx] = 0
        if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
        else:
            sB[ty, tx] = 0
        cuda.syncthreads()
        for j in range(TPB):
            tmp += int32(sA[ty,j]) * int32(sB[tx, j])
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp
rows = 1041
cols = 1043
print('host mem: ', (rows*cols*2+rows*rows*4*2)//1048576, 'MB device mem: ', (rows*cols+rows*rows*4)//1048576, 'MB')
A = np.random.randint(2,size=(rows, cols),dtype=np.int8)
AT = A.transpose()
CU = np.matmul(A,AT, dtype = np.int32)
C = np.empty((A.shape[0], A.shape[0]), dtype=np.int32)
threads_per_block = (TPB, TPB)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(A.shape[0] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
byte_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()
start_gpu_shared_memory = time.time()
byte_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()
end_gpu_shared_memory = time.time()

time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
test = np.array_equal(C, CU)
print(test)
if test == False:
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            if C[i,j] != CU[i,j]:
                print(i, ' ' , j ,' ' , C[i,j] , ' ' , CU[i,j])
$ python t62.py
host mem:  10 MB device mem:  5 MB
GPU time(shared memory):0.019593000411987305
True
$

Notes:

  • most of the runtime of the above code will be spent in python (in the np.matmul() operation, which is really only used to verify results and should not be necessary in an actual implementation), not in the GPU portion. As the matrices are made larger, the code will run much more slowly.
  • as mentioned in the comments, the result of A x AT is a symmetric matrix. My code does not take advantage of this, however we could crudely take advantage of it by uncommenting the if test at the beginning of the kernel and then indenting the remainder of the kernel. However this will cause the host code np.array_equal test to fail, of course.
  • the device memory consumption for this is calculated in the code. For your largest values in the comments (rows = 30k, cols = 200k) this would amount to about 10GB, so it will still not run in your 8GB GPU (the arithmetic is spelled out right after these notes).
  • I have created a version of this code which packs 8 elements per byte for the A matrix, which would further reduce memory demand, however writing this code to handle the case of arbitrary column dimensions (vs. multiples of 8) proves to be rather messy. However that code could get the total device memory consumption down to about 5GB for 30k rows and 200k columns case.
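
Spelling out the device-memory arithmetic behind the last two notes (the same formulas the scripts print, here in decimal GB; estimates only):

byte-per-element version: 30,000 * 200,000 * 1 (A) + 30,000 * 30,000 * 4 (C) bytes ≈ 6.0 GB + 3.6 GB ≈ 9.6 GB
bit-packed version:       30,000 * 200,000 / 8 (A) + 30,000 * 30,000 * 4 (C) bytes ≈ 0.75 GB + 3.6 GB ≈ 4.35 GB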

Here is the one-bit-per-element version; it requires that the number of columns be whole-number divisible by 8:

$ cat t61.py
from numba import cuda, uint8, int32
import numba
import numpy as np
import math
import time

TPB = 32
TPB1 = 33

@cuda.jit()
def bit_A_AT(A, C):
    sA = cuda.shared.array((TPB, TPB), dtype=uint8)
    sB = cuda.shared.array((TPB, TPB1), dtype=uint8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    tmp = int32(0)
    for i in range((A.shape[1]+TPB-1) // TPB):
        if y < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sA[ty, tx] = A[y, i*TPB+tx]
        else:
            sA[ty, tx] = 0
        if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
        else:
            sB[ty, tx] = 0
        cuda.syncthreads()
        for j in range(TPB):
            tmp1 = sA[ty,j] & sB[tx, j]
            test = uint8(1)
            for k in range(8):
                if (tmp1 & test) > 0:
                    tmp += 1
                test <<= 1
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp
colm = 131
rows = 1041
cols = int(colm*8)
print('host mem: ', (rows*cols*2+rows*rows*4*2)//1048576, 'MB device mem: ', (((rows*cols)//8)+rows*rows*4)//1048576, 'MB')
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
AUT = AU.transpose()
CU = np.matmul(AU,AUT,dtype=np.int32)
A = np.empty((rows,colm), dtype=np.uint8)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        A[i,j] = 0
        for k in range(8):
            if AU[i,(j*8)+k] == 1:
                A[i,j] = A[i,j] | (1<<(7-k))
C = np.empty((A.shape[0], A.shape[0]), dtype=np.int32)
threads_per_block = (TPB, TPB)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(math.ceil(A.shape[0] / threads_per_block[1]))
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
bit_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()

start_gpu_shared_memory = time.time()
bit_A_AT[blocks_per_grid, threads_per_block](A, C)
cuda.synchronize()
end_gpu_shared_memory = time.time()

time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
test = np.array_equal(C, CU)
print(test)
if test == False:
    for i in range(C.shape[0]):
        for j in range(C.shape[1]):
            if C[i,j] != CU[i,j]:
                print(i, ' ' , j ,' ' , C[i,j] , ' ' , CU[i,j])
                break
$ python t61.py
host mem:  10 MB device mem:  4 MB
GPU time(shared memory):0.009343624114990234
True
$
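
The inner j/k loop in this kernel works because each byte of sA/sB holds 8 matrix entries, and for 0/1 data the dot-product contribution of those 8 entries is just the number of set bits in sA[ty,j] & sB[tx,j]. Here is a minimal NumPy sketch of that idea (illustrative only, not part of the scripts in this answer):

import numpy as np
rows, cols = 4, 16                          # cols is a multiple of 8, as required above
AU = np.random.randint(2, size=(rows, cols), dtype=np.uint8)
A_packed = np.packbits(AU, axis=1)          # 8 entries per byte, MSB first, matching the kernel's packing
i, j = 1, 3
anded = A_packed[i] & A_packed[j]
popcount = int(np.unpackbits(anded).sum())  # count the set bits
assert popcount == int(np.dot(AU[i], AU[j]))

Note also that np.packbits zero-pads the final byte when the column count is not a multiple of 8, and all-zero columns contribute nothing to these dot products, so zero-padding the input up to the next multiple of 8 before packing is one possible way around the divisible-by-8 requirement mentioned above.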

EDIT: Responding to questions in the comments and to the updates to the question, and now taking into account that the A matrix may have significantly more than 30k rows, which causes the C matrix to grow as well: if the A matrix can fit in GPU memory, we can reduce the memory demand of the C matrix by computing it in pieces. Each piece is a group of rows computed together, which I refer to as a row_slice of the C matrix. The following code demonstrates that this can be achieved with relatively minor changes to the code above:

$ cat t63.py
from numba import cuda, uint8, int32
import numba as nb
import numpy as np
import math
import time

TPB = 32
TPB1 = 33

@cuda.jit()
def bit_slice_A_AT(A, C, row_offset):
    sA = cuda.shared.array((TPB, TPB), dtype=uint8)
    sB = cuda.shared.array((TPB, TPB1), dtype=uint8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    tmp = int32(0)
    for i in range((A.shape[1]+TPB-1) // TPB):
        if y+row_offset < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sA[ty, tx] = A[y+row_offset, i*TPB+tx]
        else:
            sA[ty, tx] = 0
        if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
        else:
            sB[ty, tx] = 0
        cuda.syncthreads()
        for j in range(TPB):
            tmp1 = sA[ty,j] & sB[tx, j]
            test = uint8(1)
            for k in range(8):
                if (tmp1 & test) > 0:
                    tmp += 1
                test <<= 1
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

@nb.njit('void(uint8[:,:],int8[:,:])', parallel=True)
def bitpack(A, AU):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = AU[i,offset] << 7
            res |= AU[i,offset+1] << 6
            res |= AU[i,offset+2] << 5
            res |= AU[i,offset+3] << 4
            res |= AU[i,offset+4] << 3
            res |= AU[i,offset+5] << 2
            res |= AU[i,offset+6] << 1
            res |= AU[i,offset+7]
            A[i,j] = res

colm = 131
rows = 1535
cols = int(colm*8)
row_slice = 512
print('host mem: ', (rows*cols*2+rows*rows*4*2)//1048576, 'MB device mem: ', (((rows*cols)//8)+row_slice*rows*4)//1048576, 'MB')
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
CU = np.matmul(AU,AU.T,dtype=np.int32)
A = np.empty((rows,colm), dtype=np.uint8)
bitpack(A, AU)
A_dev = cuda.to_device(A)
threads_per_block = (TPB, TPB)
C = np.empty((row_slice, A.shape[0]), dtype=np.int32)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(row_slice / threads_per_block[1])
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
for i in range((A.shape[0]+row_slice-1)//row_slice):
    bit_slice_A_AT[blocks_per_grid, threads_per_block](A_dev, C, i*row_slice)
    lower = i*row_slice
    upper = min(lower+row_slice, CU.shape[0])
    width = upper-lower
    test = np.array_equal(C[:width,:], CU[i*row_slice:i*row_slice+width,:])
    print(test)
cuda.synchronize()
C_dev = cuda.device_array_like(C)
start_gpu_shared_memory = time.time()
for i in range((A.shape[0]+row_slice-1)//row_slice):
    bit_slice_A_AT[blocks_per_grid, threads_per_block](A_dev, C_dev, i*row_slice)
cuda.synchronize()
end_gpu_shared_memory = time.time()

time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
$ python t63.py
host mem:  21 MB device mem:  3 MB
True
True
True
GPU time(shared memory):0.010116815567016602
$

This means, as suggested, that for the case of rows = 100k and columns = 200k given in the latest update to the question, we should be able to divide the C matrix into chunks of, say, 5k rows. The memory usage for the A matrix would be about 2.5GB, and since we only compute a 5k-row slice of C at a time, the device memory required for C would be 100k * 5k * 4 bytes, i.e. about 2GB for this example.
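
As a rough check of those numbers, here is a small helper (hypothetical, not part of the scripts in this answer) that evaluates the same formula the t63 script prints, in decimal GB:

def device_mem_gb(rows, cols, row_slice):
    # packed A: one bit per element; one row_slice of C: int32 per element
    return (rows * cols // 8 + row_slice * rows * 4) / 1e9

print(device_mem_gb(100000, 200000, 5000))   # ~4.5: about 2.5 GB for A plus about 2.0 GB for the C slice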

After some further study, we can speed up the host-code matmul operation by switching from the int8 datatype to float32. This makes that operation quite a bit faster, but the GPU code still seems to be ~4x faster than that:

$ cat t64.py
from numba import cuda, uint8, int32
import numba as nb
import numpy as np
import math
import time

TPB = 32
TPB1 = 33

@cuda.jit()
def bit_slice_A_AT(A, C, row_offset):
    sA = cuda.shared.array((TPB, TPB), dtype=uint8)
    sB = cuda.shared.array((TPB, TPB1), dtype=uint8)
    x, y = cuda.grid(2)
    tx = cuda.threadIdx.x
    ty = cuda.threadIdx.y
    bx = cuda.blockIdx.x
    by = cuda.blockIdx.y
    tmp = int32(0)
    for i in range((A.shape[1]+TPB-1) // TPB):
        if y+row_offset < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sA[ty, tx] = A[y+row_offset, i*TPB+tx]
        else:
            sA[ty, tx] = 0
        if (TPB*bx+ty) < A.shape[0] and (i*TPB+tx) < A.shape[1]:
            sB[ty, tx] = A[TPB*bx+ty, i*TPB+tx]
        else:
            sB[ty, tx] = 0
        cuda.syncthreads()
        for j in range(TPB):
            tmp1 = sA[ty,j] & sB[tx, j]
            test = uint8(1)
            for k in range(8):
                if (tmp1 & test) > 0:
                    tmp += 1
                test <<= 1
        cuda.syncthreads()
    if y < C.shape[0] and x < C.shape[1]:
        C[y, x] = tmp

@nb.njit('void(uint8[:,:],float32[:,:])', parallel=True)
def bitpack(A, AU):
    for i in nb.prange(A.shape[0]):
        for j in range(A.shape[1]):
            offset = j * 8
            res = int(AU[i,offset]) << 7
            res |= int(AU[i,offset+1]) << 6
            res |= int(AU[i,offset+2]) << 5
            res |= int(AU[i,offset+3]) << 4
            res |= int(AU[i,offset+4]) << 3
            res |= int(AU[i,offset+5]) << 2
            res |= int(AU[i,offset+6]) << 1
            res |= int(AU[i,offset+7])
            A[i,j] = res

colm = 1000
rows = 6000
cols = int(colm*8)
row_slice = 512
print('host mem: ', (rows*cols*4+rows*colm+rows*rows*4+rows*row_slice*4)//1048576, 'MB device mem: ', (((rows*cols)//8)+row_slice*rows*4)//1048576, 'MB')
t1 = time.time()
AU = np.random.randint(2,size=(rows, cols),dtype=np.int8)
AU = AU.astype(np.float32)
print("randint:" + str(time.time()-t1))
t1 = time.time()
#CU = np.empty((rows, rows), dtype=np.int32)
CU = np.matmul(AU,AU.T,dtype=np.float32)
print("matmul:" + str(time.time()-t1))
t1 = time.time()
A = np.empty((rows,colm), dtype=np.uint8)
print("np.empty:" + str(time.time()-t1))
t1 = time.time()
bitpack(A, AU)
print("bitpack:" + str(time.time()-t1))
t1 = time.time()
A_dev = cuda.to_device(A)
threads_per_block = (TPB, TPB)
C = np.empty((row_slice, A.shape[0]), dtype=np.int32)
blocks_per_grid_x = int(math.ceil(A.shape[0] / threads_per_block[0]))
blocks_per_grid_y = int(row_slice / threads_per_block[1])
blocks_per_grid = (blocks_per_grid_x, blocks_per_grid_y)
for i in range((A.shape[0]+row_slice-1)//row_slice):
    bit_slice_A_AT[blocks_per_grid, threads_per_block](A_dev, C, i*row_slice)
    lower = i*row_slice
    upper = min(lower+row_slice, CU.shape[0])
    width = upper-lower
    test = np.array_equal(C[:width,:], CU[i*row_slice:i*row_slice+width,:])
    print(test)
cuda.synchronize()
C_dev = cuda.device_array_like(C)
start_gpu_shared_memory = time.time()
for i in range((A.shape[0]+row_slice-1)//row_slice):
    bit_slice_A_AT[blocks_per_grid, threads_per_block](A_dev, C_dev, i*row_slice)
cuda.synchronize()
end_gpu_shared_memory = time.time()

time_gpu_shared = end_gpu_shared_memory - start_gpu_shared_memory
print("GPU time(shared memory):" + str(time_gpu_shared))
$ python t64.py
host mem:  337 MB device mem:  17 MB
randint:0.1817936897277832
matmul:3.498671531677246
np.empty:7.62939453125e-05
bitpack:0.03707313537597656
True
True
True
True
True
True
True
True
True
True
True
True
GPU time(shared memory):0.8318064212799072
$

I haven't thoroughly tested these codes. Bugs may exist. Use at your own risk.

For attribution, the numba bit-packing code seems to have come from here.

Robert Crovella
  • Amazing idea. I really appreciate the response. The second solution is ingenious, though it takes a long time on the CPU since it has to traverse the matrix and assign bits. The first solution does work great. Since most of the values in my matrix will be 0, I am currently looking into doing sparse matrix multiplication, as that way I can save a lot of space. – secretive Apr 19 '21 at 19:51
  • If I change my instance to one with 4 GPUs each having 8GB, will Numba use all of them by default for GPU memory i.e. total GPU memory 32gigs? – secretive Apr 20 '21 at 16:23
  • no it will not. That requires explicit multi-GPU programming. – Robert Crovella Apr 20 '21 at 16:58
  • Please, can you check the update? Thanks. – secretive Apr 23 '21 at 13:47
  • A should use about 2.5GB. C should use 100,000 * 100,000 * 4 bytes = 40GB. So that is your issue. Your original problem formulation had a much smaller dimension for the rows of A (A.shape[0]). When you went from 10k rows or 30k rows to 100k rows, that made the output array blow up. Please leave in the line which calculates and prints the memory required, then you won't have to ask me via the comments to make the calculation. There is nothing I can do to fix this. If you change the rows back to 10k or 30k, I think your code should not hit the out of memory error. – Robert Crovella Apr 23 '21 at 14:06
  • Yep, I calculated it separately and don't have it as part of the main code. I will have to try with 100k IDs; I was just curious if there was anything else that could be done. If there isn't, I appreciate the help. I might try to break the huge matrix into chunks, since I only want to calculate the matrix product where there is a chance that the result will be higher than a certain threshold. If it won't be higher, I will just skip those parts. Let's see if that works. – secretive Apr 23 '21 at 14:23
  • Essentially, the matrix is sort of like a one-hot encoded matrix where the row represents a person and the column is whether the person has an item. The data is extremely sparse, so if I arrange the rows and columns in a way that most of the people are separable from each other, I don't have to compute the dot product between them. So the matrix will be divided into chunks and then I only perform multiplication on certain chunks. I hope I am making sense here, since it might be difficult to understand without the whole context. – secretive Apr 23 '21 at 14:45
  • I needed help. So I don't have to create a full 100k*100k C matrix... I can send two matrices to the function, where one matrix will be the original matrix with all rows and the second matrix will be the transpose of 5k rows. I can easily multiply this over the GPU, save it back to a file, reset C, and do this in a loop. Or I can also create a RAM variable and write to that. This will solve my scaling issue. Can you please help? I tried to do it on my own but couldn't figure it out. Really appreciate the help. I am running your bit-mapping code where I scaled the bit mapping. – secretive Apr 23 '21 at 21:53
  • I've made an update, showing how to compute row "slices" of the C matrix at a time. – Robert Crovella Apr 24 '21 at 21:50
  • Thanks for the update. For a 100k by 200k matrix, when sliced, it will take roughly about 13 hours to run. I created a main matrix CU and then I am assigning the slice computation from the GPU to the CPU array. Sorry for so many questions, but I am totally new to this and working on an AWS instance, so I can't really debug everything to understand it properly. Is there any way to make this faster or even more parallel? Even if I increase my GPU to, let's say, 16GB, I am guessing it will still take this much time. – secretive Apr 27 '21 at 18:55
  • BTW I made this new question as well https://stackoverflow.com/questions/67289661/scale-parallelize-further-using-cuda-or-numba as this was getting stretched deep into comments – secretive Apr 27 '21 at 19:58
  • Had an idea... I was working with pyspark with a sparse representation and it hit me that this problem can be tackled that way too, with a different approach. Instead of matrix multiplication, what if I have the indexes of items for every entity and I do an intersection between that entity and the others? Since my matrix is highly sparse, I don't have to deal with a lot of zeros this way, and no multiplication either, just an intersection... can you take a look? Thank you. https://stackoverflow.com/questions/67407420/list-intersection-over-gpu – secretive May 05 '21 at 19:16
  • Thanks a lot for all the help. I tried to use the symmetric idea along with computing the row slice, but since we are now computing things in row slices, i.e. smaller matrices, it cannot be done (or at least I couldn't do it). I use the symmetric idea when the total rows can all fit into the GPU at the same time, and it reduces computation time by half. If the total rows are more than row_slice, then I cannot use the symmetric-matrix calculation idea. But thank you for the help; I learned a lot from this. – secretive May 21 '21 at 21:19