
I am trying to sum the values of many vectors using CUDA Python. I found one solution using shared memory here. Is there a way to do this without shared memory (because of the small amount of shared memory available)? My vector sizes are:

N = 1000          # length of each vector
i = 300000        # number of vectors
v[i] = [1, 2, ..., N]

As a result I need to get:

out = [sum(v[1]), sum(v[2]), ..., sum(v[i])]
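
In NumPy terms, the result I am after is something like this (just an illustration, with the vectors stacked as rows of one array):

import numpy as np

vecs = np.ones((300000, 1000), dtype=np.float32)  # 300000 vectors of length 1000
out = vecs.sum(axis=1)                            # one sum per vector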

Thanks for any advice :)

  • The solution you linked to is not restricted by shared memory size. It can sum any size of input per block. The shared memory is only used in the final block-wise reduction sum. – talonmies May 22 '18 at 08:57
  • Then is there any simpler way to do this? – May 22 '18 at 11:09

1 Answer


To perform multiple reductions at once, and for the problem dimensions you indicate, it will be important to understand whether your vectors are stored row-wise or column-wise in memory.
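
As a quick illustration of the two layouts (the array names here are mine, for illustration only):

import numpy as np

N  = 1000    # vector length
NV = 300000  # number of vectors

# row-wise: vector i is row i, vecs_row[i, :]; its elements sit contiguously
vecs_row = np.empty((NV, N), dtype=np.float32)

# column-wise: vector i is column i, vecs_col[:, i]; element j of every
# vector sits contiguously, which adjacent GPU threads can read coalesced
vecs_col = np.empty((N, NV), dtype=np.float32)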

For the row-wise storage method, a block-wise parallel reduction method should be pretty fast. Each block will perform a standard sweep-based parallel reduction for a single vector, then write the result as a single number to the output.

For the column-wise storage method, for the problem dimensions you indicate (in particular, a "large" number of vectors), it will be efficient to have each thread perform the reduction across the vector using a simple loop traversing the column. Adjacent threads then read adjacent memory locations at each step of the loop, so the loads are fully coalesced.

Here is a worked example of both methods:

# cat t7.py
import numpy as np
import numba as nb
from numba import cuda,float32,int32

#vector length
N = 1000
#number of vectors
NV = 300000
#number of threads per block - must be a power of 2 less than or equal to 1024
threadsperblock = 256
#for vectors arranged row-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    bid = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
    # load shared memory with vector using a block-stride loop
    lid = tid
    sm[tid] = 0
    while lid < N:
        sm[tid] += vecs[bid, lid]
        lid += bdim
    cuda.syncthreads()
    # perform shared memory sweep reduction
    sweep = bdim//2
    while sweep > 0:
        if tid < sweep:
            sm[tid] += sm[tid + sweep]
        sweep = sweep//2
        cuda.syncthreads()
    if tid == 0:
        sums[bid] = sm[0]

#for vectors arranged column-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_col(vecs, sums):
    idx = cuda.grid(1)
    if idx >= NV:
        return
    temp = 0
    for i in range(N):
        temp += vecs[i,idx]
    sums[idx] = temp

#perform row-test
rvecs  = np.ones((NV, N), dtype=np.float32)
sums   = np.zeros(NV, dtype=np.float32)
d_rvecs = cuda.to_device(rvecs)
d_sums = cuda.device_array_like(sums)
vec_sum_row[NV, threadsperblock](d_rvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])

#perform column-test
cvecs = np.ones((N, NV), dtype=np.float32)
d_cvecs = cuda.to_device(cvecs)
vec_sum_col[(NV+threadsperblock-1)//threadsperblock, threadsperblock](d_cvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])
# python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
# nvprof python t7.py
==5931== NVPROF is profiling process 5931, command: python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
==5931== Profiling application: python t7.py
==5931== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.20%  1.12464s         2  562.32ms  557.25ms  567.39ms  [CUDA memcpy HtoD]
                    0.59%  6.6881ms         1  6.6881ms  6.6881ms  6.6881ms  cudapy::__main__::vec_sum_row$241(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
                    0.20%  2.2250ms         1  2.2250ms  2.2250ms  2.2250ms  cudapy::__main__::vec_sum_col$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
                    0.02%  212.83us         2  106.42us  104.45us  108.38us  [CUDA memcpy DtoH]
      API calls:   73.60%  1.12571s         2  562.85ms  557.77ms  567.94ms  cuMemcpyHtoD
                   25.30%  386.91ms         1  386.91ms  386.91ms  386.91ms  cuDevicePrimaryCtxRetain
                    0.64%  9.8042ms         2  4.9021ms  2.6113ms  7.1929ms  cuMemcpyDtoH
                    0.23%  3.4945ms         3  1.1648ms  182.38us  1.6636ms  cuMemAlloc
                    0.07%  999.98us         2  499.99us  62.409us  937.57us  cuLinkCreate
                    0.04%  678.12us         2  339.06us  331.01us  347.12us  cuModuleLoadDataEx
                    0.03%  458.51us         1  458.51us  458.51us  458.51us  cuMemGetInfo
                    0.03%  431.28us         4  107.82us  98.862us  120.58us  cuDeviceGetName
                    0.03%  409.59us         2  204.79us  200.33us  209.26us  cuLinkAddData
                    0.03%  393.75us         2  196.87us  185.18us  208.56us  cuLinkComplete
                    0.01%  218.68us         2  109.34us  79.726us  138.96us  cuLaunchKernel
                    0.00%  14.052us         3  4.6840us     406ns  11.886us  cuDeviceGetCount
                    0.00%  13.391us        12  1.1150us     682ns  1.5910us  cuDeviceGetAttribute
                    0.00%  13.207us         8  1.6500us  1.0110us  3.1970us  cuDeviceGet
                    0.00%  6.6800us        10     668ns     366ns  1.6910us  cuFuncGetAttribute
                    0.00%  6.3560us         1  6.3560us  6.3560us  6.3560us  cuCtxPushCurrent
                    0.00%  4.1940us         2  2.0970us  1.9810us  2.2130us  cuModuleGetFunction
                    0.00%  4.0220us         4  1.0050us     740ns  1.7010us  cuDeviceComputeCapability
                    0.00%  2.5810us         2  1.2900us  1.1740us  1.4070us  cuLinkDestroy
#

If you have a choice of storage methods, the column-wise storage is preferred for performance, since its loads are fully coalesced. In the above example, the row-sum kernel took about 6.7ms whereas the column-sum kernel took about 2.2ms. The row-wise method above could possibly be improved somewhat by launching a smaller number of blocks and having each block perform multiple reductions using a loop (see the sketch below), but it is unlikely to be faster than the column method.
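
Here is a sketch of that looped variant, reusing the definitions from t7.py above (the block count of 160 is an arbitrary choice, not tuned):

#looped row-wise variant: launch a fixed number of blocks and let each
#block reduce several vectors via a grid-stride loop
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row_looped(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
    vec = cuda.blockIdx.x
    while vec < NV:
        # block-stride load of vector 'vec' into shared memory
        lid = tid
        sm[tid] = 0
        while lid < N:
            sm[tid] += vecs[vec, lid]
            lid += bdim
        cuda.syncthreads()
        # standard sweep reduction, as in vec_sum_row
        sweep = bdim//2
        while sweep > 0:
            if tid < sweep:
                sm[tid] += sm[tid + sweep]
            sweep = sweep//2
            cuda.syncthreads()
        if tid == 0:
            sums[vec] = sm[0]
        # ensure sm[0] has been consumed before the next vector reuses sm
        cuda.syncthreads()
        vec += cuda.gridDim.x

vec_sum_row_looped[160, threadsperblock](d_rvecs, d_sums)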

Note that this code requires about 1.5GB of storage for each test (row and column), so it won't run as-is on a GPU that has a very small amount of memory (e.g. a 2GB GPU). You may be able to get it to run on a small-memory GPU by doing only the row test or only the column test, or by reducing the number of vectors and processing them in chunks, for example (one way to do that chunking is sketched below).
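
A sketch of that chunked approach for the row-wise case, reusing rvecs, vec_sum_row and threadsperblock from above (the chunk size of 50000 is an arbitrary choice; size it to your GPU):

#process the vectors in chunks so only one chunk is resident on the GPU
chunk = 50000  # vectors per chunk; 50000*1000*4 bytes is ~200MB on device
sums = np.zeros(NV, dtype=np.float32)
for start in range(0, NV, chunk):
    stop = min(start + chunk, NV)
    d_part = cuda.to_device(rvecs[start:stop])  # copy just this chunk
    d_out = cuda.device_array(stop - start, dtype=np.float32)
    vec_sum_row[stop - start, threadsperblock](d_part, d_out)
    sums[start:stop] = d_out.copy_to_host()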

Robert Crovella
  • Thank you very much Robert :). That solves my problem. The computation of the sum will be row-wise. My GPU has 2GB of memory, which is why the number of vectors is about 300000. – May 22 '18 at 17:44