To perform many reductions at once, at the problem dimensions you indicate, the important question is whether your vectors are stored row-wise or column-wise in memory.
For row-wise storage, a block-per-vector parallel reduction should be quite fast: each block performs a standard sweep-based parallel reduction on a single vector, then writes the result as a single number to the output.
For column-wise storage, given the problem dimensions you indicate (in particular, a "large" number of vectors), it is efficient to assign one thread per vector and have it reduce its vector with a simple loop down the column; adjacent threads then read adjacent memory locations, giving coalesced access.
Here is a worked example of both methods:
# cat t7.py
import numpy as np
import numba as nb
from numba import cuda,float32,int32
#vector length
N = 1000
#number of vectors
NV = 300000
#number of threads per block - must be a power of 2 less than or equal to 1024
threadsperblock = 256
#for vectors arranged row-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    bid = cuda.blockIdx.x
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
    # load shared memory with vector using block-stride loop
    lid = tid
    sm[lid] = 0
    while lid < N:
        sm[tid] += vecs[bid, lid]
        lid += bdim
    cuda.syncthreads()
    # perform shared memory sweep reduction
    sweep = bdim//2
    while sweep > 0:
        if tid < sweep:
            sm[tid] += sm[tid + sweep]
        sweep = sweep//2
        cuda.syncthreads()
    if tid == 0:
        sums[bid] = sm[0]
#for vectors arranged column-wise
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_col(vecs, sums):
    idx = cuda.grid(1)
    if idx >= NV:
        return
    temp = 0
    for i in range(N):
        temp += vecs[i, idx]
    sums[idx] = temp
#perform row-test
rvecs = np.ones((NV, N), dtype=np.float32)
sums = np.zeros(NV, dtype=np.float32)
d_rvecs = cuda.to_device(rvecs)
d_sums = cuda.device_array_like(sums)
vec_sum_row[NV, threadsperblock](d_rvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])
#perform column-test
cvecs = np.ones((N, NV), dtype=np.float32)
d_cvecs = cuda.to_device(cvecs)
vec_sum_col[(NV+threadsperblock-1)//threadsperblock, threadsperblock](d_cvecs, d_sums)
d_sums.copy_to_host(sums)
print(sums[:8])
# python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
# nvprof python t7.py
==5931== NVPROF is profiling process 5931, command: python t7.py
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
[1000. 1000. 1000. 1000. 1000. 1000. 1000. 1000.]
==5931== Profiling application: python t7.py
==5931== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 99.20% 1.12464s 2 562.32ms 557.25ms 567.39ms [CUDA memcpy HtoD]
0.59% 6.6881ms 1 6.6881ms 6.6881ms 6.6881ms cudapy::__main__::vec_sum_row$241(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
0.20% 2.2250ms 1 2.2250ms 2.2250ms 2.2250ms cudapy::__main__::vec_sum_col$242(Array<float, int=2, A, mutable, aligned>, Array<float, int=1, A, mutable, aligned>)
0.02% 212.83us 2 106.42us 104.45us 108.38us [CUDA memcpy DtoH]
API calls: 73.60% 1.12571s 2 562.85ms 557.77ms 567.94ms cuMemcpyHtoD
25.30% 386.91ms 1 386.91ms 386.91ms 386.91ms cuDevicePrimaryCtxRetain
0.64% 9.8042ms 2 4.9021ms 2.6113ms 7.1929ms cuMemcpyDtoH
0.23% 3.4945ms 3 1.1648ms 182.38us 1.6636ms cuMemAlloc
0.07% 999.98us 2 499.99us 62.409us 937.57us cuLinkCreate
0.04% 678.12us 2 339.06us 331.01us 347.12us cuModuleLoadDataEx
0.03% 458.51us 1 458.51us 458.51us 458.51us cuMemGetInfo
0.03% 431.28us 4 107.82us 98.862us 120.58us cuDeviceGetName
0.03% 409.59us 2 204.79us 200.33us 209.26us cuLinkAddData
0.03% 393.75us 2 196.87us 185.18us 208.56us cuLinkComplete
0.01% 218.68us 2 109.34us 79.726us 138.96us cuLaunchKernel
0.00% 14.052us 3 4.6840us 406ns 11.886us cuDeviceGetCount
0.00% 13.391us 12 1.1150us 682ns 1.5910us cuDeviceGetAttribute
0.00% 13.207us 8 1.6500us 1.0110us 3.1970us cuDeviceGet
0.00% 6.6800us 10 668ns 366ns 1.6910us cuFuncGetAttribute
0.00% 6.3560us 1 6.3560us 6.3560us 6.3560us cuCtxPushCurrent
0.00% 4.1940us 2 2.0970us 1.9810us 2.2130us cuModuleGetFunction
0.00% 4.0220us 4 1.0050us 740ns 1.7010us cuDeviceComputeCapability
0.00% 2.5810us 2 1.2900us 1.1740us 1.4070us cuLinkDestroy
#
If you have a choice of storage methods, column-wise storage is preferred for performance. In the above example, the row-sum kernel took about 6.7ms whereas the column-sum kernel took about 2.2ms. The row-wise method above could possibly be improved somewhat by launching a smaller number of blocks and having each block reduce several vectors in a loop (a sketch of that variant follows), but it is unlikely to beat the column method.
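Here is a minimal sketch (not benchmarked) of that variant. It reuses the N, NV, threadsperblock, d_rvecs and d_sums names from the listing above; the block count of 160 in the launch is an arbitrary illustrative choice, not a tuned value:

#for vectors arranged row-wise, a fixed number of blocks striding over the rows
@cuda.jit('void(float32[:,:], float32[:])')
def vec_sum_row_multi(vecs, sums):
    sm = cuda.shared.array(threadsperblock, float32)
    tid = cuda.threadIdx.x
    bdim = cuda.blockDim.x
    # each block handles rows blockIdx.x, blockIdx.x + gridDim.x, ...
    row = cuda.blockIdx.x
    while row < NV:
        # load this row into shared memory with a block-stride loop
        sm[tid] = 0
        lid = tid
        while lid < N:
            sm[tid] += vecs[row, lid]
            lid += bdim
        cuda.syncthreads()
        # standard shared memory sweep reduction
        sweep = bdim//2
        while sweep > 0:
            if tid < sweep:
                sm[tid] += sm[tid + sweep]
            sweep = sweep//2
            cuda.syncthreads()
        if tid == 0:
            sums[row] = sm[0]
        cuda.syncthreads()
        row += cuda.gridDim.x

vec_sum_row_multi[160, threadsperblock](d_rvecs, d_sums)

This keeps the number of resident blocks modest when NV is very large, at the cost of a slightly more complex kernel.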
Note that each test (row and column) requires about 1.5GB of storage, so the code won't run as-is on a GPU with very little memory (e.g. a 2GB GPU). You may be able to get it to run on a small-memory GPU by doing only the row test or only the column test, or by reducing the number of vectors, for example.
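As a rough check on that figure, the dominant allocation in each test is a single NV-by-N (or N-by-NV) float32 matrix, held once on the host and once on the device:

# back-of-the-envelope size of the matrix used by each test
N, NV = 1000, 300000
matrix_bytes = N * NV * 4      # 4 bytes per float32 element
print(matrix_bytes / 2**30)    # ~1.12 GiB per copy of the matrix
# reducing NV (the number of vectors) shrinks this proportionally

so cutting NV (or N) down is the most direct way to fit the example on a small-memory GPU.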