The two most important optimization goals for any CUDA program should be to:
- expose (sufficient) parallelism
- make efficient use of memory
There are certainly many other things that can be considered during optimization, but these are the two most important items to address first.
A merge operation (not quite the same thing as a merge sort) is, at first glance, an inherently serial operation. We cannot properly decide which item to choose next, from either the A or B input array, to place in the output array C, until we have made all previous selections for C. In this respect, the merge algorithm (in this realization) makes it difficult to expose parallelism, and the paper linked in the question is almost entirely focused on that topic.
The goal of the algorithm described in the paper is to decompose the two input vectors A and B into multiple smaller pieces that can be worked on independently, so as to expose parallelism. In particular, the goal is to keep all the SMs in a GPU busy, and keep all the SPs in an SM busy. Once a sufficient decomposition of work is performed, each SP ultimately performs a sequential merge (as mentioned in the paper):
- Merging stage - Each core merges the two sub arrays that it has been given using the same algorithm as a simple sequential merge.
However, as you've pointed out, what you want to do is somewhat different. You already have many arrays, and you want to perform independent merge operations on those many arrays. Since your array count is ~100,000, this is enough independent pieces of work to consider mapping each one to a GPU SP (i.e. thread). This means that, just as in the paper, we can use a simple sequential merge on each core/SP/thread. So in your case, the problem of exposing parallelism is already handled (perhaps to a sufficient degree).
At this point we could consider implementing this as-is. The code I show later offers this as a starting point for comparison. However, what we discover is that the performance is not very good, because a merge algorithm fundamentally has a data-dependent access sequence, so it is (more) difficult to arrange for coalesced access on the GPU. The authors of the paper propose to mitigate this problem by first reading the data (in a coalesced fashion) into shared memory, and then having the algorithm work on it out of shared memory, where the penalty for disorganized access is smaller.
I'll propose a different approach:
- arrange the sequential merge algorithm so that each element of A and B need only be read once
- arrange the storage of A, B, and C in column-major form as opposed to the more "natural" row-major storage that one might consider. This is effectively transposing the storage matrices for A, B, and C vectors. This allows for an improvement in coalesced access, as the GPU threads navigate their way through the merging operation on their individual A and B vectors. It's far from perfect, but the improvement is substantial.
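To make the access-pattern difference concrete, here is a tiny illustrative sketch (the rm_load/cm_load names are invented for this sketch, not part of the real code below) of where thread t finds element k of its A vector in each layout:
// row-major: each thread owns a contiguous slice; threads in a warp read
// addresses that are len elements apart, so their loads do not coalesce well
__device__ int rm_load(const int *a, int t, int k, int len)     { return a[t*len + k]; }

// column-major (transposed): element k of every thread's vector is adjacent in
// memory, so threads with consecutive t read consecutive addresses and their
// loads can coalesce (when the warp's threads are at similar merge positions)
__device__ int cm_load(const int *a, int t, int k, int num_arr) { return a[k*num_arr + t]; }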
Here's a worked example that implements the above idea, running a simple sequential merge in each thread, and each thread merging one of the A vectors with one of the B vectors:
$ cat t784.cu
#include <stdio.h>
#include <stdlib.h>
#include <thrust/sort.h>
#include <thrust/merge.h>
#define NUM_SETS 100000
#define DSIZE 100
typedef int mytype;
// for ascending sorted data
#define cmp(A,B) ((A)<(B))
#define nTPB 512
#define nBLK 128
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
// sequential merge of sorted vectors a (len_a elements) and b (len_b elements)
// into c; each element of a and b is read only once.  The optional stride
// parameters allow the vectors to be stored in column-major (transposed) form,
// with successive elements of a given vector separated by a fixed stride.
template <typename T>
__host__ __device__ void smerge(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, const unsigned len_a, const unsigned len_b, const unsigned stride_a = 1, const unsigned stride_b = 1, const unsigned stride_c = 1){
  unsigned len_c = len_a+len_b;
  unsigned nc = 0;
  unsigned na = 0;
  unsigned nb = 0;
  unsigned fa = (len_b == 0);  // set when only a remains
  unsigned fb = (len_a == 0);  // set when only b remains
  T nxta = a[0];
  T nxtb = b[0];
  while (nc < len_c){
    // the guards on the lookahead reads avoid reading past the end of a or b
    if (fa)      {c[stride_c*nc++] = nxta; na++; if (na < len_a) nxta = a[stride_a*na];}
    else if (fb) {c[stride_c*nc++] = nxtb; nb++; if (nb < len_b) nxtb = b[stride_b*nb];}
    else if (cmp(nxta,nxtb)){
      c[stride_c*nc++] = nxta;
      na++;
      if (na == len_a) fb++;
      else nxta = a[stride_a*na];}
    else {
      c[stride_c*nc++] = nxtb;
      nb++;
      if (nb == len_b) fa++;
      else nxtb = b[stride_b*nb];}}
}
// "row-major" kernel: each thread merges one contiguous A vector with one
// contiguous B vector, using a grid-stride loop over the num_arr sets
template <typename T>
__global__ void rmtest(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, int num_arr, int len){
  int idx=threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < num_arr){
    int sel=idx*len;
    smerge(a+sel, b+sel, c+(2*sel), len, len);
    idx += blockDim.x*gridDim.x;}
}
// "column-major" kernel: the vectors are stored transposed, so a thread's
// successive elements are num_arr apart and adjacent threads touch adjacent addresses
template <typename T>
__global__ void cmtest(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, int num_arr, int len, int stride_a, int stride_b, int stride_c){
  int idx=threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < num_arr){
    smerge(a+idx, b+idx, c+idx, len, len, stride_a, stride_b, stride_c);
    idx += blockDim.x*gridDim.x;}
}
// validate the row-major GPU result against a host thrust::merge per set
// (with -DTIMING the comparison is skipped, leaving just the CPU merge for timing)
template <typename T>
int rmvalidate(T *a, T *b, T *c, int num_arr, int len){
  T *vc = (T *)malloc(2*len*sizeof(T));
  for (int i = 0; i < num_arr; i++){
    thrust::merge(a+(i*len), a+((i+1)*len), b+(i*len), b+((i+1)*len), vc);
#ifndef TIMING
    for (int j = 0; j < len*2; j++)
      if (vc[j] != c[(i*2*len)+j]) {printf("rm mismatch i: %d, j: %d, was: %d, should be: %d\n", i, j, c[(i*2*len)+j], vc[j]); return 0;}
#endif
    }
  free(vc);
  return 1;
}
// check that the column-major result matches the (already validated) row-major result
template <typename T>
int cmvalidate(const T *c1, const T *c2, int num_arr, int len){
  for (int i = 0; i < num_arr; i++)
    for (int j = 0; j < 2*len; j++)
      if (c1[i*(2*len)+j] != c2[j*(num_arr)+i]) {printf("cm mismatch i: %d, j: %d, was: %d, should be: %d\n", i, j, c2[j*(num_arr)+i], c1[i*(2*len)+j]); return 0;}
  return 1;
}
int main(){
  mytype *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
  h_a = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
  h_b = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
  h_c = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype)*2);
  cudaMalloc(&d_a, (DSIZE*NUM_SETS+1)*sizeof(mytype));
  cudaMalloc(&d_b, (DSIZE*NUM_SETS+1)*sizeof(mytype));
  cudaMalloc(&d_c, DSIZE*NUM_SETS*sizeof(mytype)*2);
  // test "row-major" storage
  for (int i = 0; i < DSIZE*NUM_SETS; i++){
    h_a[i] = rand();
    h_b[i] = rand();}
  thrust::sort(h_a, h_a+DSIZE*NUM_SETS);
  thrust::sort(h_b, h_b+DSIZE*NUM_SETS);
  cudaMemcpy(d_a, h_a, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
  unsigned long long gtime = dtime_usec(0);
  rmtest<<<nBLK, nTPB>>>(d_a, d_b, d_c, NUM_SETS, DSIZE);
  cudaDeviceSynchronize();
  gtime = dtime_usec(gtime);
  cudaMemcpy(h_c, d_c, DSIZE*NUM_SETS*2*sizeof(mytype), cudaMemcpyDeviceToHost);
  unsigned long long ctime = dtime_usec(0);
  if (!rmvalidate(h_a, h_b, h_c, NUM_SETS, DSIZE)) {printf("fail!\n"); return 1;}
  ctime = dtime_usec(ctime);
  printf("CPU time: %f, GPU RM time: %f\n", ctime/(float)USECPSEC, gtime/(float)USECPSEC);
// test "col-major" storage
mytype *ch_a, *ch_b, *ch_c;
ch_a = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
ch_b = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
ch_c = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
for (int i = 0; i < NUM_SETS; i++)
for (int j = 0; j < DSIZE; j++){
ch_a[j*NUM_SETS+i] = h_a[i*DSIZE+j];
ch_b[j*NUM_SETS+i] = h_b[i*DSIZE+j];}
cudaMemcpy(d_a, ch_a, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, ch_b, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
gtime = dtime_usec(0);
cmtest<<<nBLK, nTPB>>>(d_a, d_b, d_c, NUM_SETS, DSIZE, NUM_SETS, NUM_SETS, NUM_SETS );
cudaDeviceSynchronize();
gtime = dtime_usec(gtime);
cudaMemcpy(ch_c, d_c, DSIZE*NUM_SETS*2*sizeof(mytype), cudaMemcpyDeviceToHost);
if (!cmvalidate(h_c, ch_c, NUM_SETS, DSIZE)) {printf("fail!\n"); return 1;}
printf("GPU CM time: %f\n", gtime/(float)USECPSEC);
return 0;
}
$ nvcc -O3 -DTIMING -o t784 t784.cu
$ ./t784
CPU time: 0.030691, GPU RM time: 0.045814
GPU CM time: 0.002784
$
Notes:
- The GPU is actually slower than the naive single-threaded CPU code when the memory organization is row-major. But for the column-major organization (which tends to improve opportunities for coalesced access), the GPU code is about 10x faster than the CPU code for my test case. This ~10x speedup is roughly in the range (~10-20x) of the speedup factors shown in the paper for GPU MergePath on 32-bit integers vs. x86 serial merge.
- Using int vs. float datatypes makes a significant difference in the CPU timing. int seems to be faster (on the CPU), so I'm showing that version here. (This disparity is mentioned in the paper as well.)
- The -DTIMING switch added to the compile command pares down the first validation function so that it just does the CPU merge operation, for timing.
- The basic merge code is templated to be able to handle different data types, and parameterized so that it can be used in either the column-major or row-major operation.
- I've dispensed with CUDA error checking for brevity of presentation. However, any time you're having trouble with a CUDA code, you should always use proper CUDA error checking.
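As a reference for that last point, a minimal checking pattern might look something like this (one common variant; the cudaCheck macro name is just for illustration):
#define cudaCheck(call) do { \
  cudaError_t err = (call); \
  if (err != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
    exit(1); } } while (0)

// wrap runtime API calls, and check kernel launches via cudaGetLastError/cudaDeviceSynchronize:
//   cudaCheck(cudaMemcpy(d_a, h_a, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice));
//   rmtest<<<nBLK, nTPB>>>(d_a, d_b, d_c, NUM_SETS, DSIZE);
//   cudaCheck(cudaGetLastError());
//   cudaCheck(cudaDeviceSynchronize());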
What about using thrust (as I suggested in the comments)? It should be possible to use thrust::merge with a suitable device/sequential execution policy, to more or less mimic what I have done above. However, thrust expects vectors to be contiguous, and so, without additional complexity, it could only be used in the row-major case, which we've seen is severely penalized by bad memory access patterns. It should be possible to create a set of permutation iterators in thrust that would allow the column-major, strided access that improves the memory scenario, but I have not pursued that.
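For what it's worth, an (untested) sketch of that idea might look something like the following: each thread builds strided permutation-iterator views of its A, B, and C vectors and runs thrust::merge with the thrust::seq policy. The col_map functor and thrust_cm_merge kernel are names invented for this sketch, and I haven't measured whether it matches the hand-written column-major kernel above.
#include <thrust/merge.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>

// map the j-th element of one thread's vector to its column-major location
struct col_map {
  int offset, stride;
  __host__ __device__ col_map(int o, int s) : offset(o), stride(s) {}
  __host__ __device__ int operator()(int j) const { return offset + j*stride; }
};

template <typename T>
__global__ void thrust_cm_merge(const T *a, const T *b, T *c, int num_arr, int len){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < num_arr){
    // strided "views" of this thread's A, B and C vectors (column idx)
    col_map m(idx, num_arr);
    auto ib = thrust::make_transform_iterator(thrust::make_counting_iterator(0), m);
    auto A  = thrust::make_permutation_iterator(a, ib);
    auto B  = thrust::make_permutation_iterator(b, ib);
    auto C  = thrust::make_permutation_iterator(c, ib);
    // per-thread sequential merge, analogous to smerge() above
    thrust::merge(thrust::seq, A, A+len, B, B+len, C);
    idx += blockDim.x*gridDim.x;}
}
The fancy iterators simply reproduce the idx + j*num_arr indexing that cmtest/smerge use, so in principle the memory access pattern should be the same; whether it performs as well in practice would need to be measured.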