The two most important optimization goals for any CUDA program should be to:
- expose (sufficient) parallelism
- make efficient use of memory
There are certainly many other things that can be considered during optimization, but these are the two most important items to address first.
A merge operation (not quite the same thing as a merge sort) is, at first glance, an inherently serial operation. We cannot properly decide which item to choose next, from either the A or B input array, to place in the output array C, until we have made all previous selections for C. In this respect, the merge algorithm (in this realization) makes it difficult to expose parallelism, and the paper linked in the question is almost entirely focused on that topic.
The goal of the algorithm described in the paper is to decompose the two input vectors A and B into multiple smaller pieces that can be worked on independently, so as to expose parallelism. In particular, the goal is to keep all the SMs in a GPU busy, and keep all the SPs in an SM busy. Once a sufficient decomposition of work is performed, each SP ultimately performs a sequential merge (as mentioned in the paper):
- Merging stage - Each core merges the two sub arrays that it has been given using the same algorithm as a simple sequential merge.
However, as you've pointed out, what you want to do is somewhat different. You already have many arrays, and you want to perform independent merge operations on those many arrays. Since your array count is ~100,000, this is enough independent pieces of work to consider mapping each one to a GPU SP (i.e. thread). This means that, just as in the paper, we can use a simple sequential merge on each core/SP/thread. So in your case, the problem of exposing parallelism is already handled (perhaps to a sufficient degree).
At this point we could consider implementing this as-is. The code I show later offers this as a starting point for comparison. However, what we discover is that the performance is not very good, because a merge algorithm fundamentally has a data-dependent access sequence, so it is (more) difficult to arrange for coalesced access on the GPU. The authors of the paper propose to mitigate this problem by first reading the data (in a coalesced fashion) into shared memory, and then having the algorithm work on it out of shared memory, where the penalty for disorganized access is smaller.
I'll propose a different approach:
- arrange the sequential merge algorithm so that each element of A and B need only be read once
- arrange the storage of A, B, and C in column-major form as opposed to the more "natural" row-major storage that one might consider. This is effectively transposing the storage matrices for A, B, and C vectors. This allows for an improvement in coalesced access, as the GPU threads navigate their way through the merging operation on their individual A and B vectors. It's far from perfect, but the improvement is substantial.
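To make the access-pattern difference concrete, here is a tiny illustrative sketch (the rm_load/cm_load names are invented for this sketch, not part of the real code below) of where thread t finds element k of its A vector in each layout:
// row-major: each thread owns a contiguous slice; threads in a warp read
// addresses that are len elements apart, so their loads do not coalesce well
__device__ int rm_load(const int *a, int t, int k, int len)     { return a[t*len + k]; }

// column-major (transposed): element k of every thread's vector is adjacent in
// memory, so threads with consecutive t read consecutive addresses and their
// loads can coalesce (when the warp's threads are at similar merge positions)
__device__ int cm_load(const int *a, int t, int k, int num_arr) { return a[k*num_arr + t]; }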
Here's a worked example that implements the above idea, running a simple sequential merge in each thread, and each thread merging one of the A vectors with one of the B vectors:
$ cat t784.cu
#include <stdio.h>
#include <stdlib.h>
#include <thrust/sort.h>
#include <thrust/merge.h>
#define NUM_SETS 100000
#define DSIZE 100
typedef int mytype;
// for ascending sorted data
#define cmp(A,B) ((A)<(B))
#define nTPB 512
#define nBLK 128
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){
  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
// sequential merge of sorted vectors a (len_a elements) and b (len_b elements)
// into c; each element of a and b is read only once.  The optional stride
// parameters allow the vectors to be stored in column-major (transposed) form,
// with successive elements of a given vector separated by a fixed stride.
template <typename T>
__host__ __device__ void smerge(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, const unsigned len_a, const unsigned len_b, const unsigned stride_a = 1, const unsigned stride_b = 1, const unsigned stride_c = 1){
  unsigned len_c = len_a+len_b;
  unsigned nc = 0;
  unsigned na = 0;
  unsigned nb = 0;
  unsigned fa = (len_b == 0);  // set when only a remains
  unsigned fb = (len_a == 0);  // set when only b remains
  T nxta = a[0];
  T nxtb = b[0];
  while (nc < len_c){
    // the guards on the lookahead reads avoid reading past the end of a or b
    if (fa)      {c[stride_c*nc++] = nxta; na++; if (na < len_a) nxta = a[stride_a*na];}
    else if (fb) {c[stride_c*nc++] = nxtb; nb++; if (nb < len_b) nxtb = b[stride_b*nb];}
    else if (cmp(nxta,nxtb)){
      c[stride_c*nc++] = nxta;
      na++;
      if (na == len_a) fb++;
      else nxta = a[stride_a*na];}
    else {
      c[stride_c*nc++] = nxtb;
      nb++;
      if (nb == len_b) fa++;
      else nxtb = b[stride_b*nb];}}
}
// "row-major" kernel: each thread merges one contiguous A vector with one
// contiguous B vector, using a grid-stride loop over the num_arr sets
template <typename T>
__global__ void rmtest(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, int num_arr, int len){
  int idx=threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < num_arr){
    int sel=idx*len;
    smerge(a+sel, b+sel, c+(2*sel), len, len);
    idx += blockDim.x*gridDim.x;}
}
// "column-major" kernel: the vectors are stored transposed, so a thread's
// successive elements are num_arr apart and adjacent threads touch adjacent addresses
template <typename T>
__global__ void cmtest(const T * __restrict__ a, const T * __restrict__ b, T * __restrict__ c, int num_arr, int len, int stride_a, int stride_b, int stride_c){
  int idx=threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < num_arr){
    smerge(a+idx, b+idx, c+idx, len, len, stride_a, stride_b, stride_c);
    idx += blockDim.x*gridDim.x;}
}
// validate the row-major GPU result against a host thrust::merge per set
// (with -DTIMING the comparison is skipped, leaving just the CPU merge for timing)
template <typename T>
int rmvalidate(T *a, T *b, T *c, int num_arr, int len){
  T *vc = (T *)malloc(2*len*sizeof(T));
  for (int i = 0; i < num_arr; i++){
    thrust::merge(a+(i*len), a+((i+1)*len), b+(i*len), b+((i+1)*len), vc);
#ifndef TIMING
    for (int j = 0; j < len*2; j++)
      if (vc[j] != c[(i*2*len)+j]) {printf("rm mismatch i: %d, j: %d, was: %d, should be: %d\n", i, j, c[(i*2*len)+j], vc[j]); return 0;}
#endif
    }
  free(vc);
  return 1;
}
// check that the column-major result matches the (already validated) row-major result
template <typename T>
int cmvalidate(const T *c1, const T *c2, int num_arr, int len){
  for (int i = 0; i < num_arr; i++)
    for (int j = 0; j < 2*len; j++)
      if (c1[i*(2*len)+j] != c2[j*(num_arr)+i]) {printf("cm mismatch i: %d, j: %d, was: %d, should be: %d\n", i, j, c2[j*(num_arr)+i], c1[i*(2*len)+j]); return 0;}
  return 1;
}
int main(){
  mytype *h_a, *h_b, *h_c, *d_a, *d_b, *d_c;
  h_a = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
  h_b = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
  h_c = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype)*2);
  cudaMalloc(&d_a, (DSIZE*NUM_SETS+1)*sizeof(mytype));
  cudaMalloc(&d_b, (DSIZE*NUM_SETS+1)*sizeof(mytype));
  cudaMalloc(&d_c, DSIZE*NUM_SETS*sizeof(mytype)*2);
  // test "row-major" storage
  for (int i = 0; i < DSIZE*NUM_SETS; i++){
    h_a[i] = rand();
    h_b[i] = rand();}
  thrust::sort(h_a, h_a+DSIZE*NUM_SETS);
  thrust::sort(h_b, h_b+DSIZE*NUM_SETS);
  cudaMemcpy(d_a, h_a, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
  unsigned long long gtime = dtime_usec(0);
  rmtest<<<nBLK, nTPB>>>(d_a, d_b, d_c, NUM_SETS, DSIZE);
  cudaDeviceSynchronize();
  gtime = dtime_usec(gtime);
  cudaMemcpy(h_c, d_c, DSIZE*NUM_SETS*2*sizeof(mytype), cudaMemcpyDeviceToHost);
  unsigned long long ctime = dtime_usec(0);
  if (!rmvalidate(h_a, h_b, h_c, NUM_SETS, DSIZE)) {printf("fail!\n"); return 1;}
  ctime = dtime_usec(ctime);
  printf("CPU time: %f, GPU RM time: %f\n", ctime/(float)USECPSEC, gtime/(float)USECPSEC);
// test "col-major" storage
mytype *ch_a, *ch_b, *ch_c;
ch_a = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
ch_b = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
ch_c = (mytype *)malloc(DSIZE*NUM_SETS*sizeof(mytype));
for (int i = 0; i < NUM_SETS; i++)
for (int j = 0; j < DSIZE; j++){
ch_a[j*NUM_SETS+i] = h_a[i*DSIZE+j];
ch_b[j*NUM_SETS+i] = h_b[i*DSIZE+j];}
cudaMemcpy(d_a, ch_a, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
cudaMemcpy(d_b, ch_b, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice);
gtime = dtime_usec(0);
cmtest<<<nBLK, nTPB>>>(d_a, d_b, d_c, NUM_SETS, DSIZE, NUM_SETS, NUM_SETS, NUM_SETS );
cudaDeviceSynchronize();
gtime = dtime_usec(gtime);
cudaMemcpy(ch_c, d_c, DSIZE*NUM_SETS*2*sizeof(mytype), cudaMemcpyDeviceToHost);
if (!cmvalidate(h_c, ch_c, NUM_SETS, DSIZE)) {printf("fail!\n"); return 1;}
printf("GPU CM time: %f\n", gtime/(float)USECPSEC);
return 0;
}
$ nvcc -O3 -DTIMING -o t784 t784.cu
$ ./t784
CPU time: 0.030691, GPU RM time: 0.045814
GPU CM time: 0.002784
$
Notes:
- The GPU is actually slower than the naive single-threaded CPU code when the memory organization is row-major. But for the column-major organization (which tends to improve opportunities for coalesced access), the GPU code is about 10x faster than the CPU code for my test case. This ~10x speedup is roughly in the range (~10-20x) of the speedup factors shown in the paper for GPU MergePath on 32-bit integers vs. x86 serial merge.
- Using int vs. float datatypes makes a significant difference in the CPU timing. int seems to be faster (on the CPU), so I'm showing that version here. (This disparity is mentioned in the paper as well.)
- The -DTIMING switch added to the compile command pares down the first validation function so that it just does the CPU merge operation, for timing.
- The basic merge code is templated to be able to handle different data types, and parameterized so that it can be used in either the column-major or row-major operation.
- I've dispensed with CUDA error checking for brevity of presentation. However, any time you're having trouble with a CUDA code, you should always use proper CUDA error checking.
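As a reference for that last point, a minimal checking pattern might look something like this (one common variant; the cudaCheck macro name is just for illustration):
#define cudaCheck(call) do { \
  cudaError_t err = (call); \
  if (err != cudaSuccess) { \
    fprintf(stderr, "CUDA error: %s at %s:%d\n", cudaGetErrorString(err), __FILE__, __LINE__); \
    exit(1); } } while (0)

// wrap runtime API calls, and check kernel launches via cudaGetLastError/cudaDeviceSynchronize:
//   cudaCheck(cudaMemcpy(d_a, h_a, DSIZE*NUM_SETS*sizeof(mytype), cudaMemcpyHostToDevice));
//   rmtest<<<nBLK, nTPB>>>(d_a, d_b, d_c, NUM_SETS, DSIZE);
//   cudaCheck(cudaGetLastError());
//   cudaCheck(cudaDeviceSynchronize());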
What about using thrust (as I suggested in the comments)? It should be possible to use thrust::merge with a suitable device/sequential execution policy, to more or less mimic what I have done above. However, thrust expects vectors to be contiguous, and so, without additional complexity, it could only be used in the row-major case, which we've seen is severely penalized by bad memory access patterns. It should be possible to create a set of permutation iterators in thrust that would allow the column-major, strided access that improves the memory scenario, but I have not pursued that.
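For what it's worth, an (untested) sketch of that idea might look something like the following: each thread builds strided permutation-iterator views of its A, B, and C vectors and runs thrust::merge with the thrust::seq policy. The col_map functor and thrust_cm_merge kernel are names invented for this sketch, and I haven't measured whether it matches the hand-written column-major kernel above.
#include <thrust/merge.h>
#include <thrust/execution_policy.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/transform_iterator.h>
#include <thrust/iterator/permutation_iterator.h>

// map the j-th element of one thread's vector to its column-major location
struct col_map {
  int offset, stride;
  __host__ __device__ col_map(int o, int s) : offset(o), stride(s) {}
  __host__ __device__ int operator()(int j) const { return offset + j*stride; }
};

template <typename T>
__global__ void thrust_cm_merge(const T *a, const T *b, T *c, int num_arr, int len){
  int idx = threadIdx.x+blockDim.x*blockIdx.x;
  while (idx < num_arr){
    // strided "views" of this thread's A, B and C vectors (column idx)
    col_map m(idx, num_arr);
    auto ib = thrust::make_transform_iterator(thrust::make_counting_iterator(0), m);
    auto A  = thrust::make_permutation_iterator(a, ib);
    auto B  = thrust::make_permutation_iterator(b, ib);
    auto C  = thrust::make_permutation_iterator(c, ib);
    // per-thread sequential merge, analogous to smerge() above
    thrust::merge(thrust::seq, A, A+len, B, B+len, C);
    idx += blockDim.x*gridDim.x;}
}
The fancy iterators simply reproduce the idx + j*num_arr indexing that cmtest/smerge use, so in principle the memory access pattern should be the same; whether it performs as well in practice would need to be measured.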