Here is one possible approach using __ballot(), as suggested by @tera:
1. In a coalesced fashion, load the data by tiles into shared memory.
2. Do an initial transpose in shared memory during the load, to arrange columns into rows, to facilitate the __ballot() operation.
3. Perform ballots to build the "final" row-wise data out of the column-wise data (that has been stored in rows).
4. Write the data back out.
Steps 1-3 should be fairly efficient on the GPU, at least as best I know how. Step 4 is hampered by the fact that a bit-wise transposition of bit-blocks does not yield a set of 32-bit quantities that are adjacent in memory. Therefore the memory access pattern on the global store is scattered, in the general case.
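As a minimal illustration of the ballot trick in step 3, here is a separate single-warp sketch (not the kernel in the full code below); it assumes an LSB-first bit layout, where bit b of word r holds matrix element (r, b) of a single 32x32 bit tile:

__global__ void tile32_transpose_sketch(const unsigned * __restrict__ in,
                                        const_cast_note: /* none */ unsigned * __restrict__ out){
    // launch with <<<1,32>>>; lane i of the warp holds row i of a 32x32 bit tile
    unsigned myrow = in[threadIdx.x];
    for (int c = 0; c < 32; c++){
        // bit i of the ballot result is bit c of lane i's word, i.e. element (i,c),
        // which is exactly row c of the transposed tile
        unsigned trow = __ballot_sync(0xFFFFFFFFU, myrow & (1U<<c));
        if (threadIdx.x == c) out[c] = trow;  // lane c keeps transposed row c
    }
}

The kernel in the full code below uses the opposite, MSB-first convention (bit 31 of a word is the first bit column, matching the CPU reference function), which is why it reverses the lane index when reading from shared memory and ballots from bit 31 down to bit 0.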
The output indexing operation was the most tedious to visualize/arrange.
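For reference, the store index in the kernel below unpacks as follows (using gridDim.y == idim/32 from the launch configuration):

int indx = (idx*idim) + (gridDim.y*threadIdx.y) + blockIdx.y;
         // = (32*idx + threadIdx.y)*(idim/32) + blockIdx.y

so each in-range thread writes output row 32*idx + threadIdx.y and output word-column blockIdx.y, in a row-major output of idim rows by idim/32 32-bit words.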
Following is a fully-worked example with 4 test cases. For the test cases I also wrote a naive CPU function to perform the "golden" test, borrowing from another idea in the comments under the question.
$ cat t435.cu
#include <stdio.h>
#include <stdlib.h>
#define IDX(d,x,y,ld) d[y*ld+x]
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL
unsigned long dtime_usec(unsigned long start){
    timeval tv;
    gettimeofday(&tv, 0);
    return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
__global__ void bt(const unsigned * __restrict__ in, unsigned * __restrict__ out, const unsigned idim){
// assumes "square" binary matrix transpose
// assumes kernel is launched with 1024 threads per block, 2D, 32x32
// assumes input matrix dimension idim is evenly divisible by 32 bits
// assumes idim is specified in bits
__shared__ unsigned smem[32][33];
int idx = threadIdx.x+blockDim.x*blockIdx.x;
int idy = threadIdx.y+blockDim.y*blockIdx.y;
smem[threadIdx.x][threadIdx.y] = ((idx < idim/32)&&(idy < idim))?IDX(in,idx,idy,idim/32):0;
__syncthreads();
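// each warp (one threadIdx.y value) now gathers the 32 words of one input word-column
// (one 32x32 bit tile); lane x picks up the word for tile row 31-x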
unsigned myval = smem[threadIdx.y][31-threadIdx.x];
__syncthreads();
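// step 3: each ballot collects one bit position from all 32 lanes of the warp,
// producing one 32-bit word (one bit-row) of the transposed tile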
smem[ 0][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<31)&myval));
smem[ 1][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<30)&myval));
smem[ 2][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<29)&myval));
smem[ 3][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<28)&myval));
smem[ 4][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<27)&myval));
smem[ 5][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<26)&myval));
smem[ 6][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<25)&myval));
smem[ 7][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<24)&myval));
smem[ 8][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<23)&myval));
smem[ 9][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<22)&myval));
smem[10][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<21)&myval));
smem[11][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<20)&myval));
smem[12][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<19)&myval));
smem[13][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<18)&myval));
smem[14][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<17)&myval));
smem[15][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<16)&myval));
smem[16][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<15)&myval));
smem[17][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<14)&myval));
smem[18][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<13)&myval));
smem[19][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<12)&myval));
smem[20][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<11)&myval));
smem[21][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<<10)&myval));
smem[22][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<< 9)&myval));
smem[23][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<< 8)&myval));
smem[24][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<< 7)&myval));
smem[25][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<< 6)&myval));
smem[26][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<< 5)&myval));
smem[27][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<< 4)&myval));
smem[28][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<< 3)&myval));
smem[29][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<< 2)&myval));
smem[30][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<< 1)&myval));
smem[31][threadIdx.y] = __ballot_sync(0xFFFFFFFFU, ((1U<< 0)&myval));
__syncthreads();
int indx = (idx*idim)+(gridDim.y*threadIdx.y)+blockIdx.y;
if ((idx < (idim/32))&&(idy < idim)) out[indx] = smem[threadIdx.y][threadIdx.x];
}
// naive CPU reference ("golden") transpose; bit 31 of a word is the first (leftmost) bit column
void naive_cpu_bt(const unsigned *in, unsigned *out, const unsigned idim){
    memset(out, 0, idim*(idim/32)*sizeof(unsigned));
    for (int i = 0; i < (idim/32); i++)
        for (int j = 0; j < idim; j++){
            for (int bit = 0; bit < 32; bit++){
                if ((in[(j*(idim/32)) + i]>>bit)&1) out[(((i*32)+(31-bit))*(idim/32))+(j/32)] |= 1<<(31 - (j%32));
            }
        }
}
int main(){
unsigned *h_idata, *h_odata, *d_idata, *d_odata, *c_odata;
unsigned idim;
// test case 1, 32x32, upper triangular
idim = 32;
h_idata = (unsigned *)malloc(idim*(idim/32)*sizeof(unsigned));
h_odata = (unsigned *)malloc(idim*(idim/32)*sizeof(unsigned));
c_odata = (unsigned *)malloc(idim*(idim/32)*sizeof(unsigned));
cudaMalloc(&d_idata, idim*(idim/32)*sizeof(unsigned));
cudaMalloc(&d_odata, idim*(idim/32)*sizeof(unsigned));
unsigned data = 0x0FFFFFFFFU;
for (int i = 0; i < 32; i++){
h_idata[i] = data;
data>>=1;}
cudaMemcpy(d_idata, h_idata, idim*(idim/32)*sizeof(unsigned), cudaMemcpyHostToDevice);
bt<<<dim3(1,1),dim3(32,32)>>>(d_idata, d_odata, idim);
cudaMemcpy(h_odata, d_odata, idim*(idim/32)*sizeof(unsigned), cudaMemcpyDeviceToHost);
naive_cpu_bt(h_idata, c_odata, idim);
for (int i = 0; i < (idim/32)*idim; i++) if (c_odata[i] != h_odata[i]) {printf("mismatch at %u, was: %u, should be: %u\n", i, h_odata[i], c_odata[i]); return -1;}
for (int i = 0; i < 32; i++) printf("0x%8x\n", h_odata[i]);
free(h_idata);
free(h_odata);
free(c_odata);
cudaFree(d_idata);
cudaFree(d_odata);
// test case 2, 64x64, opposite diagonal
idim = 64;
h_idata = (unsigned *)malloc(idim*(idim/32)*sizeof(unsigned));
h_odata = (unsigned *)malloc(idim*(idim/32)*sizeof(unsigned));
c_odata = (unsigned *)malloc(idim*(idim/32)*sizeof(unsigned));
cudaMalloc(&d_idata, idim*(idim/32)*sizeof(unsigned));
cudaMalloc(&d_odata, idim*(idim/32)*sizeof(unsigned));
data = 0x01;
for (int i = 0; i < 32; i++){
h_idata[2*i] = 0; h_idata[(2*i)+1] = data;
h_idata[64+(2*i)] = data; h_idata[65+(2*i)] = 0xFFFFFFFFU;
data<<=1; data++;}
cudaMemcpy(d_idata, h_idata, idim*(idim/32)*sizeof(unsigned), cudaMemcpyHostToDevice);
bt<<<dim3(1,2),dim3(32,32)>>>(d_idata, d_odata, idim);
cudaMemcpy(h_odata, d_odata, idim*(idim/32)*sizeof(unsigned), cudaMemcpyDeviceToHost);
naive_cpu_bt(h_idata, c_odata, idim);
for (int i = 0; i < (idim/32)*idim; i++) if (c_odata[i] != h_odata[i]) { printf("mismatch at %u, was: %u, should be: %u\n", i, h_odata[i], c_odata[i]); return -1;}
for (int i = 0; i < 64; i++) printf("0x%8x 0x%8x\n", h_odata[i*2], h_odata[(i*2)+1]);
free(h_idata);
free(h_odata);
free(c_odata);
cudaFree(d_idata);
cudaFree(d_odata);
// test case 3, 96x96, ones in alternating columns
idim = 96;
h_idata = (unsigned *)malloc(idim*(idim/32)*sizeof(unsigned));
h_odata = (unsigned *)malloc(idim*(idim/32)*sizeof(unsigned));
c_odata = (unsigned *)malloc(idim*(idim/32)*sizeof(unsigned));
cudaMalloc(&d_idata, idim*(idim/32)*sizeof(unsigned));
cudaMalloc(&d_odata, idim*(idim/32)*sizeof(unsigned));
data = 0x55555555U;
for (int i = 0; i < idim*(idim/32); i++)
h_idata[i] = data;
cudaMemcpy(d_idata, h_idata, idim*(idim/32)*sizeof(unsigned), cudaMemcpyHostToDevice);
bt<<<dim3(1,3),dim3(32,32)>>>(d_idata, d_odata, idim);
cudaMemcpy(h_odata, d_odata, idim*(idim/32)*sizeof(unsigned), cudaMemcpyDeviceToHost);
naive_cpu_bt(h_idata, c_odata, idim);
for (int i = 0; i < (idim/32)*idim; i++) if (c_odata[i] != h_odata[i]) { printf("mismatch at %u, was: %u, should be: %u\n", i, h_odata[i], c_odata[i]); return -1;}
for (int i = 0; i < 96; i++) printf("0x%8x 0x%8x 0x%8x\n", h_odata[i*3], h_odata[(i*3)+1], h_odata[(i*3)+2]);
free(h_idata);
free(h_odata);
free(c_odata);
cudaFree(d_idata);
cudaFree(d_odata);
// test case 4, 8kx8k random
idim = 8192;
int xblocks = (idim/1024)+((idim%1024)?1:0);
h_idata = (unsigned *)malloc(idim*(idim/32)*sizeof(unsigned));
h_odata = (unsigned *)malloc(idim*(idim/32)*sizeof(unsigned));
c_odata = (unsigned *)malloc(idim*(idim/32)*sizeof(unsigned));
cudaMalloc(&d_idata, idim*(idim/32)*sizeof(unsigned));
cudaMalloc(&d_odata, idim*(idim/32)*sizeof(unsigned));
for (int i = 0; i < idim*(idim/32); i++)
h_idata[i] = rand();
unsigned long gt = dtime_usec(0);
cudaMemcpy(d_idata, h_idata, idim*(idim/32)*sizeof(unsigned), cudaMemcpyHostToDevice);
unsigned long gkt = dtime_usec(0);
bt<<<dim3(xblocks,(idim/32)),dim3(32,32)>>>(d_idata, d_odata, idim);
cudaDeviceSynchronize();
gkt = dtime_usec(gkt);
cudaMemcpy(h_odata, d_odata, idim*(idim/32)*sizeof(unsigned), cudaMemcpyDeviceToHost);
gt = dtime_usec(gt);
unsigned long ct = dtime_usec(0);
naive_cpu_bt(h_idata, c_odata, idim);
ct = dtime_usec(ct);
for (int i = 0; i < (idim/32)*idim; i++) if (c_odata[i] != h_odata[i]) { printf("mismatch at %u, was: %u, should be: %u\n", i, h_odata[i], c_odata[i]); return -1;}
printf("gputime: %fms, kerneltime: %fms, cputime: %fms\n", (gt*1000)/(float)USECPSEC, (gkt*1000)/(float)USECPSEC, (ct*1000)/(float)USECPSEC);
printf("kernel bandwidth = %fMB/s\n", (idim*(idim/32)*4*2)/(float)gkt);
printf("Success!\n");
free(h_idata);
free(h_odata);
free(c_odata);
cudaFree(d_idata);
cudaFree(d_odata);
return 0;
}
$ nvcc -arch=sm_61 -o t435 t435.cu
$ ./t435
0x80000000
0xc0000000
0xe0000000
0xf0000000
0xf8000000
0xfc000000
0xfe000000
0xff000000
0xff800000
0xffc00000
0xffe00000
0xfff00000
0xfff80000
0xfffc0000
0xfffe0000
0xffff0000
0xffff8000
0xffffc000
0xffffe000
0xfffff000
0xfffff800
0xfffffc00
0xfffffe00
0xffffff00
0xffffff80
0xffffffc0
0xffffffe0
0xfffffff0
0xfffffff8
0xfffffffc
0xfffffffe
0xffffffff
0x 0 0x 1
0x 0 0x 3
0x 0 0x 7
0x 0 0x f
0x 0 0x 1f
0x 0 0x 3f
0x 0 0x 7f
0x 0 0x ff
0x 0 0x 1ff
0x 0 0x 3ff
0x 0 0x 7ff
0x 0 0x fff
0x 0 0x 1fff
0x 0 0x 3fff
0x 0 0x 7fff
0x 0 0x ffff
0x 0 0x 1ffff
0x 0 0x 3ffff
0x 0 0x 7ffff
0x 0 0x fffff
0x 0 0x 1fffff
0x 0 0x 3fffff
0x 0 0x 7fffff
0x 0 0x ffffff
0x 0 0x 1ffffff
0x 0 0x 3ffffff
0x 0 0x 7ffffff
0x 0 0x fffffff
0x 0 0x1fffffff
0x 0 0x3fffffff
0x 0 0x7fffffff
0x 0 0xffffffff
0x 1 0xffffffff
0x 3 0xffffffff
0x 7 0xffffffff
0x f 0xffffffff
0x 1f 0xffffffff
0x 3f 0xffffffff
0x 7f 0xffffffff
0x ff 0xffffffff
0x 1ff 0xffffffff
0x 3ff 0xffffffff
0x 7ff 0xffffffff
0x fff 0xffffffff
0x 1fff 0xffffffff
0x 3fff 0xffffffff
0x 7fff 0xffffffff
0x ffff 0xffffffff
0x 1ffff 0xffffffff
0x 3ffff 0xffffffff
0x 7ffff 0xffffffff
0x fffff 0xffffffff
0x 1fffff 0xffffffff
0x 3fffff 0xffffffff
0x 7fffff 0xffffffff
0x ffffff 0xffffffff
0x 1ffffff 0xffffffff
0x 3ffffff 0xffffffff
0x 7ffffff 0xffffffff
0x fffffff 0xffffffff
0x1fffffff 0xffffffff
0x3fffffff 0xffffffff
0x7fffffff 0xffffffff
0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
0x 0 0x 0 0x 0
0xffffffff 0xffffffff 0xffffffff
gputime: 5.530000ms, kerneltime: 0.301000ms, cputime: 659.205994ms
kernel bandwidth = 55738.257812MB/s
Success!
$
For the first 3 test cases, I used 3 different input patterns and printed out the results, in addition to doing the CPU validation. For the 4th test case (large random pattern) I skipped the printout but retained the CPU validation.
Also note that I used the __ballot_sync() variant, to be compliant with CUDA 9.
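(On a pre-CUDA-9 toolchain, the corresponding legacy call would be the unsynchronized form, e.g.:

smem[ 0][threadIdx.y] = __ballot((1U<<31)&myval);  // legacy, pre-CUDA-9 intrinsic

and similarly for the other 31 ballot lines.)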
I haven't done any performance analysis. As with most transpose operations, I expect a large part of the performance here to stem from efficient use of global memory. The major inhibitor there would be the scattered writes in step 4 above.
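If you want a rough upper bound to compare the reported kernel bandwidth against, a common yardstick is a device-to-device cudaMemcpy of the same amount of data. A sketch of what that could look like (not part of the test code above; run it after the result has been copied back, since it overwrites d_odata):

unsigned long dt = dtime_usec(0);
cudaMemcpy(d_odata, d_idata, idim*(idim/32)*sizeof(unsigned), cudaMemcpyDeviceToDevice);
cudaDeviceSynchronize();
dt = dtime_usec(dt);
printf("copy bandwidth = %fMB/s\n", (idim*(idim/32)*4*2)/(float)dt);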
Some general background information on matrix transpose in CUDA can be found here, although this bitwise case deviates substantially from it.