
I have a large character array in device global memory that is accessed by threads in a coalesced manner. I've read somewhere that I could speed up memory access by having each thread read 4 or 16 chars in one memory transaction. I believe I would have to use textures and the char4 or int4 structs. However, I can't find any documentation or examples on this. Could anyone here please provide a simple example or pointers to where I can learn more about this?

In my code I define the char array as

char *database = NULL;
cudaMalloc( (void**) &database, SIZE * sizeof(char) );

What would the definition be if I want to use textures and char4 (or int4)?

Thanks very much.

Ross
  • If you pass the device pointer to database and you guarantee it is 128-bit aligned, you can simply read using int4*. This will result in 128-bit reads per thread through L1. – Greg Smith Aug 03 '12 at 02:09
  • Alright, I'll try this. If SIZE is a multiple of 16, would it be 128-bit aligned? It may sound like a stupid question but I want to make sure I get it right. Could you elaborate on "simply read"? Thanks. – Ross Aug 03 '12 at 05:19
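
A note on the alignment question in the comments: alignment is a property of the address, not of SIZE. As far as I know, the CUDA programming guide states that pointers returned by cudaMalloc are aligned to at least 256 bytes, so the base pointer should already qualify; SIZE being a multiple of 16 only guarantees that the last int4 is complete. "Simply read" then presumably means casting the pointer and dereferencing it. Below is a minimal sketch under those assumptions. The names isAligned16, copy16 and copyWithInt4 are hypothetical helpers made up for illustration; they are not part of CUDA or of the original code.

```cuda
#include <cassert>
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical helper: true if p lies on a 16-byte (128-bit) boundary.
static bool isAligned16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Each thread moves one int4, i.e. 16 chars, with one 128-bit load and one 128-bit store.
__global__ void copy16(const int4* in, int4* out, unsigned int n16) {
    unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;
    if (id < n16) out[id] = in[id];
}

// Hypothetical host helper: fills SIZE chars with `fill`, copies them through
// int4 pointers, and reports whether the first and last chars arrived intact.
bool copyWithInt4(unsigned int SIZE, char fill) {
    assert(SIZE >= 16 && SIZE % 16 == 0);  // assumption: no partial int4 at the end
    char *database = NULL, *copy = NULL;
    if (cudaMalloc((void**) &database, SIZE * sizeof(char)) != cudaSuccess) return false;
    if (cudaMalloc((void**) &copy, SIZE * sizeof(char)) != cudaSuccess) {
        cudaFree(database);
        return false;
    }
    cudaMemset(database, fill, SIZE);
    cudaMemset(copy, 0, SIZE);

    // The check is on the addresses; SIZE was already checked above.
    bool ok = isAligned16(database) && isAligned16(copy);
    if (ok) {
        unsigned int n16 = SIZE / 16;  // number of int4 elements
        copy16<<<(n16 + 255) / 256, 256>>>(reinterpret_cast<const int4*>(database),
                                           reinterpret_cast<int4*>(copy), n16);
        ok = (cudaDeviceSynchronize() == cudaSuccess);
        char ends[2] = {0, 0};
        cudaMemcpy(&ends[0], copy, 1, cudaMemcpyDeviceToHost);
        cudaMemcpy(&ends[1], copy + SIZE - 1, 1, cudaMemcpyDeviceToHost);
        ok = ok && ends[0] == fill && ends[1] == fill;
    }
    cudaFree(database);
    cudaFree(copy);
    return ok;
}
```

Calling copyWithInt4(1u << 20, 'A') from main should return true on a working device. If the pointer is offset into the allocation (for example database + 1), the alignment check fails and the int4 path must not be used.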

1 Answer


I finally figured out the answer to my own question. The definition with char4 would be

char4 *database = NULL;
cudaMalloc( (void**) &database, SIZE * sizeof(char4)/4 );

Textures are not needed for this. The kernel speeds up by a factor of three with char4, but the speedup drops to a factor of two if I unroll the loop. For the sake of completeness, my kernel is

__global__ void kernel(unsigned int jobs_todo, char* database, float* results ) {

  unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;
  float A = 0; int i; char ch;
  if(id < jobs_todo) {
    for(i = 0; i < 1000; i += 1){
     ch = database[jobs_todo*i + id];
     if(ch == 'A') A++;
    }
    results[id] = A;
  }
}

And with char4 it is

__global__ void kernel4(unsigned int jobs_todo, char4* database, float* results ) {

  unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;
  float A = 0; int i; char4 ch4;
  if(id < jobs_todo) {
    for(i = 0; i < 1000/4; i += 1){
     ch4 = database[jobs_todo*i + id];
     if(ch4.x == 'A') A++;
     if(ch4.y == 'A') A++;
     if(ch4.z == 'A') A++;
     if(ch4.w == 'A') A++;
    }
    results[id] = A;
  }
}
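
One thing worth checking when switching between the two kernels: with the same index expression, thread id in kernel4 reads bytes 4*(jobs_todo*i + id) through 4*(jobs_todo*i + id) + 3, so it does not see the same chars as thread id in kernel unless the data is laid out accordingly. Both kernels still cover every byte of the buffer exactly once, so the sum over all results should agree. The harness below is a rough sketch of that sanity check. The function totalCount, the launch configuration of 256 threads per block, and the test pattern are assumptions for illustration, and error checking is omitted for brevity.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// The two kernels from above, reproduced so the sketch compiles on its own.
__global__ void kernel(unsigned int jobs_todo, char* database, float* results) {
    unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;
    float A = 0; int i; char ch;
    if (id < jobs_todo) {
        for (i = 0; i < 1000; i += 1) {
            ch = database[jobs_todo*i + id];
            if (ch == 'A') A++;
        }
        results[id] = A;
    }
}

__global__ void kernel4(unsigned int jobs_todo, char4* database, float* results) {
    unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;
    float A = 0; int i; char4 ch4;
    if (id < jobs_todo) {
        for (i = 0; i < 1000/4; i += 1) {
            ch4 = database[jobs_todo*i + id];
            if (ch4.x == 'A') A++;
            if (ch4.y == 'A') A++;
            if (ch4.z == 'A') A++;
            if (ch4.w == 'A') A++;
        }
        results[id] = A;
    }
}

// Hypothetical harness: runs one of the kernels on a fixed test pattern
// (every third char is 'A') and returns the sum of all per-thread counts.
double totalCount(unsigned int jobs_todo, bool useChar4) {
    assert(jobs_todo > 0);
    size_t bytes = (size_t) jobs_todo * 1000;  // 1000 chars per job, as in the kernels
    std::vector<char> host(bytes);
    for (size_t k = 0; k < bytes; ++k) host[k] = (k % 3 == 0) ? 'A' : 'C';

    char* database = NULL;
    float* results = NULL;
    cudaMalloc((void**) &database, bytes);
    cudaMalloc((void**) &results, jobs_todo * sizeof(float));
    cudaMemcpy(database, host.data(), bytes, cudaMemcpyHostToDevice);

    unsigned int blocks = (jobs_todo + 255) / 256;
    if (useChar4)
        kernel4<<<blocks, 256>>>(jobs_todo, reinterpret_cast<char4*>(database), results);
    else
        kernel<<<blocks, 256>>>(jobs_todo, database, results);

    std::vector<float> out(jobs_todo);
    cudaMemcpy(out.data(), results, jobs_todo * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(database);
    cudaFree(results);

    double total = 0;  // per-thread counts are small integers, so this sum is exact
    for (size_t k = 0; k < out.size(); ++k) total += out[k];
    return total;
}
```

Comparing totalCount(1024, false) with totalCount(1024, true) should give the same number. If per-thread results must also match, the input would need to be rearranged so that each thread's four chars are contiguous.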

I also tried int4, but it is only 0.0002 seconds faster than the char4 version.
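
The int4 kernel itself is not shown above, so the following is only a guess at what such a variant could look like, not the code that was actually timed. Each int4 load brings in 16 chars, which are then unpacked byte by byte. Since 1000 is not a multiple of 16, this sketch takes the number of int4 loads per thread as a parameter (loads16) instead of hard-coding it. The names countInWord, kernel16 and countWithInt4 are hypothetical.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: counts the bytes equal to c inside one 32-bit word.
__device__ int countInWord(int word, char c) {
    int n = 0;
    for (int b = 0; b < 4; ++b)
        if ((char) ((word >> (8 * b)) & 0xFF) == c) n++;
    return n;
}

// Each thread issues loads16 reads of 16 chars each (one int4 per read).
__global__ void kernel16(unsigned int jobs_todo, unsigned int loads16,
                         const int4* database, float* results) {
    unsigned int id = threadIdx.x + blockIdx.x * blockDim.x;
    float A = 0;
    if (id < jobs_todo) {
        for (unsigned int i = 0; i < loads16; i += 1) {
            int4 v = database[jobs_todo*i + id];
            A += countInWord(v.x, 'A') + countInWord(v.y, 'A')
               + countInWord(v.z, 'A') + countInWord(v.w, 'A');
        }
        results[id] = A;
    }
}

// Hypothetical harness: fills the whole buffer with 'A' and returns the
// sum of all per-thread counts (error checking omitted for brevity).
double countWithInt4(unsigned int jobs_todo, unsigned int loads16) {
    assert(jobs_todo > 0 && loads16 > 0);
    size_t bytes = (size_t) jobs_todo * loads16 * 16;
    char* database = NULL;
    float* results = NULL;
    cudaMalloc((void**) &database, bytes);
    cudaMalloc((void**) &results, jobs_todo * sizeof(float));
    cudaMemset(database, 'A', bytes);

    kernel16<<<(jobs_todo + 255) / 256, 256>>>(
        jobs_todo, loads16, reinterpret_cast<const int4*>(database), results);

    std::vector<float> out(jobs_todo);
    cudaMemcpy(out.data(), results, jobs_todo * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(database);
    cudaFree(results);

    double total = 0;
    for (size_t k = 0; k < out.size(); ++k) total += out[k];
    return total;
}
```

With every char set to 'A', countWithInt4(256, 4) should return 256 * 4 * 16 = 16384. The small gain over char4 reported above would be consistent with the extra unpacking work offsetting the wider loads, though that is speculation without profiling.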

Ross