
I am thinking about reworking my GPU OpenCL kernel to speed things up. The problem is that a lot of my global memory accesses are not coalesced, and the fetches are really dragging down performance. So I am planning to copy as much of the global memory as possible into local memory, but I have to pick what to copy.

Now my question is: Do many fetches of small chunks of memory hurt more than fewer fetches of larger chunks?

Nigel

3 Answers


You can use clGetDeviceInfo with CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE to find out the cacheline size for a device. On many devices today, this value is typically 16 bytes.
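
As a minimal host-side sketch of that query (error handling omitted; device is assumed to be a cl_device_id you have already obtained via clGetDeviceIDs):

#include <stdio.h>
#include <CL/cl.h>

/* Print the global memory cacheline size for a device.
   'device' is assumed to be a valid cl_device_id. */
void printCachelineSize(cl_device_id device)
{
    cl_uint cachelineSize = 0;
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_CACHELINE_SIZE,
                    sizeof(cachelineSize), &cachelineSize, NULL);
    printf("global mem cacheline size: %u bytes\n", cachelineSize);
}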

Small reads can be troublesome, but if you are reading from the same cacheline, you should be fine. The short answer: you need to keep your 'small chunks' close together in memory to keep it fast.

I have two functions below to demonstrate two ways of accessing the memory: vectorAddFoo and vectorAddBar. The third function, copySomeMemory(...), applies to your question specifically. Both vector functions have their work items add a portion of the vectors being added, but use different memory access patterns. vectorAddFoo has each work item process a contiguous block of vector elements, starting at its calculated position in the arrays and moving forward through its workload. vectorAddBar has work items start at their gid and skip gSize (= global size) elements before fetching and adding the next elements.

vectorAddBar will execute faster because its reads and writes fall into the same cacheline in memory. Every 4 float reads will fall on the same cacheline and take only one action from the memory controller to perform. After reading a[] and b[] in this manner, all four work items will be able to do their addition and queue their write to c[].

vectorAddFoo will guarantee the reads and writes are not in the same cacheline (except for very short vectors, ~totalElements<5). Every read from a work item will require an action from the memory controller. Unless the GPU happens to cache the following 3 floats in every case, this will result in 4x the memory accesses.

__kernel void
vectorAddFoo(__global const float * a,
             __global const float * b,
             __global       float * c,
             const int totalElements)
{
  int gid = get_global_id(0);
  int elementsPerWorkItem = totalElements/get_global_size(0);
  int start = elementsPerWorkItem * gid;

  //each work item walks forward through its own contiguous block
  for(int i=0;i<elementsPerWorkItem;i++){
    c[start+i] = a[start+i] + b[start+i];
  }
}
__kernel void
vectorAddBar(__global const float * a,
             __global const float * b,
             __global       float * c,
             const int totalElements)
{
  int gid = get_global_id(0);
  int gSize = get_global_size(0);

  //adjacent work items touch adjacent elements on every iteration
  for(int i=gid;i<totalElements;i+=gSize){
    c[i] = a[i] + b[i];
  }
}
__kernel void
copySomeMemory(__global const int * src,
               const int count,
               const int position)
{
  //copy 16kb of integers to local memory, starting at 'position'
  int lid = get_local_id(0);
  int lSize = get_local_size(0);
  __local int dst[4096];

  //work items copy cooperatively: adjacent items read adjacent ints,
  //so each pass through the loop touches consecutive cachelines
  for(int i=lid;i<4096;i+=lSize){
    dst[i] = src[position+i];
  }
  barrier(CLK_LOCAL_MEM_FENCE);
  //use dst here...
}
mfa

In general, fewer fetches of larger size will be more efficient. I can't give you specific advice without seeing your code, but make sure the work-items access sequential chunks of memory to enable 'streaming'. Do any transpositions or random memory accesses after you bring the data into local memory.
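
A rough sketch of that idea (the kernel name, tile size, and layout here are hypothetical, not from the original answer): each work group streams a tile in with sequential, coalesced reads, and the 'random' part of the access pattern happens against local memory instead of global memory.

#define TILE_DIM 16

//hypothetical sketch: stream a TILE_DIM x TILE_DIM tile from global memory
//with coalesced row reads, then write it back transposed; the transposition
//goes through local memory, so both global accesses stay sequential
__kernel void transposeTile(__global const float * in,
                            __global       float * out,
                            const int width,
                            const int height)
{
  __local float tile[TILE_DIM][TILE_DIM];
  int x = get_global_id(0);
  int y = get_global_id(1);
  int lx = get_local_id(0);
  int ly = get_local_id(1);

  //coalesced read: adjacent work items read adjacent elements of a row
  if (x < width && y < height)
    tile[ly][lx] = in[y * width + x];

  barrier(CLK_LOCAL_MEM_FENCE);

  //coordinates of this work item's element in the transposed output
  int ox = get_group_id(1) * TILE_DIM + lx;
  int oy = get_group_id(0) * TILE_DIM + ly;

  //coalesced write: the 'random' access is tile[lx][ly], in local memory
  if (ox < height && oy < width)
    out[oy * height + ox] = tile[lx][ly];
}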

Lubo Antonov

I'm not able to understand your question fully, but if you have large global memory accesses and that data is re-used, then use local memory.

Note: with a small local work size, less data is shared per work group, so local memory is of little use; with a large local work size, fewer work groups run in parallel. You need to select the best balance between the two.
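
If it helps when picking that balance, the largest local work size a kernel supports on a device can be queried from the host; a minimal sketch (assuming kernel and device have already been created):

#include <CL/cl.h>

/* Sketch: query the maximum work-group size this kernel supports on this
   device, as an upper bound when choosing the local work size. */
size_t maxLocalSize(cl_kernel kernel, cl_device_id device)
{
    size_t wgSize = 0;
    clGetKernelWorkGroupInfo(kernel, device, CL_KERNEL_WORK_GROUP_SIZE,
                             sizeof(wgSize), &wgSize, NULL);
    return wgSize;
}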

Megharaj