
I would like to read (BS_X+1)*(BS_Y+1) global memory locations with BS_X*BS_Y threads, moving the contents to shared memory, and I have developed the following code.

int i       = threadIdx.x;
int j       = threadIdx.y;
int idx     = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy     = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

int index1  = j*BLOCK_SIZE_Y+i;

int i1      = (index1)%(BLOCK_SIZE_X+1);
int j1      = (index1)/(BLOCK_SIZE_Y+1);

int i2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);

__shared__ double Ezx_h_shared_ext[BLOCK_SIZE_X+1][BLOCK_SIZE_Y+1];     

Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];

if ((i2<(BLOCK_SIZE_X+1))&&(j2<(BLOCK_SIZE_Y+1))) 
    Ezx_h_shared_ext[i2][j2]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j2)*xdim+(blockIdx.x*BLOCK_SIZE_X+i2)];

In my understanding, coalescing is the parallel equivalent of consecutive memory reads in sequential processing. How can I now check whether the global memory accesses are coalesced? I note that there is an index jump from (i1,j1) to (i2,j2). Thanks in advance.

paleonix
Vitality

3 Answers


I've evaluated the memory accesses of your code with a hand-written coalescing analyzer. The evaluation shows that your code does not exploit coalescing well. Here is the coalescing analyzer, which you may find useful:

#include <stdio.h>
#include <stdlib.h>

typedef struct dim3_t{
    int x;
    int y;
} dim3;


// KERNEL LAUNCH PARAMETERS
#define GRIDDIMX 4
#define GRIDDIMY 4
#define BLOCKDIMX 16
#define BLOCKDIMY 16


// ARCHITECTURE DEPENDENT
// number of threads aggregated for coalescing
#define COALESCINGWIDTH 32
// number of bytes in one coalesced transaction
#define CACHEBLOCKSIZE 128
#define CACHE_BLOCK_ADDR(addr,size)  (addr*size)&(~(CACHEBLOCKSIZE-1))


int main(){
    // fixed dim3 variables
    // grid and block size
    dim3 blockDim,gridDim;
    blockDim.x=BLOCKDIMX;
    blockDim.y=BLOCKDIMY;
    gridDim.x=GRIDDIMX;
    gridDim.y=GRIDDIMY;

    // counters
    int unq_accesses=0;
    int *unq_addr=(int*)malloc(sizeof(int)*COALESCINGWIDTH);
    int total_unq_accesses=0;

    // iter over total number of threads
    // and count the number of memory requests (the coalesced requests)
    int I, II, III;
    for(I=0; I<GRIDDIMX*GRIDDIMY; I++){
        dim3 blockIdx;
        blockIdx.x = I%GRIDDIMX;
        blockIdx.y = I/GRIDDIMX;
        for(II=0; II<BLOCKDIMX*BLOCKDIMY; II++){
            if(II%COALESCINGWIDTH==0){
                // new coalescing bunch
                total_unq_accesses+=unq_accesses;
                unq_accesses=0;
            }
            dim3 threadIdx;
            threadIdx.x=II%BLOCKDIMX;
            threadIdx.y=II/BLOCKDIMX;

            ////////////////////////////////////////////////////////
            // Change this section to evaluate different accesses //
            ////////////////////////////////////////////////////////
            // do your indexing here
            #define BLOCK_SIZE_X BLOCKDIMX
            #define BLOCK_SIZE_Y BLOCKDIMY
            #define xdim 32
            int i       = threadIdx.x;
            int j       = threadIdx.y;
            int idx     = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
            int idy     = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

            int index1  = j*BLOCK_SIZE_Y+i;

            int i1      = (index1)%(BLOCK_SIZE_X+1);
            int j1      = (index1)/(BLOCK_SIZE_Y+1);

            int i2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
            int j2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);
            // calculate the accessed location and offset here
            // change the line "Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];" to
            int addr = (blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1);
            int size = sizeof(double);
            //////////////////////////
            // End of modifications //
            //////////////////////////

            printf("tid (%d,%d) from blockid (%d,%d) accessing to block %d\n",threadIdx.x,threadIdx.y,blockIdx.x,blockIdx.y,CACHE_BLOCK_ADDR(addr,size));
            // check whether it can be merged with existing requests or not
            short merged=0;
            for(III=0; III<unq_accesses; III++){
                if(CACHE_BLOCK_ADDR(addr,size)==CACHE_BLOCK_ADDR(unq_addr[III],size)){
                    merged=1;
                    break;
                }
            }
            if(!merged){
                // new cache block accessed over this coalescing width
                unq_addr[unq_accesses]=CACHE_BLOCK_ADDR(addr,size);
                unq_accesses++;
            }
        }
    }
    printf("%d threads make %d memory transactions\n",GRIDDIMX*GRIDDIMY*BLOCKDIMX*BLOCKDIMY, total_unq_accesses);
}

The code runs through every thread of the grid and calculates the number of merged requests, a metric of memory access coalescing.

To use the analyzer, paste the index calculation portion of your code into the specified region and decompose the memory access (the array reference) into an 'address' and a 'size'. I've already done this for your code, where the index calculations are:

int i       = threadIdx.x;
int j       = threadIdx.y;
int idx     = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;
int idy     = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

int index1  = j*BLOCK_SIZE_Y+i;

int i1      = (index1)%(BLOCK_SIZE_X+1);
int j1      = (index1)/(BLOCK_SIZE_Y+1);

int i2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)%(BLOCK_SIZE_X+1);
int j2      = (BLOCK_SIZE_X*BLOCK_SIZE_Y+index1)/(BLOCK_SIZE_Y+1);

and the memory access is:

Ezx_h_shared_ext[i1][j1]=Ezx_h[(blockIdx.y*BLOCK_SIZE_Y+j1)*xdim+(blockIdx.x*BLOCK_SIZE_X+i1)];

The analyzer reports that the 4096 threads access 4064 cache blocks. Run the code for your actual grid and block size and analyze the coalescing behavior.
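The ideal pattern, by contrast, is one where consecutive threads access consecutive addresses. A minimal sketch of such an access, using the same defines as above (an illustration only, not the indexing from the question), would replace the marked region with:

// do your indexing here (row-major baseline for comparison)
int idx  = blockIdx.x*BLOCK_SIZE_X + threadIdx.x;  // varies fastest with threadIdx.x
int idy  = blockIdx.y*BLOCK_SIZE_Y + threadIdx.y;

// calculate the accessed location and offset here
int addr = idy*xdim + idx;   // consecutive threads -> consecutive addresses
int size = sizeof(double);

Here a warp of 32 threads reads 32 consecutive doubles (256 contiguous bytes), which the hardware can service with two 128-byte transactions.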

lashgar
  • Pretty cool! NVIDIA also has an SDK that provides direct access to the performance counters in the chip. https://developer.nvidia.com/nvidia-perfkit – Roger Dahl Dec 08 '12 at 15:20
  • @RogerDahl Nice! Does the memory coalescing have any counter in the chip? – lashgar Dec 08 '12 at 15:29
  • I think coalescing is one of the things that are derived from other counters. The Nsight profiler has this blurb about the memory experiments: "Select this experiment group to identify memory-related performance bottlenecks of a kernel. For each memory space of the CUDA memory hierarchy key metrics are collected, including coalescing, bank conflicts, L1/L2 cache hit rates, and achieved bandwidths." The perf kit docs have some nice charts detailing the counters. Those can probably be used to work out how to count coalescing. – Roger Dahl Dec 08 '12 at 15:51
  • Thank you for your interesting code. Summarizing RogerDahl's post, the 32 threads of a warp should address an aligned 128-byte area to make a single transaction. So, according to the result of your code, my addressing is pretty inefficient since 4096 threads make 4064 transactions. To improve the efficiency, the 4064 transactions should be lowered, right? Also, the best that I could get with 4096 threads is 128 (32 threads make a single transaction, 4096/32=128), right? Thanks again for your support. – Vitality Dec 08 '12 at 20:47
  • Also, since in my case each of the 4096 threads should access 8 bytes (double), the 32 threads of a warp cannot fit within a single aligned 128-byte segment of global memory (32*8=256 bytes). So, the minimum number of transactions for 4096 threads accessing 8 bytes each is 256, right? – Vitality Dec 08 '12 at 21:10

As GPUs have evolved, the requirements for getting coalesced accesses have become less restrictive. Your description of coalesced accesses is more accurate for the earlier GPU architectures than the more recent ones. In particular, Fermi (compute capability 2.0) significantly loosened the requirements. On Fermi and later, it is not important to access the memory locations consecutively. Instead, focus has shifted to accessing memory with as few memory transactions as possible.

On Fermi, global memory transactions are 128 bytes wide. So, when the 32 threads in a warp hit an instruction that performs a load or store, 128-byte transactions will be scheduled to service all the threads in the warp. Performance then depends on how many transactions are necessary. If all the threads access values within a 128-byte area that is aligned to 128 bytes, a single transaction is necessary. If all the threads access values in different 128-byte areas, 32 transactions will be necessary. That would be the worst case scenario for servicing the requests for a single instruction in a warp.
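To illustrate the two extremes, here is a hypothetical pair of kernels (not from the question's code); the comments assume 8-byte doubles and 128-byte transactions:

// Illustrative sketch only: the two extremes of the behaviour described above.
__global__ void coalesced_copy(const double *in, double *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Adjacent threads read adjacent doubles: a warp covers 32*8 = 256 contiguous bytes,
    // so two 128-byte transactions per warp suffice (the minimum for 8-byte elements).
    out[i] = in[i];
}

__global__ void strided_copy(const double *in, double *out, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // With a stride of 16 doubles (128 bytes) or more, every thread of the warp falls
    // into a different 128-byte segment, so up to 32 transactions are needed per warp.
    out[i] = in[i * stride];
}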

You can use one of the CUDA profilers to determine the average number of transactions required to service the requests. The number should be as close to 1 as possible. Higher numbers mean that you should look for opportunities to optimize the memory accesses in your kernel.

Roger Dahl
  • Thank you. According to Ahmad's code, 4096 threads make 4064 transactions. I would then conclude that my code is pretty inefficient. Am I right? – Vitality Dec 08 '12 at 20:53
  • @user1886641 Since the data of every 16 threads fits in 128 bytes, every warp should ideally send 2 requests. The ideal case for your code is to send (4096/32)*2 = 256 requests. – lashgar Dec 09 '12 at 05:56

The visual profiler is a great tool for checking your work. After you have a piece of code functionally correct, run it from within the visual profiler. On Linux, for example, assuming you have an X session, just run nvvp from a terminal window. You will then be given a wizard which will prompt you for the application to profile along with any command line parameters.

The profiler will then do a basic run of your app to collect statistics. You can also select more advanced statistics gathering (requiring additional runs), and one of these will be memory utilization statistics. It will report memory utilization as a percentage of peak and will also flag warnings for what it considers to be low utilization that merits your attention.

If you have a utilization number above 50%, your app is probably running the way you expect. If you have a low number, you have probably missed some coalescing details. It will report statistics separately for memory reads and memory writes. To get 100% or close to it, you will also need to make sure that your coalesced reads and writes from the warp are aligned on 128-byte boundaries.

A common mistake in these situations is to use the threadIdx.y based variable as the most rapidly changing index. It doesn't seem to me that you've made that error. For example, it's a common mistake to write shared[threadIdx.x][threadIdx.y], since this is often the way we think about it in C. But threads are grouped together first along the x axis, so we want to use shared[threadIdx.y][threadIdx.x] or something similar, as in the sketch below. If you do make this mistake, your code can still be functionally correct but you will get low percentage utilization numbers in the profiler, like around 12% or even 3%.
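A minimal sketch of the correct ordering (the kernel name, TILE size, and width parameter are hypothetical, not taken from the question):

#define TILE 16

// Hypothetical tile copy illustrating the index ordering discussed above.
__global__ void load_tile(const double *in, double *out, int width)
{
    __shared__ double tile[TILE][TILE];

    int col = blockIdx.x * TILE + threadIdx.x;   // threadIdx.x varies fastest within a warp
    int row = blockIdx.y * TILE + threadIdx.y;

    // threadIdx.x indexes the last (fastest-varying) dimension of both the global array
    // and the shared tile, so consecutive threads read consecutive global addresses.
    tile[threadIdx.y][threadIdx.x] = in[row * width + col];
    __syncthreads();

    out[row * width + col] = tile[threadIdx.y][threadIdx.x];
}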

And as already stated, to get above 50% and close to 100%, you will want to make sure not only that all your thread requests are adjacent, but also that they are aligned on a 128-byte boundary. Due to the L1/L2 caches, these aren't hard and fast rules but guidelines. The caches may mitigate some mistakes to some degree.
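One common way to keep every row of a 2D array aligned is to let the runtime pad the rows, for example with cudaMallocPitch. A sketch (the names d_Ezx_h and ydim are assumptions, not from the question):

// Sketch: padded allocation so each row starts on an aligned boundary.
// d_Ezx_h and ydim are hypothetical names; xdim is the row length from the question.
double *d_Ezx_h;
size_t pitch;   // padded row length in bytes, chosen by the runtime
cudaMallocPitch((void**)&d_Ezx_h, &pitch, xdim * sizeof(double), ydim);

// Inside a kernel, step between rows using the pitch instead of xdim:
// double *row   = (double*)((char*)d_Ezx_h + idy * pitch);
// double value  = row[idx];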

Robert Crovella
  • What do you mean by "utilization"? That all the cached memory transfers from the global memory are fully utilized? Thanks. – Vitality Dec 08 '12 at 21:27
  • Correct. When a memory transaction is triggered by a read request, for example, a full 128 bytes is normally retrieved from memory. If my warp only needs a single 32-bit quantity, then I will only be using 4 of those 128 bytes. If all of my read activity were like this, I would see a utilization percentage of 4/128 = 3.125%. But if instead all 32 threads in every warp are requesting adjacent 32-bit values from the same 128-byte block at the same time (a *coalesced* access) then my utilization would be 100%, which is ideal. – Robert Crovella Dec 12 '12 at 17:57