
I'm currently trying to transpose a matrix in OpenCL with memory coalescing.

I've already transposed the matrix in a "simple" way, which worked perfectly fine. When I tried to do the same thing with memory coalescing, I was hoping to see a small improvement in execution time, but my implementation is actually slower than the simple one (the implementation is correct, it's just not efficient). I think I haven't quite understood how to ensure that horizontally neighboring work-items write to horizontally neighboring addresses.

Here is the kernel for my coalesced implementation:

__kernel void MatrixTranspose(__global const float* Matrix,
                              __global float* MatrixTransposed,
                              uint Width, uint Height,
                              __local float* block) {

    int2 GlobalID;
    GlobalID.x = get_global_id(0);
    GlobalID.y = get_global_id(1);

    int2 LocalID;
    LocalID.x = get_local_id(0);
    LocalID.y = get_local_id(1);

    // Stage this work-group's tile in local memory.
    block[LocalID.y*get_local_size(0) + LocalID.x] = Matrix[GlobalID.y*Width + GlobalID.x];

    barrier(CLK_LOCAL_MEM_FENCE);

    int2 groupId;
    groupId.x = get_group_id(0);
    groupId.y = get_group_id(1);

    int2 localSize;
    localSize.x = get_local_size(0);
    localSize.y = get_local_size(1);

    // Write to the destination (row = global x, column = Height - global y - 1).
    MatrixTransposed[Height*(LocalID.x + groupId.x*localSize.x)
                     + Height - (LocalID.y + groupId.y*localSize.y) - 1]
        = block[LocalID.y*localSize.x + LocalID.x];
}

I hope someone can give me some advice. Thank you :)

MiepMiep

2 Answers


Unfortunately, you are going to be bound by the global read and write speed of the device. Normally you transpose a matrix in order to do some calculation with it, and that helps hide the latency. In your example you are reading into local memory, waiting for the barrier, and writing back to global; this only adds an extra step and the complexity of using local memory.

You should do something with the data while it is in local memory if you want to hide the global memory latency.

If all you want to do is transpose the matrix, simply read from global and write directly to the target location in global. Maybe look into async_work_group_copy if you still want to try using local memory.
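
A minimal sketch of such a direct transpose, assuming the same Width/Height parameters as the question's kernel (the kernel name and the bounds check are my own placeholders):

__kernel void SimpleTranspose(__global const float* src,
                              __global float* dst,
                              uint Width, uint Height)
{
    uint x = get_global_id(0);   // column in src
    uint y = get_global_id(1);   // row in src

    // Guard in case the global work size is rounded up past the matrix.
    if (x < Width && y < Height)
        dst[x*Height + y] = src[y*Width + x];
}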

Now for my answer.

Try making a work-item responsible for more than a single float. If you read a 4x4 region per work-item, you can transpose it in private memory. This not only skips local memory but also eliminates the need for a barrier, and it reduces the number of work-items you need by a factor of 16 (see the sketch after the steps below).

steps:

  • calculate src and dest global memory addresses
  • load four float4 values from global
  • transpose the 4x4 floats by swapping their w,x,y,z values accordingly
  • store 4 float4 values at new location in global memory
  • handle the edge regions of the matrix in a separate kernel or in the host program when the dimensions are not multiples of four (or pad your input matrix to make them multiples of 4)
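
A sketch of that idea might look like the following, assuming Width and Height are multiples of 4 (the kernel and variable names are placeholders, not taken from the answer):

__kernel void MatrixTranspose4x4(__global const float4* src,
                                 __global float4* dst,
                                 uint Width, uint Height)
{
    // One work-item per 4x4 tile. Width/Height are in floats, so the
    // row strides in float4 units are Width/4 and Height/4.
    uint tx = get_global_id(0);          // tile column in src
    uint ty = get_global_id(1);          // tile row in src
    uint srcStride = Width / 4;
    uint dstStride = Height / 4;

    // Load the four rows of the tile from global memory.
    float4 r0 = src[(ty*4 + 0)*srcStride + tx];
    float4 r1 = src[(ty*4 + 1)*srcStride + tx];
    float4 r2 = src[(ty*4 + 2)*srcStride + tx];
    float4 r3 = src[(ty*4 + 3)*srcStride + tx];

    // Transpose the 4x4 tile in private memory by shuffling components.
    float4 c0 = (float4)(r0.x, r1.x, r2.x, r3.x);
    float4 c1 = (float4)(r0.y, r1.y, r2.y, r3.y);
    float4 c2 = (float4)(r0.z, r1.z, r2.z, r3.z);
    float4 c3 = (float4)(r0.w, r1.w, r2.w, r3.w);

    // Store the tile at its transposed position: tile (tx,ty) -> (ty,tx).
    dst[(tx*4 + 0)*dstStride + ty] = c0;
    dst[(tx*4 + 1)*dstStride + ty] = c1;
    dst[(tx*4 + 2)*dstStride + ty] = c2;
    dst[(tx*4 + 3)*dstStride + ty] = c3;
}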
mfa
    *"Unfortunately, you are going to be bound by your global read and write speed of the device. Normally you transpose the matrix to do some calculation"* - that's correct. Most standard BLAS libraries even don't transpose matrices *explicitly*, but instead, offer "flags" that indicate whether a matrix should be *interpreted* as being transposed. They then may use a completely different kernel that reads this matrix not in column-major but in row-major order (or vice versa). – Marco13 Aug 19 '14 at 14:32
  • As far as I understand, in the simple implementation every work-item writes to global memory sequentially, while in the coalesced implementation several write operations can be carried out in one step. Shouldn't there still be a small boost in execution time? – MiepMiep Aug 19 '14 at 16:16
  • Yes, but if your sequential writes happen to be adjacent, you get the memory coalescing for free: the memory controller will write several floats (usually 4, but it depends on the pipeline width) at the same time. The same idea applies to reads: you are going to read 4 values anyway; it's just a matter of how many you throw away -- and possibly re-read later. – mfa Aug 19 '14 at 18:51

Read columns from the original matrix, write them as rows into local memory to avoid memory bank conflicts, and then store rows from local memory into the transposed matrix.

In this case the two write operations are easy to coalesce (neighbouring work-items write to neighbouring memory cells). The read operations are not so good, however.
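
For comparison, here is a sketch of the closely related tile-based variant in which the read is coalesced as well, by swapping the tile indices on the way out through local memory. TILE, the +1 padding against bank conflicts, and the assumption that Width and Height are multiples of TILE are my additions, not part of this answer:

#define TILE 16

__kernel void TransposeTiled(__global const float* src,
                             __global float* dst,
                             uint Width, uint Height)
{
    __local float tile[TILE][TILE + 1];  // +1 column avoids bank conflicts

    uint lx = get_local_id(0), ly = get_local_id(1);
    uint gx = get_group_id(0) * TILE;    // tile origin: src columns
    uint gy = get_group_id(1) * TILE;    // tile origin: src rows

    // Coalesced read: neighbouring work-items (in lx) read neighbouring floats.
    tile[ly][lx] = src[(gy + ly)*Width + (gx + lx)];

    barrier(CLK_LOCAL_MEM_FENCE);

    // Coalesced write: the tile coordinates are swapped, so neighbouring
    // work-items again write neighbouring floats in the transposed matrix.
    dst[(gx + ly)*Height + (gy + lx)] = tile[lx][ly];
}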

BTW, what is your device? If it is good with vector operations, use vload/vstore; it may improve I/O performance significantly.
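
A minimal illustration of those built-ins (a hypothetical copy kernel, just to show the calls; buffer names are placeholders):

__kernel void Copy4(__global const float* src, __global float* dst) {
    size_t i = get_global_id(0);
    float4 v = vload4(i, src);   // reads src[4*i] .. src[4*i + 3] in one go
    vstore4(v, i, dst);          // writes dst[4*i] .. dst[4*i + 3] in one go
}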

Roman Arzumanyan