
I am working on a piece of OpenCL code for a specialized matrix function: for a Dx1 vector v, two DxD matrices A and B, and a constant c, return the 1xD vector r where r[i] = c * sum_over_j (v[j] * A[i][j] * B[i][j]).
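In plain C, a serial reference for what I'm computing looks roughly like this (the function and parameter names are only for illustration; the matrices are stored row-major, the same layout the kernel uses):

    /* serial reference: r[i] = c * sum_j v[j] * A[i][j] * B[i][j] */
    void reference(const float *v, const float *A, const float *B,
                   float c, float *r, int D)
    {
        for (int i = 0; i < D; i++) {
            float sum = 0.0f;
            for (int j = 0; j < D; j++)
                sum += v[j] * A[i*D + j] * B[i*D + j];
            r[i] = c * sum;
        }
    }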

Below is what I have so far, but it runs freakishly slow. A version without the summing, which returns a DxD matrix instead, is about ten times faster. It's called from PyOpenCL, if that makes any difference.

Am I doing anything wrong? Can it be optimized?

    #define D 1000
    ...

    __kernel void element_mult(
        __global float *result,
        __global const float *vector,
        __global const float *matrix,
        __global const float *matrix2,
        const float factor)
    {
        int y = get_global_id(1);
        float sum = 0;
        for(int k = 0; k < D; k++)
        {
            sum += vector[k] * matrix[(y*D) + k] * matrix2[(y*D) + k];
        }
        result[y] = sum * factor;
    }

Cheers!

  • You are sure, aren't you, that the computation of y*D is lifted out of the k loop by your compiler? And that the common sub-expression (y*D)+k is only computed once in each iteration? – High Performance Mark Feb 23 '12 at 13:20
  • Are you running this on an NVIDIA GPU, per chance? – talonmies Feb 23 '12 at 13:40
  • @talonmies, I can't be sure. The computation is not done locally on my computer; basically it just needs to be OpenCL. – trolle3000 Feb 23 '12 at 14:06
  • @trolle3000: The memory access patterns your code uses will be very bad for performance on an NVIDIA GPU, if that happens to be the target. This calculation is effectively the dot product of the Hadamard product of two matrices with a vector? – talonmies Feb 23 '12 at 14:23
  • @talonmies, that's correct. Could you elaborate on the memory access pattern? – trolle3000 Feb 23 '12 at 14:41
  • @talonmies, I think you did here: http://stackoverflow.com/questions/6045473/elementwise-operations-in-opencl-cuda – trolle3000 Feb 23 '12 at 14:43
  • @trolle3000: NVIDIA GPUs rely on a process called memory coalescing to achieve high memory bandwidth (and this is bandwidth limited). Ideally work items (threads) in the same SIMD group (warps) should read from sequential word locations. Your code would be reading with a pitch of `D` words. That will generate 32 times more memory transactions than would be the case if the reads could be coalesced. It can have a *huge* impact on performance. – talonmies Feb 23 '12 at 14:52
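To put a number on that last comment: with one work item per output row, the 32 work items of a warp issue simultaneous reads that are D floats (4000 bytes) apart. A stripped-down sketch of that pattern, reduced to a single matrix (the kernel and variable names here are only illustrative):

    // one work item per row: at a given k, neighbouring work items read
    // m[y*D + k] and m[(y+1)*D + k] -- D floats apart, so a warp's 32 reads
    // become 32 separate memory transactions instead of one coalesced one
    #define D 1000
    __kernel void strided_row_sum(__global const float *m,
                                  __global float *out)
    {
        int y = get_global_id(0);
        float s = 0.0f;
        for (int k = 0; k < D; k++)
            s += m[y*D + k];
        out[y] = s;
    }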

1 Answer


Optimization #1: make vector __local.

My first pass at this got a decent improvement in performance. I noticed that each vector[k] is read a total of D times, so I copied it into a __local buffer. This is only possible because D is small enough for the whole vector to fit in local memory. The kernel as you have it above suffers from a terrible ALU:fetch ratio of 0.08 on both the 5870 and the 6970 GPUs; even the slower GPUs are still waiting on memory access.

    #define D 1000
    __kernel void element_mult(
        __global float *result,
        __global const float *vector,
        __global const float *matrix,
        __global const float *matrix2,
        const float factor)
    {
        int y = get_global_id(0);
        float sum = 0.0f;

        // stage the vector in local memory: every one of its D elements is
        // otherwise re-fetched from global memory by every work item
        __local float vectCopy[D];
        int ls = get_local_size(0);
        int lid = get_local_id(0);
        for(int i = 0; i < D; i += ls){
            if(i + lid < D)            // guard: D need not be a multiple of ls
                vectCopy[i + lid] = vector[i + lid];
        }
        // a barrier (not just a mem_fence) so the whole work group has
        // finished filling vectCopy before any work item reads from it
        barrier(CLK_LOCAL_MEM_FENCE);

        for(int k = 0; k < D; k++)
        {
            sum += vectCopy[k] * matrix[(y*D) + k] * matrix2[(y*D) + k];
        }
        result[y] = sum * factor;
    }

With this change, APP Profiler shows a new ALU:fetch ratio of 0.20 for both the 5870 and 6970 GPUs. Average kernel times dropped from 1513 to 1034 and from 1261 to 861 on the same cards. The low-end GPUs are now bound by ALU instead of fetch (greater than a 4:1 ratio).

Optimization #2: calculate each result[y] using an entire work group.

You would have to do this if D were much larger (100k+). The idea is to get the best memory access pattern by using the whole work group to compute a single element of the result at a time. I defined ls (local size) to be 64 here because it works on my hardware as well as most vendors'. The workgroup size you use from the host side will have to be 64 unless you change that definition (see the host-side sketch after the kernel below). It needs to be a compile-time constant so that the sum[ls] storage can be declared as __local, and I don't like passing variable-sized __local buffers into my kernels.

Results: 5870 ALU:fetch = 0.59:1, avg = 708; 6970 ALU:fetch = 0.72:1, avg = 590. According to APP Profiler, this is about twice as fast as your original listing.

    #define D 1000
    #define ls 64
    __kernel void element_mult(
        __global float *result,
        __global const float *vector,
        __global const float *matrix,
        __global const float *matrix2,
        const float factor)
    {
        // stage the vector in local memory, as in optimization #1
        __local float vectCopy[D];
        int lid = get_local_id(0);
        for(int i = 0; i < D; i += ls){
            if(i + lid < D)            // guard: D need not be a multiple of ls
                vectCopy[i + lid] = vector[i + lid];
        }
        barrier(CLK_LOCAL_MEM_FENCE);  // whole group must finish the copy first

        int ng = get_num_groups(0);
        int gid = get_group_id(0);
        int y, k;
        __local float sum[ls];
        // each work group walks over its own set of rows; within a row the
        // work items read consecutive words, so the global reads coalesce
        for(y = gid; y < D; y += ng){
            sum[lid] = 0.0f;           // reset this work item's partial sum
            for(k = lid; k < D; k += ls)
            {
                sum[lid] += vectCopy[k] * matrix[(y*D) + k] * matrix2[(y*D) + k];
            }
            barrier(CLK_LOCAL_MEM_FENCE);  // partial sums must be complete
            if(lid == 0){
                result[y] = sum[0];
                for(k = 1; k < ls; k++){
                    result[y] += sum[k];
                }
                result[y] *= factor;
            }
            barrier(CLK_LOCAL_MEM_FENCE);  // work item 0 is done reading sum[]
        }
    }
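For completeness, one possible host-side launch for this kernel, written in plain C (from PyOpenCL you would pass the same global and local sizes). The queue and kernel objects are assumed to be set up already, error handling is omitted, and the sizes below are just one valid choice:

    #include <CL/cl.h>

    /* hypothetical helper: enqueue the work-group version of element_mult */
    cl_int launch_element_mult(cl_command_queue queue, cl_kernel kernel)
    {
        size_t local  = 64;          /* must match the ls #define in the kernel */
        size_t global = 64 * 1000;   /* ls * D: one work group per result row   */
        return clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                      &global, &local, 0, NULL, NULL);
    }

If you did want a variable-sized __local buffer instead of the compile-time ls constant, the host would supply it with clSetKernelArg(kernel, arg_index, size_in_bytes, NULL).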

EDIT: APP profiler = AMD APP KernelAnalyzer
