OpenCL fast iteration through all pairs

Question

I am quite new to OpenCL and find it hard to reason about all the consequences of executing on a GPU. I am trying to write a simulation: I have 2D points and need to calculate the "gravity" forces acting between all of them. My best attempt at an OpenCL kernel looks like this:

kernel void ker_fun(global const double* pts, uint pts_size, global double* fxs, global double* fys, double vertexRepulsion)
{
    double x=pts[2*get_global_id(0)];
    double y=pts[2*get_global_id(0)+1];
    double fx=0;
    double fy=0;
    for (size_t i=get_global_id(0)+1; i<pts_size; ++i) {
        double dx=x-pts[2*i];       // point[i] -> points[THIS]
        double dy=y-pts[2*i+1];
        double r2=pow(dx, 2)+pow(dy, 2);
        r2=max(r2, 0.0001);      // to prevent (r2==0) issue
        double f=vertexRepulsion/r2;   // uses the kernel argument; gravityConstant was not declared anywhere
        double ratio=f/sqrt(r2);
        dx*=ratio;
        dy*=ratio;
        fx+=dx;
        fy+=dy;
        atomic_add_double(&fxs[i], -dx);
        atomic_add_double(&fys[i], -dy);
    }
    atomic_add_double(&fxs[get_global_id(0)], fx);
    atomic_add_double(&fys[get_global_id(0)], fy);
}

where fxs and fys are the force values in the X and Y directions (i.e. my result), and the atomic_add_double function is copied from this site (OpenCL - using atomic reduction for double).

This function works and calculates the desired result, but it is slow. Could you please advise me how to do this in a different, better way?

Thank you for your time and help.

It's the atomics that are slowing it down. Maybe you can store results per-work-item and then do a parallel sum reduction step afterwards. — Dithermaster, Apr 21 '18 at 23:01
@Dithermaster what about the cache? Is that not a problem too? I had an algorithm which, for each point, iterated through all the others and stored the result in private memory (without any kind of synchronization/atomics). It was even slower. So atomics are slow, but they are probably not that bad. — google2, Apr 22 '18 at 08:34
As @Dithermaster said, there is a good chance that the synchronization slows down your kernel. However, other factors, such as work-group size and global cache behaviour, might play a role in your problem as well. Could you provide the host code that calls your kernel, with values for your arrays? — denvercoder, Apr 23 '18 at 14:39
