2

I have a numpy array centers which is N row by 3 columns and contains the 3D coordinates of the center of the spheres. Then another Nx1 array radii which contains the radii corresponding to the spheres. Finally, I have a third array points which are the points that I want to check to see if they are inside of the sphere. To do the check, I know that I need to find d = np.sqrt((points[0]-sphere[0])**2 + (points[1]-sphere[1])**2 + (points[2]-sphere[2])**2) for all the points and all the spheres but I'm not sure how to do this in a vectorized manner. Is this possible? Thank you.

Additional info:

  • I would like to run this on the GPU using CuPy, or otherwise have extremely fast performance.
  • There are between 2-200 spheres, and 300k points.
adamcircle
  • 694
  • 1
  • 10
  • 26
  • `numpy` only? `scipy` has many more distance methods for `numpy` arrays. Take a look [here](https://stackoverflow.com/questions/52366421/how-to-do-n-d-distance-and-nearest-neighbor-calculations-on-numpy-arrays/52366706#52366706) to see the different best-practice methods. – Daniel F Apr 20 '22 at 12:10
  • 1
    This can be done efficiently with [Scipy's KD Tree](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.KDTree.query_ball_point.html#scipy.spatial.KDTree.query_ball_point) – norok2 Apr 20 '22 at 12:12
  • @DanielF I'm actually looking to use CuPy so that this can be done on the GPU. I don't think scipy spatial is supported but the other methods might be. – adamcircle Apr 20 '22 at 12:14
  • How big is N and and the number of points? – Jérôme Richard Apr 20 '22 at 14:31
  • @JérômeRichard 300k points, somewhere between 2 and 200 spheres – adamcircle Apr 20 '22 at 14:53

2 Answers2

3

When the number of sphere is large, it is generally a good idea to use a data structure like a quadtree, k-D tree and especially ball tree. This is what Scipy uses internally and this result in an algorithm running in O(M log(N)) time where N is the number of sphere and M is the number of points.

The thing is implementing such data structure efficiently on GPU is pretty insane (because they are not SIMD friendly). In your case, the number of sphere is small enough so that using a naive algorithm running in O(N M) should be relatively fast.

The solution of @DanielF is a good start but is is not very efficient as is creates several temporary arrays and use a memory access pattern that is not SIMD friendly (which is critical on GPU so to get reasonable performances). The temporary arrays cause slow allocations and memory accesses while GPU shine for heavy computational operations (typically on large arrays).

To solve this problem, you can write your own kernel. The thing is the input array are not efficiently stored. They should be transposed so the computation can make more SIMD-friendly memory accesses and transposing arrays is relatively slow on GPU (as well as creating new arrays). Please consider using a more suitable input data layout. Additionally, computing the square root is not required since you can compare the result to the squared radius. Here is an example computing a "collision" matrix (where rows are spheres and columns are points):

import cupy as cp

points = cp.random.rand(300_000, 3)
spheres = cp.random.rand(200, 3)
radii = cp.random.rand(200)

checkInSphere = cp.RawKernel(r'''
extern "C" __global__
void checkInSphere(const double* points, int pointCount, const double* spheres, const double* radii, int sphereCount, bool* found) {
    int tid = blockDim.x * blockIdx.x + threadIdx.x;

    if(tid < pointCount)
    {
        const double px = points[tid];
        const double py = points[tid+pointCount];
        const double pz = points[tid+pointCount*2];

        for(int i=0 ; i<sphereCount ; ++i)
        {
            const double r = radii[i];
            const double sx = spheres[i];
            const double sy = spheres[i+sphereCount];
            const double sz = spheres[i+sphereCount*2];
            const double dx = px - sx;
            const double dy = py - sy;
            const double dz = pz - sz;
            const double sqDist = dx * dx + dy * dy + dz * dz;
            found[i*pointCount+tid] = sqDist <= r * r;
        }
    }
}
''', 'checkInSphere')

threadPerBlock = 64
blockCount = (points.size + (threadPerBlock - 1)) // threadPerBlock
found = cp.empty((spheres.shape[0], points.shape[0]), dtype=cp.bool_)
checkInSphere((blockCount,), (threadPerBlock,), (points.T.copy(), points.shape[0], spheres.T.copy(), radii, spheres.shape[0], found))

This code is about 4 times faster in double precision than the one of @DanielF (ie. the last version without the square root and the check). The thing is my GPU like most GPUs is VERY slow to operate on double-precision (DP) floating-point numbers. In fact, mainstream client-side GPUs are not designed for such workload and are generally slower than CPUs for that. However, (expensive) server-side GPUs can efficiently operate on double-precision numbers. Thus, please consider working with simple-precision floating-point (SP) numbers if you are using a basic mainstream GPU. If you really want to use DP, then this computation is certainly faster on a CPU using tools like Numba. Note that using SP operations are generally also faster on server-side GPUs as well as CPUs. On my machine, the SP version of this code takes about 0.7 ms and is about 15 times faster than the one of @DanielF (updated version).

For better performance, please consider changing the input layout, pre-allocate all the buffers, pre-compute the squares of radii, and make use of shared memory so to load the sphere locations only once. Loop unrolling and register tiling should also help to improve performance even further though it will make the code barely readable. One could also store only the index of the closest sphere for each point, or even only a boolean to say if the point is in any sphere so to reduce the size of the output (memory accesses are expensive).

Jérôme Richard
  • 41,678
  • 6
  • 29
  • 59
1

Since CuPy locks us out of scipy.spatial functions, we can fall back to numpy-only functions. In this case, based on the first part of the answer here and (inexpertly) converted to CuPy methods:

d_sq =  cupy.einsum('ij, ij ->i', centers , centers)[:, None] +\   
        cupy.einsum('ij, ij ->i', points , points) -\          
        2 * cupy.dot(centers, points.T)

This squared distance matrix can be then checked against radii by d_sq < radii[:, None] ** 2. This saves a numerically-costly square root step.

Daniel F
  • 13,620
  • 2
  • 29
  • 55
  • This solution is not efficient if N is big as the complexity is `O(N M)` while there are faster algorithm running in `O(M log(N))` time. Scipy uses the efficient algorithm so it can be faster on the CPU than this GPU code for large N (which is sad). Additionally, I do not expect `cupy.linalg.einsum` to be fast in this case. – Jérôme Richard Apr 20 '22 at 14:36
  • I'm hardly an expert in `CuPy`, so I shall defer there. However I think your method is mostly gaining speedup from not doing `sqrt` and instead squaring the targets, which is a trick I put in my reference answer but then completely forgot to do here >.<. I have since edited to add. – Daniel F Apr 21 '22 at 07:46
  • Indeed, the square root is slow, especially in raw kernels, but not so much in your solution. I updated the timings and now the results are is x4 in DP (previously x5) and x15 in SP (previously x20). The solution as the benefit of being simpler though. Btw, AFAIK `einsum` is not in `cupy.linalg` (it does not work on my machine) and the `\` fail on my ipython terminal. – Jérôme Richard Apr 21 '22 at 11:33
  • Admittedly, I based my `cupy` translation on documentation found [here](https://docs.cupy.dev/en/stable/reference/linalg.html) which includes `einsum`, but did not test (as I don't have a relevant environment on my machine). Perhaps it is a version issue, or perhaps it is simply a case of the documentation not matching the actual implementation. – Daniel F Apr 21 '22 at 14:09
  • Aand digging a little deeper into those docs, it seems I should not have included the `.linalg` - fixed (I think) – Daniel F Apr 21 '22 at 14:13
  • And also the question has been updated from "`cupy`" to "`cupy` or fast" in which case ball-tree is absolutely the way to go. – Daniel F Apr 21 '22 at 14:16