
The current searchsorted works perfectly with 1-D arrays:

import cupy as cp
ref = cp.arange(5, dtype=cp.float32)
secs = cp.arange(10, dtype=cp.float32)
cp.searchsorted(ref, secs)

The result is:

array([0, 1, 2, 3, 4, 5, 5, 5, 5, 5])

I want a function multi_searchsorted that searches each row of ref against the corresponding row of secs in parallel. For example:

For ref:

array([[0., 1., 2., 3., 4.],
       [5., 6., 7., 8., 9.]], dtype=float32)

and secs:

array([[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
       [5., 6., 7., 8., 9., 0., 1., 2., 3., 4.]], dtype=float32)

it should produce:

array([[0, 1, 2, 3, 4, 5, 5, 5, 5, 5],
       [0, 1, 2, 3, 4, 0, 0, 0, 0, 0]])

Does anybody know of any open source code implementing this? If I have to write it myself, do I need to write a kernel from scratch, or can I reuse the existing searchsorted function in some way?

Any other suggestions are welcome.
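
For reference, the behaviour I want is just this per-row loop (sketched in NumPy only to pin down the semantics; a Python-level loop over rows is exactly what I want to avoid):

```python
import numpy as np

def multi_searchsorted_loop(ref, secs):
    # Reference semantics: an independent 1-D searchsorted per row.
    return np.stack([np.searchsorted(r, s) for r, s in zip(ref, secs)])

ref = np.array([[0., 1., 2., 3., 4.],
                [5., 6., 7., 8., 9.]], dtype=np.float32)
secs = np.array([[0., 1., 2., 3., 4., 5., 6., 7., 8., 9.],
                 [5., 6., 7., 8., 9., 0., 1., 2., 3., 4.]], dtype=np.float32)
print(multi_searchsorted_loop(ref, secs))
```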

  • The easiest solution might be to just asynchronously do multiple calls to `searchsorted` on multiple CUDA streams. There is probably not much to gain from doing it in a single launch. – paleonix Dec 15 '22 at 11:34
  • In my opinion, the gains from a single launch could be significant if the row width is small, say, less than 10,000 elements, and especially if there are a large number of "small" rows. In that case, I would either [write my own segmented binary search kernel](https://stackoverflow.com/questions/21658518/search-an-ordered-array-in-a-cuda-kernel), which should be doable in cupy RawKernel, or else (better) use a vectorized search from thrust. I still wonder if there is a clever way to do the segmented search using cupy searchsorted, however. Perhaps if the domain is limited, with offsets. – Robert Crovella Dec 16 '22 at 01:20
  • 1
    [this](https://stackoverflow.com/questions/40588403/vectorized-searchsorted-numpy) may be of interest. – Robert Crovella Dec 16 '22 at 01:23
  • 1
    Thank you for your comments @paleonix @robert-crovella. I find pytorch implement n-dimensional `searchsorted`. Since there is no cost to convert data between pytorch and cupy, this should be the simplest way. But the speed needs to be test. I will test the speed of pytorch `searchsorted` and a clever version of cupy `searchsorted`. – Kang Liang Dec 16 '22 at 07:11

1 Answer


pytorch has an nD searchsorted, and there is a smart way to convert the nD problem to 1-D, as @robert-crovella mentioned. I wrote two versions to test their speed:

import cupy as cp
import torch
from cupyx.profiler import benchmark

The first one is basically a simple wrapper around pytorch searchsorted:

def torch_searchsorted(ref, sec):
    # View the cupy arrays as torch tensors.
    _ref = torch.as_tensor(ref)
    _sec = torch.as_tensor(sec)
    indices = torch.searchsorted(_ref, _sec, side='right')
    # Convert the result back to a cupy array.
    indices = cp.asarray(indices)
    return indices

The second is the smart way:

def cupy_searchsorted(a, b):
    m, n = a.shape
    # Shift each row by a multiple of max_num so the rows occupy
    # disjoint, increasing value ranges.
    max_num = cp.maximum(a.max() - a.min(), b.max() - b.min()) + 1
    r = max_num * cp.arange(m)[:, None]
    # One flat 1-D search over the raveled arrays answers all rows at once.
    p = cp.searchsorted((a + r).ravel(), (b + r).ravel(), side='right').reshape(m, -1)
    # Row i's flat indices start at i*n; subtract to get per-row indices.
    return p - n * cp.arange(m)[:, None]
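
To see why the offset trick works, here is the same logic traced in plain NumPy (illustration only; the cupy version above is identical): shifting row i by i * max_num moves each row into its own non-overlapping value range, so the raveled reference array stays sorted and a single 1-D searchsorted answers every row, after which subtracting i * n localizes the indices back to each row.

```python
import numpy as np

a = np.arange(20, dtype=np.float32).reshape(4, 5)      # reference rows
b = np.arange(-1, 19, dtype=np.float32).reshape(4, 5)  # query rows
m, n = a.shape

# Shift row i by i * max_num so rows occupy disjoint value ranges.
max_num = np.maximum(a.max() - a.min(), b.max() - b.min()) + 1
r = max_num * np.arange(m)[:, None]

# The shifted reference array is globally sorted, so one flat search works.
p = np.searchsorted((a + r).ravel(), (b + r).ravel(), side='right').reshape(m, -1)

# Row i's results land in the flat index block [i*n, (i+1)*n]; localize them.
out = p - n * np.arange(m)[:, None]
print(out)
```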

First test the correctness:

ref = cp.arange(20, dtype=cp.float32).reshape(4,5)
sec = cp.arange(-1,19, dtype=cp.float32).reshape(4,5)
torch_out = torch_searchsorted(ref,sec)
cupy_out = cupy_searchsorted(ref,sec)

The results are the same:

array([[0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4],
       [0, 1, 2, 3, 4]])

Then test the speed:

width = 100
nlines = 10000
ref = cp.arange(-1.1,-1.1+width*nlines, dtype=cp.float32).reshape(nlines,width)
sec = cp.arange(-1.5,-1.5+width*nlines, dtype=cp.float32).reshape(nlines,width)
print(benchmark(torch_searchsorted,(ref, sec), n_repeat=100))
print(benchmark(cupy_searchsorted,(ref, sec), n_repeat=100))

The result is quite surprising:

torch_searchsorted  :    CPU: 8305.727 us   +/-737.419 (min: 8109.499 / max:15409.231) us     GPU-0: 8937.603 us   +/-737.363 (min: 8741.632 / max:16042.624) us
cupy_searchsorted   :    CPU:  354.953 us   +/- 9.935 (min:  346.325 / max:  429.802) us     GPU-0:  378.429 us   +/- 9.438 (min:  370.688 / max:  450.560) us

The cupy version is much faster. I wondered whether the data conversion costs too much time, so I did a further test, benchmarking torch.searchsorted on pre-converted tensors:

_ref = torch.as_tensor(ref)
_sec = torch.as_tensor(sec)
print(benchmark(torch.searchsorted,(_ref, _sec), n_repeat=100))

I get

searchsorted        :    CPU: 5669.537 us   +/-86.444 (min: 5646.031 / max: 6487.139) us     GPU-0: 5677.324 us   +/-86.665 (min: 5653.792 / max: 6496.576) us

It gets faster, but still not as fast as the cupy one. The test is on a V100. The pytorch version is

pytorch 1.13.1 py3.9_cuda11.7_cudnn8.5.0_0 pytorch

The cupy version is 11.2.

Conclusion:

  • Even though pytorch's searchsorted supports nD, its performance is not as good.
  • The data conversion between cupy and pytorch is not free.