
I have a 2-dimensional NumPy array. Each row is sorted and contains a small number of elements (like 10), but there are a large number of rows (like 1e6). It might look something like this:

haystacks = [
    [1, 4, 7, 8],
    [2, 5, 5, 7],
    [10, 11, 25, 30],
    ...
]

I also have a one-dimensional array. This array has as many elements as the first array has rows. So, maybe:

needles = [10, 6, 15, ...]

I want to perform binary search for each element in the 1d array on the corresponding row in the 2d array. I would use np.searchsorted, but it doesn't seem to support this use-case.

I am using this in a large simulation of a physical system. So, performance is extremely important. The following code works, but it is too slow.

import numpy as np

positions = []
for needle, haystack in zip(needles, haystacks):
    positions.append(np.searchsorted(haystack, needle))

print(positions)

A NumPy solution is preferred. Other libraries are okay, but I am having trouble getting Numba to work.

Anyone have any ideas?

Leo Ware
  • Have you tried preallocating memory for `positions`? Use NumPy arrays rather than lists to store the results; that should speed things up. – vahvero Dec 13 '21 at 09:05
  • Considering how small the rows are, it might be faster to use a broadcasted *linear* search instead of a binary search inside a Python loop (see the sketch below). – user2357112 Dec 13 '21 at 09:06
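A minimal sketch of that broadcasted linear search, assuming haystacks and needles are already NumPy arrays: counting how many entries in each row are strictly smaller than the needle reproduces np.searchsorted's default side="left" result.

# Broadcast the comparison across rows, then count True values per row.
positions = (haystacks < needles[:, None]).sum(axis=1)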

2 Answers


Here is a working Numba solution. You might want to replace the range loop with numba.prange (and pass parallel=True to the decorator) if you want to spread the work across threads; a sketch of that variant follows the code.

import numpy as np
import numba

haystacks = np.array([
    [1, 4, 7, 8],
    [2, 5, 5, 7],
    [10, 11, 25, 30],
])
needles = np.array([10, 6, 16])

@numba.njit
def search(needles, haystacks):
    # One binary search per row; Numba compiles the loop to machine code.
    positions = np.zeros_like(needles)
    for idx in range(len(needles)):
        positions[idx] = np.searchsorted(haystacks[idx], needles[idx])
    return positions

print(search(needles, haystacks))
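
A minimal sketch of the parallel variant mentioned above (assuming parallel=True is passed to the decorator):

@numba.njit(parallel=True)
def search_parallel(needles, haystacks):
    positions = np.zeros_like(needles)
    # numba.prange splits the independent per-row searches across threads.
    for idx in numba.prange(len(needles)):
        positions[idx] = np.searchsorted(haystacks[idx], needles[idx])
    return positions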

Numba gives better performance:
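For the timing below, search_nb refers to the @numba.njit function above, and search_np is assumed to be the plain NumPy loop from the question wrapped into a function, roughly:

def search_np(needles, haystacks):
    # Pure-Python loop, one np.searchsorted call per row.
    positions = np.zeros_like(needles)
    for idx in range(len(needles)):
        positions[idx] = np.searchsorted(haystacks[idx], needles[idx])
    return positions

search_nb = search  # the compiled function defined above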

import timeit

print(timeit.timeit(lambda: search_np(needles, haystacks), number=100_000))
print(timeit.timeit(lambda: search_nb(needles, haystacks), number=100_000))
# 1.103232195999908 for np
# 0.3278064189998986 for nb
vahvero

Here's a clever way to do this: append the needle to each row (og_combo), sort across axis 1 (fast), and take the first index in each row where the re-sorted row differs from the original:

import numpy as np
import pandas as pd

haystack = np.random.randint(0, 10, size=(1_000_000, 4))
needles = np.random.randint(0, 10, size=(1_000_000,))

haystack = np.sort(haystack, axis=1)
# Append a huge sentinel so the needle can never end up in the last column.
haystack_inf = np.hstack((haystack, (np.ones(len(haystack)) * 1e10).reshape((-1, 1))))

# Tack the needle onto the end of every row, then sort each row again.
og_combo = np.hstack((haystack_inf, needles.reshape(-1, 1)))
combo_sorted = np.sort(og_combo, axis=1)

# The first column where the re-sorted row differs from the original row
# is the insertion position of the needle.
diff = og_combo - combo_sorted
df = pd.DataFrame(np.argwhere(diff != 0), columns=["col1", "col2"])
final = df.groupby("col1")["col2"].first().values

Haystack (random example):

array([[1, 2, 7, 8],
       [0, 2, 7, 8],
       [2, 2, 4, 9],
       ...,
       [0, 3, 5, 6],
       [0, 1, 4, 9],
       [0, 4, 6, 7]])

Needles (same random example):

array([5, 6, 2, ..., 9, 0, 2])

Final positions (same random example):

array([2, 2, 2, ..., 4, 1, 1])
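
As a quick sanity check, the result can be compared against an explicit per-row binary search; note that when a needle ties with haystack values, this construction matches np.searchsorted with side="right" rather than the default side="left":

# Hypothetical check against a per-row binary search (same random example).
reference = np.array(
    [np.searchsorted(row, n, side="right") for row, n in zip(haystack, needles)]
)
assert np.array_equal(final, reference)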
cosmic_inquiry