
I am working on a matching problem. I have two arrays A and B of the same size (assume 1000x1000). For each element a_ij in A (i and j are the row and column indices, respectively), I need to find the closest element in the i-th row of B; the output stores the column index of that closest element. Right now, I have come up with three solutions:

  1. Nested loop (Takes 5.47s)
  2. Single loop with broadcasting (Takes 6.40s)
  3. Parallel loop (Takes 30.27s)

I think the above three methods are not time-efficient enough, or at least my implementations are not good enough (the 3rd method takes far longer than I expected!). It would be nice if you could point out how I can improve my code. Thanks in advance!

My current solutions are as follows:

import numpy as np
import time
from joblib import Parallel, delayed

def method1(A,B): # Nested loop (as a Benchmark)
    output = np.zeros(A.shape)
    for r in range(A.shape[0]):
        rowB = B[r]
        for c in range(A.shape[1]):
            elementA = A[r,c]
            diff_vector = np.abs(elementA - rowB)
            output[r,c] = np.argmin(diff_vector)
    return output

def method2(A,B): # Broadcasting
    output = np.zeros(A.shape)
    for r in range(A.shape[0]):
        diff_matrix = np.abs(A[r][:, np.newaxis] - B[r])
        output[r] = np.argmin(diff_matrix, axis=1)
    return output

def method3(A,B): # Parallel for loop
    def matcher(r, A, B):
        i = r//A.shape[1]
        j = r % A.shape[1]

        elementA = A[i, j]
        rowB = B[i]
        diff_vector = np.abs(elementA - rowB)
        return np.argmin(diff_vector)

    output = Parallel(n_jobs=4)(delayed(matcher)(r, A, B) for r in range(A.shape[0]*A.shape[1]))
    output = np.reshape(output, [A.shape[0], A.shape[1]])
    return output

A = np.random.randn(1000,1000)
B = np.random.randn(1000,1000)
output1 = method1(A,B)
output2 = method2(A,B)
output3 = method3(A,B)
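
For reference, a simple way to reproduce the timing comparison with the time module imported above (a sketch, not the exact measurement code):

for name, fn in [("method1", method1), ("method2", method2), ("method3", method3)]:
    start = time.time()
    fn(A, B)
    print(f"{name}: {time.time() - start:.2f}s")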
J. Leung
  • You are comparing each *point* in A with all values in B's respective row: for broadcasting without a loop it seems like you need to add a dimension to B to make it orthogonal to A; if A and B are planes perpendicular to your point of view, you need to rotate B 90 degrees in the third dimension. Memory consumption will *explode* to a 1000x1000x1000 array. Sorry, I don't have numpy installed here to test. – wwii Jan 06 '22 at 17:10
  • If the memory consumption is a problem, you can maybe iterate over 100 (or 10, ..., 50) rows at a time and perform that broadcasting on smaller chunks while still reducing the number of Python loop iterations. ... For each 100-row slice of A and B: rotate the B slice; subtract that from the A slice; use argmin on the result (see the sketch after these comments). – wwii Jan 06 '22 at 17:14
  • Sorting B and A along each row should help find the closest values faster, by iterating along A and B simultaneously. Then you would have to "unsort" the results. – dankal444 Jan 06 '22 at 18:30
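
A minimal sketch of the chunked broadcasting suggested in the comments above (the function name method2_chunked and the chunk size of 50 are arbitrary choices for illustration):

def method2_chunked(A, B, chunk=50):
    # Broadcast over `chunk` rows at a time, so the temporary difference array
    # has shape (chunk, A.shape[1], B.shape[1]) instead of the full
    # (1000, 1000, 1000) that a fully broadcast version would allocate.
    output = np.empty(A.shape, dtype=np.int_)
    for start in range(0, A.shape[0], chunk):
        stop = start + chunk
        diff = np.abs(A[start:stop, :, None] - B[start:stop, None, :])
        output[start:stop] = np.argmin(diff, axis=2)
    return output

With chunk=50, the temporary is about 50 * 1000 * 1000 * 8 bytes ≈ 400 MB of float64, so the chunk size has to be chosen to fit the available memory.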

1 Answer


The algorithm (in all methods) is clearly inefficient: checking every item of rowB to find the closest one is expensive and results in several billion floating-point operations. On top of that, Numpy creates an expensive, unnecessary temporary array for each evaluation of elementA - rowB and np.abs.

You can speed this up by doing a sort followed by binary searches. The time complexity of this approach is O(n**2 log n), while that of the initial approach is O(n**3). You can also easily write a parallel implementation of this algorithm using Numba. Moreover, you do not need to store the output in a float64-typed array: an integer array is enough (faster and possibly smaller).
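
For reference, the same sort + binary-search idea can be written in plain NumPy with a single Python loop over the rows (a non-parallel sketch; the name method4_numpy is just for illustration):

def method4_numpy(A, B):
    mB = B.shape[1]
    output = np.empty(A.shape, dtype=np.int_)
    for r in range(A.shape[0]):
        rowA, rowB = A[r], B[r]
        order = np.argsort(rowB)                   # sort the row of B once
        sorted_rowB = rowB[order]
        idxs = np.searchsorted(sorted_rowB, rowA)  # binary search for all of rowA at once
        left = np.abs(rowA - sorted_rowB[np.maximum(idxs - 1, 0)])
        right = np.abs(rowA - sorted_rowB[np.minimum(idxs, mB - 1)])
        prev_is_closer = (idxs == mB) | (left < right)
        output[r] = order[idxs - prev_is_closer]   # map back to the unsorted indices
    return output

The Numba version below applies the same per-row logic, but compiled and parallelised over the rows.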

Here is the resulting Numba implementation:

import numba as nb
import numpy as np

@nb.njit('int_[:,::1](float64[:,::1],float64[:,::1])', parallel=True)
def method4(A,B):
    mB = B.shape[1]
    output = np.empty(A.shape, dtype=np.int_)

    # Parallel loop
    for r in nb.prange(A.shape[0]):
        rowA = A[r]
        rowB = B[r]

        # Sort a row
        index_rowB = np.argsort(rowB)
        sorted_rowB = rowB[index_rowB]

        # Fast binary search in the sorted row
        # See: https://stackoverflow.com/a/46184652/12939557
        idxs = np.searchsorted(sorted_rowB, rowA)
        left = np.fabs(rowA - sorted_rowB[np.maximum(idxs-1, 0)])
        right = np.fabs(rowA - sorted_rowB[np.minimum(idxs, mB-1)])
        prev_idx_is_less = (idxs == mB) | (left < right)

        # Find the position in the original unsorted array
        output[r] = index_rowB[idxs - prev_idx_is_less]

    return output

output4 = method4(A,B)
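
A quick sanity check against the brute-force result (assuming no exact ties within a row of B, in which case the two approaches may legitimately return different, equally close indices):

assert np.array_equal(output1, output4)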

This is 482 times faster than the fastest initial implementation (method1) on my 10-core machine:

method1:  4341 ms
method2:  5839 ms
method4:     9 ms
Jérôme Richard
  • Could this be optimized by sorting the rows of A as well? I think it could be O(n**2) then. – dankal444 Jan 06 '22 at 19:38
  • 1
    If you sort A rows, then the algorithm still runs in `O(n**2 log n)` because of the sorts. Additionally, then there is no need for a binary search, but the sorting should be slower (x2). Thus, I do not think it will be significantly faster. Note that the `argsort` takes 50% of the time and `searchsorted` roughly 45% on my machine. – Jérôme Richard Jan 06 '22 at 20:16
  • You are perfectly right! BTW how do you profile numba code? – dankal444 Jan 06 '22 at 23:56
  • 1
    This is unfortunately pretty hard. One simple solution is to cheat by measuring the time with only the useful operation and add some side effects so the JIT cannot optimise the operations. This is what I have done here. Another method is to use a low-level hardware-event-based profiler like perf or VTune. The thing is the Numba functions are unnamed and sometimes the assembly code is not even accessible by profilers (bug hopefully events are always accessible)... – Jérôme Richard Jan 07 '22 at 00:23
  • @Jérôme Richard Thanks for your suggestions. They are very useful. – J. Leung Jan 07 '22 at 06:01
  • @JérômeRichard Thank you for the info. I did a small search on Google and found no simple solution either. – dankal444 Jan 07 '22 at 08:43