
I've recently been trying to compute distances to the top 2 nearest neighbors in Python with Numba, as follows:

import numpy as np
from numba import jit

@jit(nopython=True)
def _latent_dim_kernel(data, pointers, indices, nrange, sampling_percentage=1):

    pdists_t2 = np.zeros((nrange, 2))
    for a in range(nrange):

        rct = 0
        for b in range(nrange):
            if np.random.random() > 1 - sampling_percentage:
                if a == b:
                    continue

                r1 = _get_sparse_row(a, data, pointers, indices)
                r2 = _get_sparse_row(b, data, pointers, indices)

                dist = np.linalg.norm(r2 - r1)

                if rct > 1:
                    if dist < pdists_t2[a, 0]:
                        # new closest neighbor: demote the previous closest to second place
                        pdists_t2[a, 1] = pdists_t2[a, 0]
                        pdists_t2[a, 0] = dist
                    elif dist < pdists_t2[a, 1]:
                        pdists_t2[a, 1] = dist
                else:
                    pdists_t2[a, rct] = dist
                    if rct == 1 and pdists_t2[a, 1] < pdists_t2[a, 0]:
                        # keep the first two seeds ordered
                        pdists_t2[a, 0], pdists_t2[a, 1] = pdists_t2[a, 1], pdists_t2[a, 0]

                rct += 1

    return pdists_t2

The data, pointers and indices are x.data, x.indptr and x.indices of a scipy CSR matrix. This works fine; however, it is substantially slower than doing

np.sort(squareform(pdist(matrix)), axis=1)[:, 1:3]

How can I speed this further without additional memory overhead?

Thanks!

sdgaw erzswer
  • Instead of optimizing this algorithm, have you tried using a data structure that's optimized for this, e.g. scipy.spatial.cKDTree? https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.cKDTree.html – MaxNoe Feb 01 '20 at 09:22
  • What are the approximate shape and sparsity of the input matrix? – max9111 Feb 01 '20 at 13:58
  • @MaxNoe thanks for the suggestion; no, I did not plan to go that deep, but I might need to. @max9111, the sparsity is around 0.05, with 1000 rows and 20k columns. – sdgaw erzswer Feb 02 '20 at 12:14

1 Answer


Make use of pairwise distances from sklearn

  • Pairwise distances of sparse matrices are supported (no dense temporary array needed)
  • This algorithm uses an algebraic reformulation (‖a − b‖² = ‖a‖² + ‖b‖² − 2 a·b), like in this answer
  • It can be a lot faster on high-dimensional problems like yours (20k columns), since most of the calculation is done within a highly optimized matrix-matrix product
  • Check whether this method is precise enough for you; it is less numerically stable than the "naive" approach pdist uses
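The algebraic reformulation in question expands ‖a − b‖² into ‖a‖² + ‖b‖² − 2 a·b, so the bulk of the work becomes one matrix product. A minimal dense sketch (the helper name sq_dists is mine, not from scikit-learn):

```python
import numpy as np

def sq_dists(X):
    # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
    # the X @ X.T term is where the optimized matrix product does the work
    sq_norms = np.einsum('ij,ij->i', X, X)          # row-wise squared norms
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    np.maximum(d2, 0.0, out=d2)                     # clip tiny negatives from rounding
    return d2

X = np.random.rand(5, 8)
# reference: naive elementwise squared distances
ref = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
print(np.allclose(sq_dists(X), ref))  # → True
```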

Example

import numpy as np
from scipy import sparse
from sklearn import metrics
from scipy.spatial import distance

matrix=sparse.random(1_000, 20_000, density=0.05, format='csr', dtype=np.float64)

%%timeit
dist_2=distance.squareform(distance.pdist(matrix.toarray()))
dist_2.sort(axis=1)
dist_2=dist_2[:,1:3]
#10.1 s ± 23.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
dist=metrics.pairwise.euclidean_distances(matrix,squared=True)
dist.sort(axis=1)
dist=np.sqrt(dist[:,1:3])
#401 ms ± 13.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
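To check the precision caveat from the bullet list, the two results can be compared directly on a smaller instance; a quick sketch (the tolerance is my own guess, not a guarantee):

```python
import numpy as np
from scipy import sparse
from scipy.spatial import distance
from sklearn import metrics

matrix = sparse.random(200, 2_000, density=0.05, format='csr',
                       dtype=np.float64, random_state=0)

# reference: naive pdist on the densified matrix
ref = distance.squareform(distance.pdist(matrix.toarray()))
ref.sort(axis=1)
ref = ref[:, 1:3]

# sparse-aware path via the algebraic reformulation
d = metrics.pairwise.euclidean_distances(matrix, squared=True)
d.sort(axis=1)
d = np.sqrt(d[:, 1:3])

# inspect the worst-case deviation between the two methods
print(np.abs(d - ref).max())
```

For well-separated random rows like these the deviation should be negligible; the instability mainly shows up for near-duplicate rows, where the subtraction ‖a‖² + ‖b‖² − 2 a·b cancels catastrophically.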
max9111