
Suppose I have a nearest neighbor classifier. For a new observation it computes the distance between the new observation and every observation in the "known" data set, and returns the class label of the known observation with the smallest distance to the new observation.

import numpy as np

# 8 known observations with 5 features each, 16 new observations to classify,
# and a binary class label for each known observation
known_obs = np.random.randint(0, 10, 40).reshape(8, 5)
new_obs = np.random.randint(0, 10, 80).reshape(16, 5)
labels = np.random.randint(0, 2, 8)

def my_dist(x1, known_obs, axis=0):
    # squared Euclidean distance between x1 and every known observation
    return np.square(np.linalg.norm(x1 - known_obs, axis=axis))

def nn_classifier(n, known_obs, labels, axis=1, distance=my_dist):
    # label of the known observation nearest to n
    return labels[np.argmin(distance(n, known_obs, axis=axis))]

def classify_batch(new_obs, known_obs, labels, classifier=nn_classifier, distance=my_dist):
    # classify each new observation in turn (this is the loop I want to avoid)
    return [classifier(n, known_obs, labels, distance=distance) for n in new_obs]

print(classify_batch(new_obs, known_obs, labels, nn_classifier, my_dist))

For performance reasons I would like to avoid the for loop in the classify_batch function. Is there a way to use numpy operations to apply the nn_classifier function to each row of new_obs? I already tried np.apply_along_axis, but as is often mentioned, it is convenient rather than fast.
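For reference, a sketch of what an np.apply_along_axis version might look like (the exact call is an assumption, not necessarily the one I used); it returns the same labels but still invokes nn_classifier once per row:

# hypothetical apply_along_axis variant: same result,
# but the Python function is still called once per row of new_obs
result = np.apply_along_axis(
    lambda row: nn_classifier(row, known_obs, labels, distance=my_dist),
    axis=1,
    arr=new_obs,
)
print(result)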

nhoeft
  • The loop itself isn't the cause of the slowdown. _Anything_ that involves directly calling a python function will be slow. Look for ways to use [broadcasting](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) and [ufuncs](https://docs.scipy.org/doc/numpy/reference/ufuncs.html). – senderle May 06 '17 at 22:58
  • Have you tried k-means? – Divakar May 06 '17 at 22:59
  • See also [this question](http://stackoverflow.com/questions/8079061/function-application-over-numpys-matrix-row-column) (almost a duplicate) and [this message](http://www.mail-archive.com/numpy-discussion@scipy.org/msg00587.html) linked to from the top answer. – senderle May 06 '17 at 23:08

1 Answer


The key to avoiding the loop is to express the action as an operation on the whole (16, 8) array of distances. The labels[] indexing and argmin steps just cloud the issue.

If I set labels = np.arange(8), then this

arr = np.array([my_dist(n, known_obs, axis=1) for n in new_obs])
print(arr)
print(np.argmin(arr, axis=1))

produces the same thing. It still uses a list comprehension, but we are working closer to the raw distance array.

[[  32.  115.   22.  116.  162.   86.  161.  117.]
 [ 106.   31.  142.  164.   92.  106.   45.  103.]
 [  44.  135.   94.   18.   94.   50.   87.  135.]
 [  11.   92.   57.   67.   79.   43.  118.  106.]
 [  40.   67.  126.   98.   50.   74.   75.  175.]
 [  78.   61.  120.  148.  102.  128.   67.  191.]
 [  51.   48.   57.  133.  125.   35.  110.   14.]
 [  47.   28.   93.   91.   63.   49.   32.   88.]
 [  61.   86.   23.  141.  159.   85.  146.   22.]
 [ 131.   70.  155.  149.  129.  127.   44.  138.]
 [  97.  138.   87.  117.  223.   77.  130.  122.]
 [ 151.   78.  211.  161.  131.  115.   46.  164.]
 [  13.   50.   31.   69.   59.   43.   80.   40.]
 [ 131.  108.  157.  161.  207.   85.  102.  146.]
 [  39.  106.   67.   23.   61.   67.   70.   88.]
 [  54.   51.   74.   68.   42.   86.   35.   65.]]
[2 1 3 0 0 1 7 1 7 6 5 6 0 5 3 6]

With

print((new_obs[:,None,:] - known_obs[None,:,:]).shape)

I get a (16, 8, 5) array of pairwise differences. So I can apply linalg.norm on the last axis.

This seems to do the trick

np.square(np.linalg.norm(diff, axis=-1))

So together:

diff = (new_obs[:,None,:] - known_obs[None,:,:])
dist = np.square(np.linalg.norm(diff, axis=-1))
idx = np.argmin(dist, axis=1)
print(idx)
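To get back class labels rather than row indices (i.e. with the question's random labels instead of np.arange(8)), just index labels with idx. A minimal sketch of a fully vectorized replacement for classify_batch, using the question's variable names (classify_batch_vec is a new name introduced here):

def classify_batch_vec(new_obs, known_obs, labels):
    # broadcast to pairwise differences: (16, 1, 5) - (1, 8, 5) -> (16, 8, 5)
    diff = new_obs[:, None, :] - known_obs[None, :, :]
    # squared Euclidean distances, shape (16, 8)
    dist = np.square(np.linalg.norm(diff, axis=-1))
    # index of the nearest known observation for each new observation
    idx = np.argmin(dist, axis=1)
    return labels[idx]

# should match the original loop-based classify_batch
print(classify_batch_vec(new_obs, known_obs, labels))

Everything here is broadcasting and ufunc work, so there is no per-row Python function call.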
hpaulj