
I'm running into a performance bottleneck when using a custom distance metric function for a clustering algorithm from sklearn.

The result as shown by Run Snake Run is this:

[Run Snake Run profiler screenshot: nearly all of the runtime is inside dbscan_metric, roughly 71 s of it in norm (linalg.py) and about 170 s elsewhere, per the comments below]

Clearly the problem is the dbscan_metric function. The function looks very simple and I don't quite know what the best approach to speeding it up would be:

import numpy as np

def dbscan_metric(a, b):
  # sklearn sometimes passes vectors of a different length (see comments),
  # hence the shape check; full-length vectors get per-feature weights
  if a.shape[0] != NUM_FEATURES:
    return np.linalg.norm(a - b)
  else:
    return np.linalg.norm(np.multiply(FTR_WEIGHTS, (a - b)))

Any thoughts as to what is causing it to be this slow would be much appreciated.

houbysoft
  • How big are those arrays? If you get rid of the if statement and enforce one of the two branches for the whole dataset, does that speed it up? You could also try `len(a) != NUM_FEATURES` instead and see if that's any faster. – Joran Beasley Jul 21 '14 at 18:47
  • The answer depends on how big the arrays a and b are. Also, which numpy version are you using? Given the profile they are probably small and you are dominated by Python overhead; in that case you would need to use Cython to reduce it. – jtaylor Jul 21 '14 at 18:58
  • Does it seem odd that 71 sec are spent in norm@linalg.py and 170 sec are spent elsewhere? I thought I knew how to make sense of this diagram, but that just seems odd. All I can guess is that the extra 170 sec are somehow involved in call overhead. Could you try inlining? – user1245262 Jul 21 '14 at 18:58
  • The out-of-local-scope lookups of NUM_FEATURES and FTR_WEIGHTS might also be taking a while. – sirlark Jul 21 '14 at 19:00
  • @JoranBeasley: I tried len(a) as well and it was similar. The arrays are not that big; NUM_FEATURES is 130. I believe my entire dataset is like that, but for some reason `sklearn` sometimes calls the function with a smaller length, which is why I had to add the length check. – houbysoft Jul 21 '14 at 19:39
  • @jtaylor: I'm using numpy 1.6.2; the arrays a and b should be 130 floats each. I'll look into cython, but if it's calling overhead wouldn't it just call back into numpy (which I don't think is compiled) anyway? – houbysoft Jul 21 '14 at 19:40
  • @user1245262: the function is passed to sklearn as a callable so I can't inline it as far as I know. – houbysoft Jul 21 '14 at 19:41
  • You can probably speed it up a bit by following the advice in this answer: http://stackoverflow.com/questions/9171158/how-do-you-get-the-magnitude-of-a-vector-in-numpy/9184560#9184560. – user545424 Jul 21 '14 at 20:55
  • @user545424: just tried that and it is indeed a bit faster (220 seconds vs 245), but still nowhere near the speed I'd like – houbysoft Jul 21 '14 at 21:12
  • @sirlark: update: moving the two constants to local scope didn't help much either – houbysoft Jul 21 '14 at 21:53
  • @houbysoft, what is the shape and size of the arrays? I think you are probably limited by the overhead of calling Python functions with such small arrays. For 1D arrays of size 100, you are still about a factor of 10 off of pure C performance. It's not until you start getting arrays with sizes > 10,000 that they are almost equal. – user545424 Jul 21 '14 at 22:11
  • @user545424: yeah, they're 1D arrays of size 130 each. I suppose I'll just precompute the distance matrix for all of them, instead of passing a callable to calculate the distance on demand (a sketch of this is below). – houbysoft Jul 21 '14 at 22:14
  • @user545424 NumPy's `norm` has been changed to do exactly that in recent versions (1.8.0, I think). – Fred Foo Jul 22 '14 at 13:12
  • You might try [*this method*](http://stackoverflow.com/a/4299378/23771). That way you can find out if you're calling the low-level function more than you need to. – Mike Dunlavey Sep 03 '14 at 00:08
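Putting the suggestions from the comments together, here is a minimal sketch. The names `X` (the full data matrix, one 130-float row per sample) and the `eps` value are placeholders, and `dbscan_metric_fast` is a hypothetical helper. The first variant is the `sqrt(d.dot(d))` trick from user545424's link; the second avoids the Python callback entirely, which works because `norm(FTR_WEIGHTS * (a - b))` equals the plain euclidean distance between `FTR_WEIGHTS * a` and `FTR_WEIGHTS * b`:

import numpy as np
from sklearn.cluster import DBSCAN

# Variant 1: drop-in replacement metric using the dot-product trick,
# which skips some of the overhead inside np.linalg.norm
def dbscan_metric_fast(a, b):
  d = a - b
  if a.shape[0] == NUM_FEATURES:
    d = FTR_WEIGHTS * d
  return np.sqrt(d.dot(d))

# Variant 2: eliminate the per-pair Python call altogether by scaling
# the data once and letting DBSCAN use its built-in euclidean metric
X_weighted = X * FTR_WEIGHTS           # broadcasts the weights over rows
db = DBSCAN(eps=0.5).fit(X_weighted)   # eps=0.5 is a placeholder value
labels = db.labels_

Variant 2 should be the big win here, since it moves all distance computation into sklearn's compiled code path instead of calling back into Python for every pair.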

1 Answer


I am not familiar with what the function does, but is there a possibility of repeated calculations? If so, you could memoize the function:

import numpy as np

cache = {}

def dbscan_metric(a, b):

  diff = a - b

  if a.shape[0] != NUM_FEATURES:
    to_calc = diff
  else:
    to_calc = np.multiply(FTR_WEIGHTS, diff)

  # numpy arrays are not hashable, so key the cache on the raw bytes
  key = to_calc.tobytes()
  if key not in cache:
    cache[key] = np.linalg.norm(to_calc)

  return cache[key]
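Note: numpy arrays are not hashable, so the version above keys the cache on `to_calc.tobytes()` (on numpy < 1.9 the equivalent is `to_calc.tostring()`). Also bear in mind that DBSCAN compares O(n²) pairs, so the cache only saves time if the dataset contains many duplicate rows; otherwise it mostly adds hashing overhead.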
zallarak