
I am trying to improve the performance of the OPTICS clustering algorithm. The open-source implementation I've found uses a for loop over each sample and can run for hours...

I believe some use of the repmat() function may help performance when the system has enough RAM. You are more than welcome to suggest other ways of improving the implementation.

Here is the code:

x is the data: an [m x n] array where m is the sample count and n is the feature dimensionality, which is usually significantly greater than one.

[m,n] = size(x);

% D(i,j) ends up holding the squared Euclidean distance
% between samples i and j
for i = 1:m
    D(i,:) = sum(((repmat(x(i,:),m,1)-x).^2),2).';
end

Many thanks.

user2324712
  • The for loop might not be the problem. Have you tried the profiler to see where the bottleneck is? I would recommend using a small version of x and separating each function in the line; the profiler will then tell you which part is taking the most time. – Matt Jul 07 '15 at 23:06
  • Also, you are creating a row of all zeros, then squaring it, then summing. It may be faster to use two loops and avoid repmat in favor of operating over the non-zero rows. Also, do you preallocate the D array? (See the sketch after these comments.) – Matt Jul 07 '15 at 23:12
  • To get performance, you really want to use an index. Try the ELKI version and enable a k-d-tree or similar index, then run OPTICS with an epsilon just large enough. You'll be surprised by the performance difference! I only have Octave, not Matlab, but ELKI was somewhere between 100x and 1000x faster. – Has QUIT--Anony-Mousse Jul 08 '15 at 06:40
  • Did [`this posted solution`](http://stackoverflow.com/a/31308903/3293881) help out in reducing runtime? – Divakar Jul 10 '15 at 14:23
  • Please see my reply below. – user2324712 Jul 10 '15 at 20:20
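
A minimal sketch of the profiling-plus-preallocation idea from Matt's comments above, assuming a small random matrix stands in for the real data:

x = rand(500,20);      % small stand-in for the real data, just for profiling
[m,n] = size(x);

profile on
D = zeros(m,m);        % preallocate instead of growing D row by row
for i = 1:m
    D(i,:) = sum(((repmat(x(i,:),m,1)-x).^2),2).';
end
profile viewer         % inspect where the time actually goes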

1 Answer


With enough RAM to play with, you can use a few approaches here.

Approach #1: With bsxfun & permute -

D = squeeze(sum(bsxfun(@minus,permute(x,[3 2 1]),x).^2,2))
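
Here, `permute` reshapes `x` into a 1 x n x m array so that `bsxfun` broadcasts the subtraction across every pair of samples, producing an m x n x m cube that is squared and summed along the feature dimension.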

Approach #2: With pdist & squareform -

D = squareform(pdist(x).^2)
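
`pdist` returns the pairwise Euclidean distances as a vector, so squaring it and expanding with `squareform` recovers the same squared-distance matrix that the loop builds. Note that `pdist` requires the Statistics and Machine Learning Toolbox.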

Approach #3: With matrix-multiplication based Euclidean distance calculations -

xt = x.';
[m,n] = size(x);
D = [x.^2 ones(size(x)) -2*x ]*[ones(size(xt)) ; xt.^2 ; xt];
D(1:m+1:end) = 0;  % force exact zeros on the diagonal
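
The block products above just evaluate the standard expansion of the squared distance, summed over the n features:

||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2*(x_i . x_j)

Because the three terms are computed separately and then cancelled in floating point, the diagonal entries can come out as tiny nonzero (even slightly negative) values rather than exact zeros, which is why the last line resets them.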

For performance, my bet would be on approach #3!
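
A quick sanity check of all three approaches against the original loop (a sketch, using a small random x as stand-in data; approach #2 needs the Statistics and Machine Learning Toolbox):

x = rand(500,20);                      % small random test data
[m,n] = size(x);

% reference: the original loop
Dref = zeros(m,m);
for i = 1:m
    Dref(i,:) = sum(((repmat(x(i,:),m,1)-x).^2),2).';
end

D1 = squeeze(sum(bsxfun(@minus,permute(x,[3 2 1]),x).^2,2));   % approach #1

D2 = squareform(pdist(x).^2);                                  % approach #2

xt = x.';                                                      % approach #3
D3 = [x.^2 ones(size(x)) -2*x ]*[ones(size(xt)) ; xt.^2 ; xt];
D3(1:m+1:end) = 0;

max(abs(D1(:)-Dref(:)))   % expected: 0 (exact match)
max(abs(D2(:)-Dref(:)))   % expected: negligible floating-point error
max(abs(D3(:)-Dref(:)))   % expected: negligible floating-point error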

Divakar
  • Hi, first of all, thank you for your reply. I've tested all your methods; they all give the same result, but alas, it is different from the one obtained by the original algorithm. I will try to debug it; maybe you can try the same. I can attest that they are indeed significantly faster, about 100% faster computationally. – user2324712 Jul 10 '15 at 20:15
  • I +1'd your linked approach #3 post btw, because it was informative. – user2324712 Jul 10 '15 at 20:25
  • OK, messed with it some more. Version 1 gives the exact same result, and version 2 gives a negligible error. Both versions run in about the same time as my version. Version 3 is indeed the fastest but still produces an error. I am looking further into it. – user2324712 Jul 10 '15 at 20:34
  • Got version 3 to work with a negligible difference compared to the dataset. The reason for the error was that I had to guess what you meant by the variable 'M', which was not defined in the description (I guessed it was 'm' but it was 'x'). The only issue left is that the first column produces small negative numbers where it should equal 0. I zeroed it out manually, but I will look into this more deeply to understand your method. Thank you. – user2324712 Jul 10 '15 at 20:53
  • @user2324712 Sorry, yeah, that `M` must be confusing. Basically those diagonal values were slightly off from exact zeros, so this line `D(1:m+1:end) = 0;` corrects it. I've edited approach #3 to fix that confusing `M` part. Please check it out! – Divakar Jul 11 '15 at 05:13