
Hello everyone. I have a matrix of 3-dimensional data points called "data", of size N-by-3. Right now, I am trying to obtain two things:

First, the indices "m" and "n" of a distance matrix "Dist", where

Dist = squareform(pdist(data));

Such that

[m,n] = find( Dist<=rc & Dist>0 );

where "rc" is a certain cutoff distance, "m" is the row index, and "n" is the column index.

Second, the conditional distances "ConDist", where

D = pdist(data);
ConDist = D( D<=rc & D>0 );

This code works fine for small "data" (N < 3500); however, for large "data" (N > 25000), it takes far too much time/memory. Therefore, I tried to reduce the time/memory cost as follows:

Dist = zeros(size(data,1));                       % preallocate the full matrix
Dist(tril(true(size(data,1)),-1)) = pdist(data);  % fill the lower triangle only
[m,n] = find(Dist <= rc  &  Dist > 0);
ConDist = Dist(Dist <= rc  &  Dist > 0);

Here, I fill only the lower triangle myself instead of calling "squareform", hoping to reduce the computation time (or memory; I am not sure which of the two this actually saves). However, building the "Dist" variable still takes a lot of time/memory.
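
To see which stage dominates, the two steps can be timed separately; note that for N = 25000 the pdist output alone is already N(N-1)/2 ≈ 3.1e8 doubles (about 2.5 GB), and the full square matrix doubles that. A quick tic/toc sketch:

% Sketch: time the two stages separately to locate the bottleneck
tic;  pd = pdist(data);        t_pdist = toc
tic;  Dist = squareform(pd);   t_squareform = toc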

Would there be a faster/less-memory-consuming way to calculate "m", "n", and "ConDist"? Thank you very much in advance.

  • `pdist` doesn't calculate each distance twice. That is, it calculates only the lower triangle. `squareform` just reshapes and copies that triangle to generate the square matrix. You should first determine which one is taking so much time: `pdist` or `squareform` (probably the former). – Luis Mendo Feb 19 '15 at 12:59
  • Thank you for your reply, Luis Mendo. Actually, I figured that squareform does consume a lot of time, so I have changed my code to use `Dist(tril(true(size(data,1)),-1)) = pdist(data)`. However, now it turns out that pdist takes a lot of time/memory. – JimC Feb 19 '15 at 13:05
  • Are you performing clustering? In some cases you might not need to compute **all** pairwise distances if you use a [clever data structure](http://en.wikipedia.org/wiki/Category:Trees_(data_structures)) instead: AABB-tree, octree, kd-tree,... (a concrete sketch follows these comments). – knedlsepp Feb 19 '15 at 13:31
  • Did you have a chance to test out the solution posted here? – Divakar Feb 20 '15 at 09:23
  • Thank you for your reply, knedlsepp. To be honest, I'm not sure whether this is a clustering analysis or not... I checked out the link you sent, but there were a lot of clustering methods... I think I need to read up more. Thank you very much! – JimC Feb 21 '15 at 21:03
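
To make the tree-based suggestion concrete: if the Statistics Toolbox is available, `rangesearch` builds a kd-tree and only ever touches pairs within the cutoff, so the full N-by-N matrix is never formed. A minimal sketch (the post-processing of the cell outputs is illustrative; `repelem` needs R2015a or later):

%// kd-tree based neighbor search within cutoff rc (Statistics Toolbox)
[idx, d] = rangesearch(data, data, rc);  %// N-by-1 cell arrays per query point

len = cellfun(@numel, idx);              %// number of neighbors per point
m = repelem((1:size(data,1)).', len);    %// query (row) indices
n = [idx{:}].';                          %// neighbor (column) indices
ConDist = [d{:}].';                      %// corresponding distances

keep = m ~= n;                           %// drop self-matches (distance 0)
m = m(keep);  n = n(keep);  ConDist = ConDist(keep);

Like the original squareform-based version, this returns each pair in both orders; keep only m > n if one order is enough.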

1 Answer


This could be one approach -

N = size(data,1); %// data size

%// Store transpose of data, as we need it at several places later on
data_t = data.';

%// Calculate squared distances with a matrix-multiplication based technique:
%// each lower-triangle entry is |x_i|^2 - 2*x_i.x_j + |x_j|^2 = |x_i - x_j|^2
sqdist = tril([-2*data data.^2 ones(N,3)]*[data_t ; ones(3,N) ; data_t.^2]);

%// Logical array, same size as the distance array, marking entries within threshold
mask_dists = sqdist <= rc^2  &  sqdist > 0;

%// Indices & distances from the distance array that satisfy the thresholding criteria
[m,n] = find(mask_dists);
ConDist = sqrt(sqdist(mask_dists));

You can introduce bsxfun here to replace tril (keeping the rest as it is) and see if that speeds things up a bit further -

sqdist = [-2*data data.^2 ones(N,3)]*[data_t ; ones(3,N) ; data_t.^2];
mask_dists = sqdist <= rc^2  &  sqdist > 0  &  bsxfun(@ge,(1:N)',1:N);
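
On a small random sample you can sanity-check either variant against the original pdist-based result (the data and cutoff below are made up purely for this check):

%// Sanity check on made-up test data
data = rand(500,3);  rc = 0.2;  N = size(data,1);  data_t = data.';
sqdist = tril([-2*data data.^2 ones(N,3)]*[data_t ; ones(3,N) ; data_t.^2]);
Dist = squareform(pdist(data));          %// reference result

lt = tril(true(N),-1);                   %// strict lower triangle
ConDist_new = sqrt(sqdist(lt & sqdist <= rc^2 & sqdist > 0));
ConDist_old = Dist(lt & Dist <= rc & Dist > 0);
max(abs(ConDist_new - ConDist_old))      %// ~1e-12: rounding differences only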
  • Thank you for your answer, Divakar. I actually tried it out, and it did get faster! Before, it took 1.36 s, and now it only takes 0.6 s! (This timing is based on the trial dataset, which only has N < 3000.) I do have one concern though. The ConDist variable has a slight value difference between the previous approach and the newly introduced one. I'm guessing this might be due to the sqrt step. Could there be a way to solve this problem? – JimC Feb 21 '15 at 21:08
  • @JimC It could be, yes! How big is the difference? The `max` of the difference between the two results might tell you. – Divakar Feb 21 '15 at 21:27
  • The maximum difference came out to be 3.4435e-12, and the standard deviation came out as 2.7056e-13. The differences seem small enough to ignore; however, I wish the new approach could give exactly the same answer as the previous one. – JimC Mar 02 '15 at 01:03
  • @JimC That will always be there! Those are the limitations of floating-point arithmetic! – Divakar Mar 02 '15 at 04:06
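
The rounding difference is easy to reproduce in isolation; the two formulas for the same distance disagree at the level of machine precision:

%// Same distance computed two ways differs only by rounding
x = rand(1,3);  y = rand(1,3);
d1 = sqrt(sum((x - y).^2));                    %// pdist-style formula
d2 = sqrt(sum(x.^2) + sum(y.^2) - 2*(x*y.'));  %// matrix-multiplication style
abs(d1 - d2)                                   %// on the order of eps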