Optimize code for efficiency

Question

I have created the code to perform a mapping of points from one vector to another based on Euclidean distance and have checked that it works correct.

However it is taking too much time. Essentially I have created a matrix for euclidean distance of A and B vectors and found the minimum of it. After I denote the mapping of these points I delete the row and column from the euclidean matrix by marking them as NaN for the next mapping to occur.

Can this code be more efficient because it is awfully slow right now...

Euclid = distance(A,B); % calculates euclid distance column v/s column wise. 
for var = 1 : value
    %# finds the min of Euclid and its position, when Euclid is viewed as a 1D array
    [~, position] = min(Euclid(:)); 

    %#transform the index in the 1D view to 2 indices
    [i,j] = ind2sub(size(Euclid),position);

    %display(strcat(num2str(i),32, num2str(j)));

    mapping = [A(1,i) A(2,i) B(1,j) B(2,j)];
    fprintf(FID,'%d %d %d %d\n', mapping );

    Euclid( i , : ) = NaN;
    Euclid( : , j ) = NaN;
    %counter = counter + 1;
end

The problem is that for a 5000 X 5000 matrix the code just hangs for a long time...

Can somebody help me out...

It is equivalent to 5000... depends on the number of points in the space — anon, Apr 14 '12 at 05:26

score 5 · Accepted Answer · edited May 23 '17 at 11:43

I suppose something to try is reshaping the array of distances into a 1-D array, where you keep careful record of how that new 1-D index will map back into the 2-D index. Then just call a sorting function on the 1-D array, to sort it into ascending order. Make sure to save the permutation of the indices that causes the sort, and then read-off the 2-D array coordinates that correspond to the 1-D coordinates in the sort permutation. In this method, you'll be at the mercy of Matlab's sort algorithms, and you'll need to keep careful track of the index changes from reshaping.

This also poses a problem for removing points from consideration. You would have to keep a running list of indices for points that have already been selected, and if (as you traverse the 1-D permutation) you encounter something that corresponds to an index already selected, then you just skip it (e.g. you don't set it to NaN, you just skip it from consideration).

This saves you from needing to compute a minimum every time. The only real checking at each iteration of a for-loop that traverses the 1-D sort permutation is the logical check whether the current location has been selected before (due to one of its points being involved in a smaller distance location). On iteration i, this check takes at most i-1 comparisons, so this will be slightly less than O(n^2).

This will be faster than your current method, which calculates the min of the whole array every time, but still not as fast as the O(n log n) method mentioned in the link below.

More generally, you want to read the papers linked in the answers to this question. This is not a trivial problem, not well suited for the RAM intensive Matlab sessions, and probably will require you to rewrite your whole approach to this.

Another idea is that if you broadcast all of the array A to a bunch of processors, then this is embarrassingly parallel on the points in array B. You can ship out pieces of B to different processors, and get back a list of all the distances. You'll have to do some processing back on the head node to select out the index for the min-distance and remove those points, but not too much. Probably you can re-write that part anyway so that you don't need to compute the distances multiple times.

So if you write this in, say, Python or C++, you could quickly use the MPI library and then either run it on a cloud cluster at Amazon Web Services, or on a local cluster if you have access.

Also, note the best complexity quoted there is O(n log n), so you can benchmark your code and see how close/far you are to that. For large data, you're going to probably do much worse than this just by virtue of it being in Matlab. — ely, Apr 14 '12 at 02:40
Hey , in the above code the maximum time that is being taken is by finding the minimum point in complete matrix and then setting each row and column to NaN.... Can't those operations be more vectorised than what I did ? — anon, Apr 14 '12 at 02:47
Okay but can you explain what you meant by rewriting that part ? Even now distance is computed only once... — anon, Apr 14 '12 at 02:52
And also would knnsearch in matlab perform better by deleting the matrix and finding knn ? — anon, Apr 14 '12 at 03:03
I don't think you want to do k-nearest neighbors. That won't really represent the distance between the two sets. There might be some optimized variant of knn that can work for this, but my guess is you'd have to build that algorithm yourself and without it being optimized to use library functions compiled in C, etc., it would be too slow. I added a few paragraphs at the beginning of my answer that give a naive way to improve on what you're doing. It's not the greatest, but anything that stops you from computing the min every time will be a big help. — ely, Apr 14 '12 at 03:05

Optimize code for efficiency

1 Answers1