Efficient algorithm to calculate pairwise distances for a large set of locations in matlab

Question

I recently got a problem when calculating the spherical distance with large set of locations in matlab. Here is the problem:

1.locations are latitudes and longitudes, stored in a vector locs with dimension nx2 where n = 10^5.

2.I want to calculate the spherical distance between any two locations in the vector locs. Note that the resulted distance matrix may or may not sparse, since I cannot check it without distance. Let's assume that it's sparse, because we can truncate the distance with a number, say 1000. (The units are km), and the resulted distance matrix is enough for my usage.

3.I tried two methods to calculate it, but each of them has drawbacks.

PS: I run the code on Mac OS with MATLAB 2014 student version. The function great_circle_distance() is pretty like Matlab built-in function calculating the great circle distance between any two locations with geospatial coordinates. The problem here is not relevant to usage of this function.

Thanks for all kinds of suggestions in advance.

--Disc

Method1:

dist = zeros(n, 1);
dist_vec = [];
dist_id = [];
dist_id2 = [];
for i =1:n
dist = great_circle_distance(locs(i, :), locs);
dist_temp=(dist<300) &(dist>0);   % for example, consider the distance between 0 and 300
[dist_id_temp, dist_id2_temp,dist_vtemp]=find(dist_temp);
dist_vec_temp=dist(dist_id_temp);
dist_id=[dist_id; dist_id_temp];
dist_id2=[dist_id2; dist_id2_temp];
dist_vec=[dist_vec; dist_vec_temp];

if mod(i, 1000) == 0

    i

end
end
dist_mat = sparse(dist_id, dist_id2, dist_vec);

The drawback here is that it took long time to finish this, at least 48 hours, but it consumed not too much memory around 2-5G.

Method2:

d = pdist(locs, @great_circle_distance);
[i, j] = find(tril(true(n), -1)); % extract index below main diagonal
d = d';
a = [i, j, d];
con = d == 0;   % specify condition. 
a(con, :) = [];   % delete rows satisfying condition con.
clear i j d;
i = a(:, 1);
j = a(:, 2);
d = a(:, 3);
dist_mat = sparse(i, j, d);

The drawback of this method is that it consumed too much memory (exceeding 15G) when calculating d in the code.

One solution. Cover the entire sphere by overlapping bounding boxes whose sides are all at least 600 miles long. Put each point in all bounding boxes it belongs in. Calculate and filter pairwise distances within each bounding box. Collect and deduplicate all pairs of points found (a given pair can be in 2 bounding boxes, and so could show up twice). — btilly, Sep 23 '14 at 19:09

Amro · Answer 1 · 2014-09-26T12:58:17.617

1

1) Method #1

You are being very inefficient in the way you build the intermediate variables (arrays changing size and such). Consider this improved implementation:

% given some data (longitude/latitude)
locs = rand(N,2);    % N = 10^5 in your case

% pick an appropriate size (guesstimate the number of nonzeros in the sparse mat)
nzmx = ...;

% compute pairwise-distance matrix (only those satisfying some condition)
D = spalloc(N, N, nzmx);
for i=1:N
    d = great_circle_distance(locs(i,:), locs);
    idx = find(100<d & d<400);   % distances in the range [100,400] km
    D(idx,i) = d(idx);
end

Note that for N=10^5 entries, the full distance matrix of type double would consume: N*N*8 bytes, or approximately ((10^5)^2*8)/2^30 = 74.5 GB of memory (that's gigabytes)! So obviously I'm assuming you have enough RAM to hold the sparse matrix even after applying your condition, otherwise the computation cannot be done outright (in that case you'll have to break it into chunks, and compute one piece at-a-time). Say only 5% of the matrix is non-zero, that would be like 4GB of contiguous memory. So you have to choose an appropriate value for nzmx in the above code.

You mentioned that the distance function great_circle_distance is not relevant to the discussion here. Just be sure that it is properly vectorized, otherwise it could affect performance severely..

Update: Here is one possible improvement by only filling the lower half of the sparse matrix:

D = spalloc(N, N, nzmx);
for i=1:N-1
    d = great_circle_distance(locs(i,:), locs(i+1:end,:));
    idx = find(100<d & d<400);
    if isempty(idx), continue; end
    D(idx+i,i) = d(idx);
end

2) Method #2

Obviously since you're using pdist, it will compute the whole matrix (well only half of it). So you gotta make sure you have enough memory to store it (about ((N*(N-1)/2)*8)/2^30 = 37.25 GB according to my calculations).

edited Sep 26 '14 at 12:58

answered Sep 23 '14 at 20:14

Amro

123,847
25
243
454

Thanks for your suggestions. I tried your suggestion for method #1, it kept running for the whole night, but it is a little bit faster than previous one. I think I need to run the for-loop in parallel. BTW, great_circle_distance function is just a transformation from geospatial coordinate to great circle distance on the surface of Earth(assuming it's a sphere) without any loops and indexing. – Bayes Sep 24 '14 at 12:42
@Disc: what did you choose for `nzmx`? If the value is too small, the matrix will grow each time it needs more space, causing it to [reallocate](http://stackoverflow.com/q/21023171/97160) over and over which is expensive.. – Amro Sep 24 '14 at 13:01
I set nzmx = floor(1e-4*n^2)=42949935, but I have no idea about how sparsity of the distance matrix should be, since I want to create the covariance matrix using the distance. If I set nzmx to a very small integer, it may not be helpful for my usage. If I understand spalloc() correctly, when I set nzmx in spalloc, the memory for the matrix D will not be reallocated. BTW, I have an idea that running the for-loop from i = N:-1:1, and only calculating the lower triangular part of the distance matrix. – Bayes Sep 24 '14 at 19:32
@Disc: actually sparse matrices (like all other array types in MATLAB) will dynamically expand capacity if they need more space (so the underlying `pr`, `ir`, and `jc` arrays will be reallocated). Of course if you do this repeatedly then it comes at a performance cost (and memory fragmentation). This is why it's important to choose an appropriate value for `nzmax` big enough to reduce the number of reallocations, but not too big as to overshoot and reserve more memory that it actually needs (as we say, in this case the full matrix is probably larger than total available RAM) – Amro Sep 26 '14 at 12:38
now you're right, we can further improve the first code by only filling half of the distance matrix (lower or upper part), shouldnt be too hard to implement it. Note: I dont think it would matter if you use `for i=1:N` or `for i=N:-1:1` here, the matrix is already preallocated at the start.. – Amro Sep 26 '14 at 12:40

Efficient algorithm to calculate pairwise distances for a large set of locations in matlab

Method1:

Method2:

1 Answers1

1) Method #1

2) Method #2