0

I have to calculate the distance of all UK postcodes against each other, then sum the population of all postcodes within 1 mile. The postcode & population list are stored in a text file. I'm most familiar with matlab, but I also have Stata & PSPP available. The program is currently scheduled to take about 2 weeks. Is there anything I can do to accelerate this process??? Here is my code. Matlab generated the script to import the text data. The distance function is from the mapping toolbox, and performs the great circle formula.

Any help is greatly appreciated.

function pcdistance(postcode, pop, lat, lon)
%Finds total population for UK postcode within 1 mile radius


fid = fopen('PPC.txt','a');

n = length(postcode);

%Calculates distance of 1 postcode at a time, against all others
%All data that doesn't meet rules is deleted


for i = 1:n;

    dist = [];
    dist(:,1)= pop;

    for j = 1:n;
        dist(j,2) = distance(lat(i),lon(i),lat(j),lon(j),3963.17);
        good = dist(1:j,2)<= 1;
    end

    dist = dist(good,:);
    tot = sum(dist(:,1));


    fprintf(1,'%s,%d;',postcode{i},tot)

end

%Find sum of population within 1 mile

fclose(fid);

end

Here's a small sample of the inputs, from the txt file. The columns are "postcode, pop, lat, long" respectively.

"BD7 1DB",749,53.79,-1.76
"M15 6AA",748,53.46,-2.24
"WR2 6AJ",748,52.19,-2.24
"M15 6PF",745,53.46,-2.23
"IP7 7RA",741,52.12,0.96
"CF62 4WA",740,51.41,-3.41
"M1 2AR",738,53.47,-2.22
"NG1 4BR",737,52.95,-1.14
"ST16 3AW",735,52.81,-2.11
"AB25 1LE",733,57.15,-2.10
"WF2 9AG",730,53.68,-1.50
"DT11 8RH",730,50.86,-2.12
"CW1 5NP",729,53.09,-2.41
"TR12 7RH",724,50.08,-5.25
"ST5 5DY",723,53.00,-2.27
"HA1 3HP",723,51.57,-0.33
"DL10 7NP",722,54.37,-1.62
"M1 7HR",719,53.47,-2.23
"B18 4AS",719,52.49,-1.93
"OX13 6JB",716,51.68,-1.30

Here's the corrected code.

function pcdistance4(postcode, pop, lat, lon)
%Finds total population for UK postcode within 1 mile radius


fid = fopen('PPC.txt','A');

n = length(postcode);


% Pre-allocation
dist = zeros(n,2);
tot = zeros(n,1);

tic

for i = 1:n;


    dist(:,1)= pop;

    dist(:,2) = distance(lat(i),lon(i),lat(:),lon(:),3963.17);

      good = dist(:, 2) <= 1 & dist(:,2) ~=0;

    tot(i) = sum(dist(good, 1));
    tot(i) = tot(i) + pop(i);

end

toc
tic

for j = 900001:n;
    fprintf(fid,'%s,%d;\n',postcode{j},tot(j));
end

toc

fclose(fid);

end

4 Answers4

3

Just a few general tips to get you started:

  1. For your personal education: run your code using profiler to see where the majority of the computational time is spent. This will be the first clue on where to begin your optimization. (link to doc)

  2. You shouldn't be writing to the hard drive on every step of the loop, since I\O are very costly in computation time. Instead you should save a bunch of strings to memory and write these "chunks" every once in a while. link1 link2

  3. You can try using parfor instead of for(link to doc). Or possibly even CUDA if available to you. (link to doc)

  4. Consider using the Geodetic Toolbox. It could be easier to convert the lat\lng coordinates to UTM (i.e. cartesian) and then use some standard functions to find distances.

Also:

  • My suggestion: don't use i and j as indices for your loops - this is often considered bad practice (due to possible confusion with imaginary numbers).
Community
  • 1
  • 1
Dev-iL
  • 23,742
  • 7
  • 57
  • 99
  • Would be "up" if not not for your last comment on i and j. It's disputable. – LordViaderko Aug 01 '14 at 05:54
  • 2
    @LordViaderko - I'm sorry you feel that way, but I'd rather lose an upvote than my integrity (by editing it out). However, your comment was noted, and I rephrased the wording slightly. – Dev-iL Aug 01 '14 at 06:43
  • Especially agree with 1 and 2. Have posted an answer with a basic approach to show that with the right practices, it shouldn't take that long to do the calculations. – Dennis Jaheruddin Aug 01 '14 at 12:15
1

You should consider memory-complexity as well. Consider pre-allocating the variable dist outside the for-loops and overwrite it element by element.

For example (see comments in the modified function):

function pcdistance(postcode, pop, lat, lon)

fid = fopen('PPC.txt','a');

n = length(postcode);

% Pre-allocation
dist = zeros(n,n);

for i = 1:n;

    % Avoid "deleting" the variable, you can overwrite it as the number of
    % elements is always the same
    % dist = [];
    dist(:,1)= pop;

    for j = 1:n;

        % Unfortunately I do tno have the mentioned toolbox, but there is a
        % high chance that you can avoid the for-loop. Probabily something
        % like:
        %    dist(:, 2) = distance(lat(i), lon(i), lat, lon, ...)
        % Try to vectorize it.
        dist(j,2) = distance(lat(i),lon(i),lat(j),lon(j),3963.17);

        % There is no need for this operation, is highly redundant and
        % computationally expensive:
        %   - in the first loop you will check 1
        %   - in the second loop you will check two elements (1 redundant)
        %   - in the jth loop you will check j elements (j-1 redundant)
        % The total redundant operations are 1+2+3+...+n-1.
        %good = dist(1:j,2)<= 1;
    end

    % better do this
    good = dist(:, 2) <= 1;

    % also memory expensive.
    % dist = dist(good,:);

    % Better do the indexing directly
    tot = sum(dist(good, 1));

end

% Write outside as recommended by Dev-iL

%Find sum of population within 1 mile

fclose(fid);

end
gire
  • 1,105
  • 1
  • 6
  • 16
  • Of course you don't need to get the toolbox for just this question, but I figured it is worth mentioning that the [Geodetic Toolbox](http://www.mathworks.com/matlabcentral/fileexchange/15285-geodetic-toolbox) is shared on File Exchange. (And thus freely available) – Dennis Jaheruddin Aug 01 '14 at 12:17
1

I cannot believe that distance() can only compare two points at a time when there should be no problem in vectorizing some sinus and cosine functions. So here is a trimmed vectorized version which i wrote for my own purposes some time ago. Maybe because i did not have that toolbox or i did not know about it. Frankly it does not give the exact same result as distance(), which i just tested. If you need the exact result of distance() you better not use this vectorized version.

function dist = distance_on_earth(lat0, lon0, lats, lons, radius)
degree2radians = pi/180;

% phi = 90 - latitude
phi0 = (90-lat0)*degree2radians;
phis = (90-lats)*degree2radians;

% theta = longitude
theta0 = lon0*degree2radians;
thetas = lons*degree2radians;

% sperical distance:
cosine = sin(phi0)*sin(phis)*cos(theta0-thetas)+cos(phi0)*cos(phis);
arc = acos(cosine);
dist = arc*radius;

Also in addition to what Dev-iL suggested, you can at least take the following line out of the inner loop:

good = dist(1:j,2)<= 1;

Good luck! Nras

Nras
  • 4,251
  • 3
  • 25
  • 37
0

The uk is so small that you can still get reasonable results without worrying about the curving of the earth. You can simply estimate the distance by using the difference in latitude and longtitude.

This example is a bit oversimplified, but would suggest that once the data is read in, you could do the actual calculations in under an hour.

x=rand(1.7e6,1);                %Fake x data
y=x;                            %Fake y data
tic
for t=1:1.7e3                   % One thousandst part of the work to be done
    (x-0.5).^2+(x-0.2).^2>0.01; %Simple distance calculation from a point (0.5,0.2), then comparing to treshold
end
toc                             %Runs for about 2 seconds

Using the real distance may take a bit longer, but it should still not take more than 1 or 2 hours to complete.

Dennis Jaheruddin
  • 21,208
  • 8
  • 66
  • 122