I have a large dataset (df) of 300,000 houses, with the longitude and latitude of each observation. The first 10 observations are shown below as df1:

df1 <- read.table(sep=",", col.names=c("lat", "lon"), text="
53.543526,-8.047727
51.88029, -9.583830
52.06056, -9.488551
51.87087, -9.577604
51.89530, -8.454321
51.95688, -7.851760
53.37621, -6.392430
53.37719, -6.234660
51.88029, -9.583830
51.88145, -9.600894")

First, I tried to compute the distance from every observation in my dataset (all 300,000) to a single point, using the code below (taken from the question "Calculate distance between two long lat coordinates in a dataframe"):

centre = c(53.543526, -8.089727)
distHaversine(df, centre)
# and
distm(df, centre, fun = distHaversine)

But I kept getting the error:

Error in .pointsToMatrix(x) : latitude < -90

I have two questions:

  1. How do I calculate the distance from each of my 300,000 observations in the data frame 'df' to the 'centre' point?

  2. Say I want to calculate the distance from each house to a list of schools (a smaller but still sizeable dataset, in the hundreds; see df2 below for an example). How do I calculate the distance from each house to each school, and then keep only the minimum distance?

Example school dataset:

df2 <- read.table(sep=",", col.names=c("lat", "lon"), text="
53.38271, -6.437433
53.34874, -6.131537
53.34449, -6.266856
53.34424, -6.267444
53.34648, -6.261414
53.64333, -8.208663")

Thanks in advance!

PMc

3 Answers

1

Use the distm function from the geosphere package. It calculates the distance between every pair of points taken from two matrices; in the result, each row corresponds to a point in df1 and each column to a point in df2:

library(geosphere)
distm(df1, df2)

            [,1]      [,2]       [,3]       [,4]       [,5]      [,6]
 [1,] 178968.962 213003.58 198172.550 198110.991 198746.488  20923.34
 [2,] 385376.082 414721.59 400788.464 400717.802 401428.071 246442.51
 [3,] 367573.615 397518.53 383398.252 383327.609 384038.877 224390.48
 [4,] 385203.033 414495.46 400578.857 400508.198 401218.340 246836.89
 [5,] 276963.269 302892.13 290037.267 289967.750 290660.977 194456.76
 [6,] 221966.904 244628.53 232857.426 232790.237 233455.843 190049.84
 [7,]   5028.478  29011.20  14323.587  14267.385  14857.496 203015.38
 [8,]  22432.536  11830.79   5076.573   5141.969   4505.897 220278.46
 [9,] 385376.082 414721.59 400788.464 400717.802 401428.071 246442.51
[10,] 387024.885 416408.72 402463.993 402393.330 403103.685 247508.26

As for the error you mentioned, I do not get any error when using distm on the sample data:

distm(df1, centre, fun = distHaversine)
            [,1]
 [1,]   4675.419
 [2,] 247250.726
 [3,] 225526.648
 [4,] 247555.321
 [5,] 186051.181
 [6,] 176912.553
 [7,] 189843.467
 [8,] 207320.670
 [9,] 247250.726
[10,] 248435.392
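
For the second question (the minimum distance from each house to any school), the full matrix for 300,000 houses and several hundred schools can be large, so one option is to work through the houses in chunks and keep only the row-wise minimum. The sketch below is an illustration added to this thread, not the answerer's code: nearest_dist and chunk_size are hypothetical names. It also reorders the columns to (lon, lat), because geosphere reads the first column as longitude and the second as latitude. With the columns in that order, house 7 and the first school come out at roughly 3 km rather than the 5,028 m shown in the distm(df1, df2) matrix above.

```r
library(geosphere)

# A few houses and schools from the question, stored as (lat, lon)
df1 <- data.frame(lat = c(53.543526, 51.88029, 53.37621),
                  lon = c(-8.047727, -9.583830, -6.392430))
df2 <- data.frame(lat = c(53.38271, 53.64333),
                  lon = c(-6.437433, -8.208663))

# Hypothetical helper: distance in metres from each house to its nearest school.
# Houses are processed chunk_size rows at a time to limit memory use.
nearest_dist <- function(houses, schools, chunk_size = 10000) {
  # geosphere expects longitude first, then latitude
  h <- as.matrix(houses[, c("lon", "lat")])
  s <- as.matrix(schools[, c("lon", "lat")])
  if (nrow(h) == 0) return(numeric(0))

  out <- numeric(nrow(h))
  for (start in seq(1, nrow(h), by = chunk_size)) {
    idx <- start:min(start + chunk_size - 1, nrow(h))
    # One row per house in the chunk, one column per school
    d <- distm(h[idx, , drop = FALSE], s, fun = distHaversine)
    out[idx] <- apply(d, 1, min)
  }
  out
}

res <- nearest_dist(df1, df2)
```

The same reordering applies to the single-point case: centre would need to be passed as c(lon, lat) rather than c(lat, lon).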
Felipe Alvarenga
  • Hi Felipe - sorry if it is not clear but df1 is just the first 10 observations of my 300,000 observations. If I extend this to all 300,000 observations I then get this error. So I think the above will not work if I want to calculate the distance of 300,000 houses to 500 schools. – PMc Mar 22 '18 at 20:07
  • depends on the memory you are working with. See this question for example https://stackoverflow.com/questions/46004625/how-to-compute-big-geographic-distance-matrix – Felipe Alvarenga Mar 22 '18 at 20:12
  • Thanks. The code from that page will not work for me either. So is this a case where my dataset is too large for it to work? That is worrying... – PMc Mar 23 '18 at 10:01
1

I had a similar problem. The issue was that the longitude and latitude were character columns. Converting them to numeric columns resolved the issue.
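
A minimal sketch of that check and conversion, using a made-up three-row data frame (the values and the "n/a" entry are illustrative assumptions, not from the question). Going through as.character first also covers factor columns, where as.numeric alone would return the internal level codes:

```r
# Made-up example: coordinates read in as text, with one unparseable entry
df <- data.frame(lat = c("53.543526", "51.88029", "n/a"),
                 lon = c("-8.047727", " -9.583830", "-9.488551"),
                 stringsAsFactors = FALSE)

# Inspect the column types; "character" or "factor" here signals the problem
col_classes <- sapply(df, class)

# Convert to numeric; entries that cannot be parsed become NA
df$lat <- suppressWarnings(as.numeric(as.character(df$lat)))
df$lon <- suppressWarnings(as.numeric(as.character(df$lon)))

# Rows that failed to convert and need inspecting before computing distances
bad_rows <- which(is.na(df$lat) | is.na(df$lon))
```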

Nikhil Gupta
-1

I think you have a bad latitude coordinate. Your error says there is a latitude < -90, which is not possible. Minimum latitude is -90. Do something like this to check for bad points:

badPoints <- which(df1$lat < -90)
print(df1[badPoints,])

Run this to remove the bad points:

goodDf1 <- df1[(df1$lat >= -90 & df1$lat <= 90),]
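
One more thing that may be worth checking: geosphere reads the first column as longitude and the second as latitude. If the data frame is stored as (lat, lon), as in the question, the lon column is the one being tested against the ±90 limit, so a row can trigger the error even when its lat value looks fine. The sketch below uses made-up rows to show both checks; suspect and swapped_fail are hypothetical names.

```r
# Made-up rows: one with a missing latitude, one with a longitude beyond -90
df <- data.frame(lat = c(53.54, 51.88, 52.06, NA),
                 lon = c(-8.05, -95.8, -9.49, -9.58))

# Rows that are invalid in any column order: missing or out-of-range values
suspect <- which(is.na(df$lat) | is.na(df$lon) |
                 abs(df$lat) > 90 | abs(df$lon) > 180)

# Rows whose longitude is valid but would fail if it were read as a latitude,
# which is what happens when a (lat, lon) data frame is passed to geosphere
swapped_fail <- which(!is.na(df$lon) & abs(df$lon) > 90)
```

If swapped_fail is non-empty on the full dataset, reordering the columns with df[, c("lon", "lat")] before calling distm or distHaversine is the likely fix.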
Justin Braaten
  • Hi, unfortunately this is not the case; none of the observations meet these criteria – PMc Mar 22 '18 at 20:24
  • @PMc Bummer - that would have been an easy fix. You could try looping through each row, calculating distance, until you hit the error and inspect the row it crashed on for weirdness. – Justin Braaten Mar 22 '18 at 20:33