
I have two data frames in which observations are geographic locations defined by a latitude/longitude combination. For each point in df1 I would like to get the closest point in df2 and the associated value. I know how to do that by computing all the possible distances (using e.g. the gdist function from the Imap package) and taking the index of the smallest one. The problem is that this is prohibitively slow, as df1 has 1,000 rows and df2 some 15 million.

Do you have an idea of how I could reach my goal without computing all the distances? Maybe there is a way to limit the necessary calculations (for instance using the difference in latitude/longitude values)?

Thanks for helping,

Val

Here's what df1 looks like:

   Latitude Longitude
1  56.76342  8.320824
2  54.93165  9.115982
3  55.80685  9.102455
4  57.27000  9.760000
5  56.76342  8.320824
6  56.89333  9.684435
7  56.62804  8.571573
8  56.64850  8.501947
9  55.40596  8.884374
10 54.89786 11.880828

then df2:

   Latitude Longitude       Value
1  41.91000 -4.780000       40500
2  41.61063 14.750832       13500
3  41.91000 -4.780000        4500
4  38.70000 -2.350000       28500
5  52.55172  0.088622        1500
6  39.06000 -1.830000       51000
7  41.91000 -4.780000       49500
8  48.00623 -4.389639       12000
9  56.24889 -3.666940       27000
10 42.72000 -3.750000       49500
vaaaaaal
  • The simple way: use a GIS-enabled database (e.g. the PostGIS extension for PostgreSQL). You may also use GIS software (your task is one of the most common in GIS, besides visualization). Otherwise, "index" the data: build sectors and check only the data in the same sector, possibly also searching the nearby sectors [possibly with more levels]. For sure there are better algorithms; I always preprocess data. – Giacomo Catenazzi Nov 27 '20 at 12:21
  • This might start you in the right direction: https://stackoverflow.com/questions/57525670/find-closest-points-from-data-set-b-to-point-in-data-set-a-using-lat-long-in-r/57526673#57526673 – Dave2e Nov 27 '20 at 15:21
  • Another option depending on your application and desired accuracy is round the latitude and longitude down from 5 digits of precision to 2 or 3 digits and then group the matching starting points together. – Dave2e Nov 27 '20 at 15:32
  • Thanks for it! I managed to do it by using the `raster` package. I got the regions corresponding to each point in each database and then computed the distances between each particular point from `df1` and the points in `df2` associated with the same region. – vaaaaaal Dec 04 '20 at 16:12
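Dave2e's rounding suggestion can be sketched in a few lines of base R. The one-decimal precision is an assumed tuning choice, and the tiny data frames below are made-up stand-ins for the real ones; note that points near a cell boundary may land in a neighboring cell, so in practice you would also check adjacent cells:

```r
# Tiny made-up samples standing in for the real frames
df1 <- data.frame(Latitude = c(56.76342, 54.93165),
                  Longitude = c(8.320824, 9.115982))
df2 <- data.frame(Latitude = c(41.91000, 56.76000),
                  Longitude = c(-4.78000, 8.32000),
                  Value = c(40500, 1500))

# Group points by rounded coordinates (~0.1 degree cells; precision is a guess)
df1$cell <- paste(round(df1$Latitude, 1), round(df1$Longitude, 1))
df2$cell <- paste(round(df2$Latitude, 1), round(df2$Longitude, 1))

# Only pairs sharing a cell are candidate nearest neighbours:
candidates <- merge(df1, df2, by = "cell", suffixes = c(".1", ".2"))
```

The exact distance then only needs to be computed within each matched group, which is far smaller than the full cross product.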

2 Answers


Split the second data frame into chunks of equal size

Then search only the chunks within a reasonable distance of your point. You will basically be drawing a checkerboard on a map. Your point will fall into one of these squares, so you only need to search that square and a few neighboring ones to be safe.

Naive brute-force search costs rows(df1) * rows(df2) distance computations. In our case that is 1,000 * 15M = 15 billion operations, times the cost of each distance computation.

So how do we split the data into chunks?

  1. sort by latitude
  2. sort by longitude
  3. take equally spaced chunks

A sort takes ~N log(N) operations. N is 15M in our case, so with log2(15M) ≈ 24, each sort costs roughly 24 * 15M ≈ 360M operations. Splitting into chunks is then linear, ~15M operations, maybe a few times over.
Once this separation is done, each chunk holds total_points / chunk_count points, assuming your points are distributed equally. The number of chunks is determined by the chunk size you pick at the start: chunk_count = total_area / (chunk_side ^ 2).

Ideally you want to balance the number of chunks with the number of points in each chunk so that both are ~ sqrt(points_total).

Each of the thousand searches will now take only chunk_count + 9 * points_in_chunk operations (if we want to be super safe and search the chunk our point lands in plus all eight surrounding ones). With both chunk_count and points_in_chunk around sqrt(15M) ≈ 4K, that is 1000 * 10 * sqrt(15M) ≈ 1000 * 40K instead of 1000 * 15M, an improvement by a factor of several hundred.

Note that this improvement grows as the second set gets larger, and shrinks if you choose the chunk size poorly.
For a further improvement, you can iterate this once or twice more, making chunks within chunks. The logic is similar.
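The checkerboard idea above can be sketched in base R. This is a minimal sketch under two assumptions: coordinates are treated as planar (squared degree differences are only a ranking proxy), and the 0.5-degree cell size is a tunable guess:

```r
# Build a grid index: bucket row indices of df by their grid cell.
build_index <- function(df, cell = 0.5) {
  keys <- paste(floor(df$Latitude / cell), floor(df$Longitude / cell))
  list(df = df, cell = cell, buckets = split(seq_len(nrow(df)), keys))
}

# Look only in the query's cell and its 8 neighbours; returns the row
# index of the closest candidate, or NA if all 9 cells are empty (a real
# implementation would then widen the search ring).
nearest <- function(idx, lat, lon) {
  cells <- as.vector(outer(floor(lat / idx$cell) + (-1:1),
                           floor(lon / idx$cell) + (-1:1), paste))
  cand <- unlist(idx$buckets[intersect(cells, names(idx$buckets))],
                 use.names = FALSE)
  if (length(cand) == 0) return(NA_integer_)
  # Squared planar distance is enough to rank nearby candidates:
  d2 <- (idx$df$Latitude[cand] - lat)^2 + (idx$df$Longitude[cand] - lon)^2
  cand[which.min(d2)]
}
```

For the final answer you would recompute the few surviving candidates with a proper geodesic distance (e.g. gdist or distGeo), since planar degree differences distort with latitude.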

Shamis

The distm function of the geosphere package will help you:

library(dplyr)      # for %>% and select()
library(geosphere)

# Make sure to put longitude first and then latitude:
df <- df %>% select(Longitude, Latitude)

distm(as.matrix(df), as.matrix(df), fun = distGeo)

Remember, the distm function expects matrix-class objects. You will obtain a 10x10 matrix of distances.
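To answer the actual question (nearest df2 point for each df1 row), the same function can be applied between the two frames with which.min per row. A sketch with tiny made-up samples; note it materializes the full distance matrix, so it is fine for small df2 but will not scale to 15M rows:

```r
library(geosphere)

# Tiny made-up samples standing in for the real frames
df1 <- data.frame(Latitude = 56.76342, Longitude = 8.320824)
df2 <- data.frame(Latitude = c(41.91, 56.24889),
                  Longitude = c(-4.78, -3.66694),
                  Value = c(40500, 27000))

# distm wants longitude first; rows of the result correspond to df1 points
d <- distm(as.matrix(df1[, c("Longitude", "Latitude")]),
           as.matrix(df2[, c("Longitude", "Latitude")]),
           fun = distGeo)
nearest_idx <- apply(d, 1, which.min)  # closest df2 row per df1 row
df1$Value <- df2$Value[nearest_idx]    # attach the associated value
```

This is essentially the brute force the asker wants to avoid, but combined with a pre-filtering step (such as the chunking above) it only ever runs on a few candidates at a time.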

Marcos Pérez