
I have two data frames: one containing customers' locations (derived from their IP addresses) and the other containing store locations.

I would like to compute the distance (using distHaversine) from each customer's location to the closest store. I imagine applying distHaversine to the customer's lat and long against each store's lat and long, and then using which.min to pick the smallest result.

Below is a snapshot of what my data looks like.

    customer_data
    customer_id  lat  long
    1             50    33
    2             44   -21
    3            129   -22

    store_data
    store   lat  long
    1        33    22
    2      -111  -139
    3        23    30
kizunairo

1 Answer


This answer was helpful in looking at potential solutions.

The first step would be to create a distance matrix with distm based on both of your data frames. You can select distHaversine method if you wish, with others available.

Then you can determine which store is closest for each customer with max.col: negating mat turns the smallest distance in each row into the largest value, so max.col(-mat) returns the column index of the nearest store.
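To illustrate the max.col step in isolation, here is a minimal sketch on a made-up 3 x 3 matrix (the values are illustrative, not from the question). One detail worth knowing: max.col breaks ties at random by default, so passing ties.method = "first" makes the result reproducible and consistent with apply(m, 1, which.min).

```r
# Toy "distance" matrix: rows are customers, columns are stores.
# The values are made up purely for illustration.
m <- matrix(c(5, 2, 9,
              4, 4, 7,
              8, 6, 1), nrow = 3, byrow = TRUE)

# Column index of the smallest value in each row.
# ties.method = "first" picks the first column on a tie (row 2 here);
# the default, "random", could return either tied column.
idx_maxcol <- max.col(-m, ties.method = "first")

# Same result, computed row by row with which.min.
idx_apply <- apply(m, 1, which.min)

print(idx_maxcol)  # 2 1 3
print(idx_apply)   # 2 1 3
```

On a large matrix, max.col is typically the faster of the two because it avoids calling an R function once per row.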

You can also add the distance to that store from the matrix (distm returns meters; the code below converts to kilometers).

I made up some example data from the U.S. and changed the store names to A, B, C for clarity in the answer.

library(geosphere)

# create a distance matrix
mat <- distm(customer_data[,c('long','lat')], store_data[,c('long','lat')], fun=distHaversine)

# assign the store name to customer_data based on shortest distance in the matrix
customer_data$locality <- store_data$store[max.col(-mat)]

# add distance in km for that store
customer_data$nearest_dist <- apply(mat, 1, min)/1000

Output

  customer_id  lat  long locality nearest_dist
1           1 41.8  87.6        A     313.5497
2           2 40.7  74.0        B     440.4867
3           3 36.8 119.4        C     784.7909

Data

customer_data <- data.frame(
  customer_id = c(1, 2, 3),
  lat = c(41.8, 40.7, 36.8),
  long = c(87.6, 74, 119.4)
)

store_data <- data.frame(
  store = c("A", "B", "C"),
  lat = c(44.5, 44.5, 43.8),
  long = c(88.7, 72.5, 120.5)
)
Ben
  • I have tried to replicate the effort. My R session would abort after a minute though... My two dataframes contain 689k rows and 26k rows. – kizunairo Feb 11 '20 at 16:00
  • Ah, I see - I didn't realize that. There are a number of other answers related to calculating large distance matrices (or other approaches to finding the closest point when both sets have many thousands of rows). Perhaps one of these might be helpful: [distm with big memory](https://stackoverflow.com/questions/44313983/r-distm-with-big-memory), or [distm for big data](https://stackoverflow.com/questions/49863185/r-distm-for-big-data-calculating-minimum-distances-between-two-matrices) – Ben Feb 11 '20 at 17:19
  • Also might take a look at this referenced [blog post](https://privefl.github.io/blog/performance-when-algorithmics-meets-mathematics/). – Ben Feb 11 '20 at 18:21
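For data of the size mentioned in the comments (689k customers by 26k stores), the full matrix would hold about 1.8e10 doubles, roughly 143 GB, which likely explains the aborted session. One possible workaround, sketched below, is to process customers in chunks so that only a chunk_size x n_stores matrix exists at any time. The helper name nearest_store_chunked and the chunk_size parameter are hypothetical and not part of geosphere; this is an untuned sketch that bounds memory use, not run time, and it is not taken from the linked answers.

```r
library(geosphere)

# Hypothetical helper: nearest store per customer, computed chunk by chunk.
# Assumes both data frames have 'long' and 'lat' columns, as in the answer.
nearest_store_chunked <- function(customers, stores, chunk_size = 1000) {
  n <- nrow(customers)
  stopifnot(n > 0, chunk_size >= 1)

  store_xy <- as.matrix(stores[, c("long", "lat")])
  idx      <- integer(n)   # column index of the nearest store
  dist_km  <- numeric(n)   # distance to that store, in km

  for (start in seq(1, n, by = chunk_size)) {
    rows <- start:min(start + chunk_size - 1, n)
    cust_xy <- as.matrix(customers[rows, c("long", "lat")])

    # Distance matrix for this chunk only (in meters).
    mat <- distm(cust_xy, store_xy, fun = distHaversine)

    # Nearest store per row; "first" makes ties deterministic.
    idx[rows] <- max.col(-mat, ties.method = "first")
    dist_km[rows] <- mat[cbind(seq_along(rows), idx[rows])] / 1000
  }

  customers$locality     <- stores$store[idx]
  customers$nearest_dist <- dist_km
  customers
}

# Usage with the example data from the answer (tiny chunk size to
# exercise the chunking logic).
customer_data <- data.frame(
  customer_id = c(1, 2, 3),
  lat = c(41.8, 40.7, 36.8),
  long = c(87.6, 74, 119.4)
)
store_data <- data.frame(
  store = c("A", "B", "C"),
  lat = c(44.5, 44.5, 43.8),
  long = c(88.7, 72.5, 120.5)
)
result <- nearest_store_chunked(customer_data, store_data, chunk_size = 2)
print(result)
```

With 26k stores, a chunk of 1,000 customers needs about 208 MB for its matrix, so chunk_size can be adjusted to fit the available memory.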