I am working with the R programming language.
Suppose I have the following two data frames:
set.seed(123)
df_1 <- data.frame(
name_1 = c("john", "david", "alex", "kevin", "trevor", "xavier", "tom", "michael", "troy", "kelly", "chris", "henry", "taylor", "ryan", "peter"),
lon = rnorm(15, mean = -74.0060, sd = 0.01),
lat = rnorm(15, mean = 40.7128, sd = 0.01)
)
df_2 <- data.frame(
name_2 = c("matthew", "tyler", "sebastian", "julie", "anna", "tim", "david", "nigel", "sarah", "steph", "sylvia", "boris", "theo", "malcolm"),
lon = rnorm(14, mean = -74.0060, sd = 0.01),
lat = rnorm(14, mean = 40.7128, sd = 0.01)
)
My Problem: For each person in df_1, I am trying to find the 5 closest people (by haversine distance) in df_2 and record various distance statistics (e.g. mean, median, max, min, standard deviation).
Here is my own attempt at solving this problem.
First, I defined a function that calculates the haversine distance between a single pair of points:
## PART 1
library(geosphere)
haversine_distance <- function(lon1, lat1, lon2, lat2) {
distHaversine(c(lon1, lat1), c(lon2, lat2))
}
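(For reference, this returns the distance in metres, since distHaversine uses an Earth radius in metres by default; e.g. haversine_distance(df_1$lon[1], df_1$lat[1], df_2$lon[1], df_2$lat[1]) gives the distance between john and matthew.)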
Then, I used a nested loop to calculate all pairwise comparisons:
## PART 2
# Create a matrix to store results
distances <- matrix(nrow = nrow(df_1), ncol = nrow(df_2))
# calculate the distances
for (i in 1:nrow(df_1)) {
for (j in 1:nrow(df_2)) {
distances[i, j] <- haversine_distance(df_1$lon[i], df_1$lat[i], df_2$lon[j], df_2$lat[j])
}
}
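(As an aside, while writing this up I noticed that distHaversine appears to be vectorized, and that geosphere also provides distm(); I believe either could replace part or all of the double loop, though I haven't benchmarked this myself:)
# Untested sketches of vectorized alternatives:
# 1) replace only the inner j-loop: one df_1 point against all of df_2 at once
for (i in 1:nrow(df_1)) {
  distances[i, ] <- distHaversine(c(df_1$lon[i], df_1$lat[i]),
                                  cbind(df_2$lon, df_2$lat))
}
# 2) replace the whole loop: full cross-distance matrix in one call
distances <- distm(as.matrix(df_1[, c("lon", "lat")]),
                   as.matrix(df_2[, c("lon", "lat")]),
                   fun = distHaversine)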
# Create the final data frame of all pairs and their distances
final <- data.frame(
  name_1 = rep(df_1$name_1, each = nrow(df_2)),
  lon_1 = rep(df_1$lon, each = nrow(df_2)),
  lat_1 = rep(df_1$lat, each = nrow(df_2)),
  name_2 = rep(df_2$name_2, nrow(df_1)),
  lon_2 = rep(df_2$lon, nrow(df_1)),
  lat_2 = rep(df_2$lat, nrow(df_1)),
  # transpose first: c() flattens column-wise, but the rep() ordering above is row-wise
  distance = c(t(distances))
)
Finally, I kept the 5 minimum distances per person:
## PART 3
# Sort by person then distance, and keep the 5 smallest distances per person
final <- final[order(final$name_1, final$distance), ]
final <- final[ave(final$distance, final$name_1, FUN = seq_along) <= 5, ]
# Calculate summary statistics for each unique person in final$name_1
final_summary <- aggregate(distance ~ name_1,
data = final,
FUN = function(x) c(min = min(x),
max = max(x),
mean = mean(x),
median = median(x),
sd = sd(x)))
# flatten the matrix column produced by aggregate into ordinary columns
final_summary <- do.call(data.frame, final_summary)
names(final_summary)[-1] <- c("min_distance", "max_distance", "mean_distance", "median_distance", "sd_distance")
final_summary$closest_people <- tapply(final$name_2,
final$name_1,
FUN = function(x) paste(sort(x), collapse = ", "))
# break closest_people column into multiple columns
n <- 5
closest_people_split <- strsplit(final_summary$closest_people, ", ")
final_summary[paste0("closest_", seq_len(n))] <- do.call(rbind, closest_people_split)
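(One thing I noticed about my own code: because of the sort(x), closest_1 through closest_5 end up in alphabetical order rather than in order of proximity. Since final is already sorted by distance within each name_1, I believe dropping the sort() would keep them ordered nearest-first instead:)
# variant: keep the 5 names ordered by distance (nearest first) instead of alphabetically
final_summary$closest_people <- tapply(final$name_2,
                                       final$name_1,
                                       FUN = function(x) paste(x, collapse = ", "))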
My Question: The above code seems to work, but I am interested in improving the speed of this code (i.e. PART 2) when df_1 and df_2 become very large. As such, I am looking into options involving parallel computing, using functionality such as doParallel, parLapply, SNOW, etc.
As I am not overly familiar with this, I tried an option using mclapply from the parallel package (https://www.rdocumentation.org/packages/parallel/versions/3.4.1/topics/mclapply):
library(parallel)
# calculate the distances in parallel; note that mclapply forks the R session,
# so it won't actually parallelise on Windows
dist_list <- mclapply(1:nrow(df_1), function(i) {
  sapply(1:nrow(df_2), function(j) {
    haversine_distance(df_1$lon[i], df_1$lat[i], df_2$lon[j], df_2$lat[j])
  })
})
# mclapply returns a list with one vector per row of df_1; bind back into a matrix
distances <- do.call(rbind, dist_list)
The code seems to have run, but I am not sure if what I have done is correct, or whether it is actually improving the speed of this code.
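For completeness, based on the parallel package documentation, I also drafted a socket-cluster version with parLapply (which I believe should also work on Windows), though I am equally unsure about it:
library(parallel)
# start a socket cluster, leaving one core free (my untested sketch)
cl <- makeCluster(detectCores() - 1)
# workers are fresh R sessions, so ship them the data and the helper function
clusterExport(cl, c("df_1", "df_2", "haversine_distance"))
clusterEvalQ(cl, library(geosphere))
dist_list <- parLapply(cl, 1:nrow(df_1), function(i) {
  sapply(1:nrow(df_2), function(j) {
    haversine_distance(df_1$lon[i], df_1$lat[i], df_2$lon[j], df_2$lat[j])
  })
})
stopCluster(cl)
# one vector per row of df_1, bound back into the distances matrix
distances <- do.call(rbind, dist_list)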
Can someone please show me how to do this? Is my parallel attempt correct?
Thanks!