I am predicting latitude and longitude coordinates. For a predicted latitude, for example, I want to compare the prediction with another variable that holds the centroids of the clusters I built on latitude and longitude, and return the cluster (stored in a third variable) whose centroid is closest to the prediction. I based my setup on another Stack Overflow post, but I don't get the right cluster back. Can someone help me see what I did wrong?

I want the `pred_cluster_test` variable to contain the cluster (`ClusterEnd`) belonging to the `ClusterEnd_LatitudeCenter` that is closest to the latitude prediction (`predictions_test`).

df <- dfTraining %>%
  group_by(TripID) %>%
  mutate(pred_cluster_test = case_when(
    ClusterEnd_LatitudeCenter == predictions_test ~ ClusterEnd[ClusterEnd_LatitudeCenter],
    TRUE ~ ClusterEnd[sapply(ClusterEnd_LatitudeCenter,
                             function(x) which.min(x - predictions_test))]
  ))

This is what the data looks like:

structure(list(EndLatitude = c(38.26, 38.218, 38.255, 38.258, 
38.213, 38.215), EndLongitude = c(-85.75, -85.754, -85.746, -85.751, 
-85.751, -85.757), ClusterEnd = c(1, 4, 1, 5, 4, 4), ClusterEnd_LatitudeCenter = c(38.25629, 
38.21723, 38.25629, 38.25322, 38.21723, 38.21723), ClusterEnd_LongitudeCenter = c(-85.74133, 
-85.75955, -85.74133, -85.75783, -85.75955, -85.75955), predictions_test = c(`1` = 38.2407296518939, 
`2` = 38.2326115950784, `3` = 38.2428487622735, `4` = 38.2449069816005, 
`5` = 38.234314694847, `6` = 38.2347388488934), pred_cluster_test = c(38.25629, 
38.21723, 38.25629, 38.25322, 38.21723, 38.21723)), row.names = c(NA, 
-6L), class = c("tbl_df", "tbl", "data.frame"))
Klaart
  • We need example data of the predicted points and the centroids to get the required structure. Which distance to the centroids do you need? Haversine? Euclidean? – danlooo Mar 30 '22 at 11:10
  • @danlooo Thanks for your answer. I want to use the Euclidean distance for the centroids, but for now I just tried the minimal absolute value. I inserted a picture of my data! – Klaart Mar 30 '22 at 13:22
  • See [here](https://stackoverflow.com/questions/49994249/example-of-using-dput) on how to provide example data. Is there a reason why you are using Euclidean distance instead of Haversine? – hrvg Mar 30 '22 at 14:15
  • @hrvg Thanks! Yes, because I used k-means clustering and the data points are from one city, not from all over the world, so Euclidean distance is okay to use. But that is how I determined the clusters; for now I just hope that someone can help me with the minimal absolute distance. – Klaart Mar 30 '22 at 14:23

1 Answer

Provided that I understand correctly what is expected, the following may work:

library(dplyr)

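foo <- function(x, cluster_coords) {
  # Pair the single prediction x with every centroid coordinate (x is recycled)
  mat <- cbind(x, cluster_coords)
  # Row-wise distance between the prediction and each centroid
  distance <- apply(mat, MARGIN = 1, FUN = dist, method = "euclidean")
  # Index of the closest centroid
  which.min(distance)
}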

df %>% 
  mutate(
    cluster_pred_test = ClusterEnd[
    sapply(
      predictions_test,
      function(x) foo(x, ClusterEnd_LatitudeCenter)
      )
    ]
  ) %>%
  pull(cluster_pred_test)
[1] 5 4 5 5 4 4
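
As a side note on the attempt in the question: it takes `which.min()` of a signed difference, so it picks the most negative difference rather than the smallest gap (it also appears to loop over the centroids instead of the predictions). A minimal sketch of the difference, using the three distinct latitude centers and the first prediction from the example data; the variable names here are only for illustration:

```r
# Distinct latitude centers and the first prediction from the example data
centers <- c(38.25629, 38.21723, 38.25322)
pred <- 38.2407296518939

# Signed difference: which.min() returns the most negative value, not the nearest center
signed_idx <- which.min(centers - pred)

# Absolute difference: which.min() returns the center closest to the prediction
abs_idx <- which.min(abs(centers - pred))

signed_idx  # 2, i.e. 38.21723
abs_idx     # 3, i.e. 38.25322
```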

You may want to extend `foo` to use both of your coordinates, and look into the `dplyr::group_map` and `dplyr::group_modify` functions, which may help you run grouped operations efficiently.
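
One way the two-coordinate version could look is sketched below in base R. This is an assumption-laden illustration, not the method above: `nearest_cluster` is a hypothetical helper, the centroid table is built by hand from the distinct centers in the example data, and because the example only contains latitude predictions, the `EndLatitude`/`EndLongitude` values stand in for predicted points.

```r
# Distinct centroids taken from the example data (one row per cluster)
centroids <- data.frame(
  cluster = c(1, 4, 5),
  lat = c(38.25629, 38.21723, 38.25322),
  lon = c(-85.74133, -85.75955, -85.75783)
)

# Hypothetical helper: label of the centroid nearest to each (lat, lon) point
nearest_cluster <- function(lat, lon, centroids) {
  # outer() builds n_points x n_centroids matrices of coordinate differences
  d_lat <- outer(lat, centroids$lat, "-")
  d_lon <- outer(lon, centroids$lon, "-")
  # Squared Euclidean distance has the same minimizer as the distance itself
  sq_dist <- d_lat^2 + d_lon^2
  centroids$cluster[apply(sq_dist, 1, which.min)]
}

# Stand-in points: EndLatitude / EndLongitude from the example data
pts_lat <- c(38.26, 38.218, 38.255, 38.258, 38.213, 38.215)
pts_lon <- c(-85.75, -85.754, -85.746, -85.751, -85.751, -85.757)

pred_cluster <- nearest_cluster(pts_lat, pts_lon, centroids)
```

Treating degrees of latitude and longitude as plain Euclidean coordinates matches the k-means setup described in the comments; for points spread over a larger area, a Haversine distance would be the safer choice.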

hrvg
  • Thank you so much, I am trying this right now. However, it takes a long time to run. Do you know any way to make this faster? I can't figure out how to do it myself. – Klaart Mar 30 '22 at 17:05
  • My guess is that you have a lot of `TripID` values. If that's the case, you can partition the `data.frame` with `multidplyr`: https://multidplyr.tidyverse.org/ – hrvg Mar 30 '22 at 21:05
  • Do you mean to get the tripid out of the dataframe? And where do you do this? – Klaart Mar 31 '22 at 04:24
  • Sorry, if you have time, could you explain it to me? I get this error: 'Error in partition(., TripId) : could not find function "partition"'. But I installed everything correctly. – Klaart Mar 31 '22 at 08:10
  • I am not familiar with this particular error. You can also work with the `foreach` and `doParallel` packages to perform parallel operations at a lower level of abstraction than `multidplyr`. As this is not related to your current question, you should probably accept the answer if it is an adequate solution and open another, thoughtful question for your other problem. A quick web search leads to multiple tutorials (e.g., [R Bloggers](https://www.r-bloggers.com/2017/01/speed-up-your-code-part-2-parallel-processing-financial-data-with-multidplyr-tidyquant-2/)) – hrvg Mar 31 '22 at 14:30
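
On the speed question raised in the comments, something to try before parallelizing is to compare each prediction against the distinct centroids only, instead of against the centroid value repeated on every row. The sketch below is a base R illustration under that assumption, rebuilding just the relevant columns of the example data; it is not a benchmarked solution.

```r
# Relevant columns of the example data from the question
df <- data.frame(
  ClusterEnd = c(1, 4, 1, 5, 4, 4),
  ClusterEnd_LatitudeCenter = c(38.25629, 38.21723, 38.25629,
                                38.25322, 38.21723, 38.21723),
  predictions_test = c(38.2407296518939, 38.2326115950784, 38.2428487622735,
                       38.2449069816005, 38.234314694847, 38.2347388488934)
)

# One row per cluster instead of one row per trip
centers <- unique(df[c("ClusterEnd", "ClusterEnd_LatitudeCenter")])

# n_predictions x n_clusters matrix of absolute latitude differences
abs_diff <- abs(outer(df$predictions_test, centers$ClusterEnd_LatitudeCenter, "-"))

# Cluster label of the nearest center for every prediction
df$pred_cluster_test <- centers$ClusterEnd[apply(abs_diff, 1, which.min)]
```

With n rows and k clusters this builds an n x k matrix rather than doing n x n comparisons, which should matter when k is much smaller than n.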