0

I have 2 data.frames with zipcodes. One is of houses and the other is of weather stations. I need to merge the 2 data.frames by proximity of the zipcodes, therefore I have the temperature data of the weather stations merged with the houses closest to it. I could not merge by the number of the zipcodes because it is not the same on the 2 datas.frames, as some houses do not have weather stations in the same zipcode.

Is that possible in R?

Thank you very much.

  • If you could please share a reproducible example, that would help to address your question. I recommend the `merge` function in R; you can choose to retain all values of houses or weather stations or their intersection in the merged data. https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – TJ87 Sep 24 '19 at 17:43
  • The closest thing you have to a location for houses and weather stations is zip code? Can't you get lat/lon? Any how, reproducible example... – cory Sep 24 '19 at 17:44
  • Are you trying to get the closest weather station to the house? If you don't have lat/lon this post might help : https://stackoverflow.com/questions/17361909/determining-the-distance-between-two-zip-codes-alternatives-to-mapdist – Mike Sep 24 '19 at 17:46
  • I do not know how to make a reproducible example because my data is confidential. But it is really simple. One column is the houses and the other the zipcodes. And the data. frame of the stations data has the station name, number, zipcodes, date and temperature. – Ana Paula Franzoni Sep 24 '19 at 17:52
  • cory, yes. I only have lat and long of the stations, not from the houses. – Ana Paula Franzoni Sep 24 '19 at 17:53
  • Mike, I saw this one, thank you. But my data has more than 77 millions rows, so I will not be able to match them by distance. – Ana Paula Franzoni Sep 24 '19 at 17:54
  • You can make a reproducible example by giving us a few rows offake data in the same structure. Something like `houses = data.frame(zip = c(98103, 98195, 90210))` and `weather_stations = data.frame(zip = c(98103, 98115, 77079))`. – Gregor Thomas Sep 24 '19 at 19:36
  • I'm confused when you say in a comment "*I will not be able to match them by distance*". Proximity means nearness in distance, so it seems like you are asking to match them by distance... I would suggest first making a list of all house zip codes that are not in the weather stations, and then using Mike's link to find the closest weather station zip code for each. – Gregor Thomas Sep 24 '19 at 19:39
  • I am sorry Gregor, I meant one by one, as I assumed by Mike's link, because my data is big. – Ana Paula Franzoni Sep 25 '19 at 02:23

1 Answers1

0

This is based on the assumption that the list of weather stations is small and hence won't affect computing time:

library(dplyr)
library(purrr)

df_with_homes_stats = df_with_homes %>% 
  dplyr::mutate(closestZip = purrr::map_dbl(.x=homeZip,~df_with_stats$stationZip[which.min(abs(.x-df_with_stats$stationZip))])) %>% 
  dplyr::left_join(df_with_stats,by=c('closestZip'='stationZip'))

Here, 'df_with_homes' is the large dataset with zipcodes of each home and 'df_with_stats' corresponds to the zipcodes of the stations. The underlying assumption here the distance between zip codes is proportional to the mathematical difference between them.

Let me know if it works.

  • @makeshift_programmer I am sorry for the delay. But I could not apply your code yet, because I got the zipcode of the stations using the function revgeo (because I just had the lat and long) and the column with the zipcodes seems indexed with others, so I could not put it's name after the $. I am trying to solve this and let you know. – Ana Paula Franzoni Sep 27 '19 at 02:14