1

I have 2 dataframes, one (df1) with latitudes and longitudes of a specific place, and one (df2) with latitude of longitude of a city with the city name. I want to (left) join the city name from df2 as a new column in df1 based on the nearest latitude/longitude.

In other words, I want to compare the latitude and longitude of df1 with those of df2 and then join the corresponding city from df2 on df1.

df1

amount latitude longitude
1 20 2
4 18 8
3 20 10
6 5 1

df2

City latitude longitude
A 16 8
B 12 12
C 20 15
D 4 28
r2evans
  • 141,215
  • 6
  • 77
  • 149
  • 1. Please provide the ```dput()``` of the data. 2. What do you mean when you say the nearest? Nearest euclidean distance? Can you provide the expected output? – Shibaprasadb Dec 23 '21 at 14:45
  • 1
    "Nearest" is likely going to fail you, I suggest you need to set an upper-limit on the distance between them. If you're really using gps coordinates, you should likely use a Haversine distance (found in `geosphere` among others). If that's not important, then there are several other questions on SO about this, perhaps https://stackoverflow.com/q/59766153/3358272? – r2evans Dec 23 '21 at 14:52

1 Answers1

1

Please find below one possible solution using the sf and dplyr libraries

Reprex

  • Code
library(sf)
library(dplyr)

# Convert the two dataframes into 'sf' object
df1_sf <- df1 %>% st_as_sf(., coords = c("longitude", "latitude"), crs = 4326)
df2_sf <- df2 %>% st_as_sf(., coords = c("longitude", "latitude"), crs = 4326)

# Left join the two dataframes based on nearest feature
results <- st_join(df1_sf, df2_sf, join = st_nearest_feature)

# Convert the results back into dataframe
coords <- results %>% 
  st_coordinates() %>% 
  as.data.frame() %>% 
  rename(., longitude = X, latitude = Y)

results <- results %>% 
  st_drop_geometry() %>% 
  as.data.frame() %>% 
  cbind(., coords)
  • Output
results
#>   amount City longitude latitude
#> 1      1    A         2       20
#> 2      4    A         8       18
#> 3      3    A        10       20
#> 4      6    B         1        5
  • Your data
df1 <- read.table(text="amount  latitude    longitude
1   20  2
4   18  8
3   20  10
6   5   1", header = TRUE)

df2 <- read.table(text="City    latitude    longitude
A   16  8
B   12  12
C   20  15
D   4   28", header = TRUE)

Created on 2021-12-23 by the reprex package (v2.0.1)

lovalery
  • 4,524
  • 3
  • 14
  • 28
  • Thanks a lot! That works. A bit weird is that is returns an sum of "Amount" that is 98,1% of the sum of the original data after joining them. Do you think that happens because of the allocation of cases to certain latitudes/longitudes? Also, I see you use crs = 4326. What would that mean? I can't seem to find an answer in the help section of the package – T. Troglodyte Dec 28 '21 at 15:34
  • Hi @T. Troglodyte. Thanks a lot for your feedback. Regarding your first question, actually, it's hard to answer without seeing your "real" data. That said, it is indeed weird: normally you should keep all your original data as in the reprex above. Regarding your second question, it's simpler: `crs = 4326` allows you to specify the geographic coordinate system (crs = Coordinates Reference System). And `4326` is the EPSG code which means that the reference system is the `WGS1984` geographic coordinate system (i.e. the one usually used with GPS). Hope this helps. Cheers. – lovalery Dec 28 '21 at 21:13