I am new to R and have a very simple question (I hope). I did my homework to take guidance from examples presented here and here, but am unable to implement the idea in my case.
I've two data sets, df1 and df2. df1 has three columns (latitude, longitude, and monetary returns) and million of rows, and looks like (hypothetical):
df1.lat df1.long df1.ret
48.774768 9.145932 2100
48.771241 8.123451 3200
48.778512 7.145124 1200
48.454541 8.124512 7000
. . .
. . .
. . .
. . .
df2 has hundred of columns (latitude1, longitude1, latitude2, longitude2, A, B, C, and so on...) and 200 rows, and looks like (hypothetical):
df2.lat1 df2.long1 df2.lat2 df2.long2 A
48.764111 8.115912 48.781431 9.152142 2
. . . . .
. . . . .
. . . . .
I would like to merge df1 and df2 if the following two conditions are exactly met:
- df1.lat is within df2.lat1 and df2.lat2
- df1.long is within df2.long1 and df2.long2
If the above two conditions are met, I would like to extract the corresponding row of df2 and add it to df1. If the above two conditions are not met, I will drop the specific row from df1. Given the data I have, all the rows in df2 should match with some of the rows in df1 but not all rows in df1 may match with df2.
The result will look like:
df1.lat df1.long df1.ret df2.lat1 df2.long1 df2.lat2 df2.long2 A
48.774768 9.145932 2100 48.764111 8.115912 48.78143 9.152142 2
48.771241 8.123451 3200 48.764111 8.115912 48.78143 9.152142 2
. . . . . . .
. . . . . so on....
I tried using the sqldf package as:
install.packages("sqldf")
library(sqldf)
mergetest = sqldf("select * from df1 f1 inner join df2 f2 on (df2.lat1 <= df1.lat <= df2.lat2 and df2.long1 <= df1.long <= df2.long2) ")
When i try the above code, R returns mergetest with number of columns equal to sum of columns in df1 and df2 but 0 rows.
Any help will highly be appreciated. Thanks a lot!
P.S: Note that many of the longitudes can be negative so the matching needs to account for that. A latitude value in df1 within the latitude values in df2 means that the latitude point in df1 exists inside the range defined in df2.