-1

I am new to R and have a very simple question (I hope). I did my homework to take guidance from examples presented here and here, but am unable to implement the idea in my case.

I've two data sets, df1 and df2. df1 has three columns (latitude, longitude, and monetary returns) and million of rows, and looks like (hypothetical):

df1.lat      df1.long    df1.ret
48.774768    9.145932    2100
48.771241    8.123451    3200
48.778512    7.145124    1200
48.454541    8.124512    7000
   .            .          .
   .            .          .
   .            .          .
   .            .          .

df2 has hundred of columns (latitude1, longitude1, latitude2, longitude2, A, B, C, and so on...) and 200 rows, and looks like (hypothetical):

    df2.lat1            df2.long1         df2.lat2       df2.long2   A  
   48.764111           8.115912          48.781431       9.152142    2
        .                  .                 .               .       .
        .                  .                 .               .       .
        .                  .                 .               .       .

I would like to merge df1 and df2 if the following two conditions are exactly met:

  • df1.lat is within df2.lat1 and df2.lat2
  • df1.long is within df2.long1 and df2.long2

If the above two conditions are met, I would like to extract the corresponding row of df2 and add it to df1. If the above two conditions are not met, I will drop the specific row from df1. Given the data I have, all the rows in df2 should match with some of the rows in df1 but not all rows in df1 may match with df2.

The result will look like:

 df1.lat   df1.long   df1.ret   df2.lat1   df2.long1  df2.lat2  df2.long2 A
48.774768  9.145932    2100     48.764111  8.115912   48.78143  9.152142  2
48.771241  8.123451    3200     48.764111  8.115912   48.78143  9.152142  2
  .           .          .        .            .         .          .       

  .           .           .       .            .    so on....

I tried using the sqldf package as:

install.packages("sqldf")
library(sqldf)
mergetest = sqldf("select * from df1 f1 inner join df2 f2 on (df2.lat1 <= df1.lat <= df2.lat2 and df2.long1 <= df1.long <= df2.long2) ")

When i try the above code, R returns mergetest with number of columns equal to sum of columns in df1 and df2 but 0 rows.

Any help will highly be appreciated. Thanks a lot!

P.S: Note that many of the longitudes can be negative so the matching needs to account for that. A latitude value in df1 within the latitude values in df2 means that the latitude point in df1 exists inside the range defined in df2.

B.W Niazi
  • 29
  • 5
  • 1
    It would be appreciated if you can post what you have tried. – pe-perry Oct 11 '17 at 03:37
  • To get useful answer for your question. Please post your question according to the guideline posted in https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Santosh M. Oct 11 '17 at 03:41
  • Include reproducible dataset in your question. To write solution it is necessary to have valid input data sets. So, one can test its code on your data and get the desired output you are seeking. To get the data sets, you can just use `dput(head(df))` for your both data frame and post its output in the question. – Santosh M. Oct 11 '17 at 05:14
  • Since you're working with coordinates, taking a geographic approach may make things more straightforward. It sounds like you just want to know if a point (from df1) falls into a bounding box/polygon (from df2). `point.in.polygon` and `over` from the `sp` package will help identify which df1 rows match which df2 rows. The merge will then be very simple. – Jul Oct 11 '17 at 05:51

1 Answers1

0

A naive approach could be doing something like this:

new_df <- lapply(1:nrow(df2), function(i) {
    row <- df2[i,]
    logic1 <- df1[,1] < row[,3] & df1[,1] > row[,1]
    logic2 <- df1[,2] < row[,4] & df1[,2] > row[,2]
    logic <- logic1 & logic2

    tmp_df1 <- df1[logic, ]

    cbind(tmp_df1, row)  
})


do.call(rbind, new_df)

But probably using the sp package is more efficient if you have million of rows.

spadarian
  • 1,604
  • 10
  • 14