1

I have a similar situation of this question: Merging two sets of data by data.table roll='nearest' function

but, instead of merging two datasets by only one similar column, I want to merge those dfs by multiple columns according to the nearest values.

I have two data sets: temp_A (279 MB) and temp_B (15 MB)

temp_A = read.table("~/temp_A.txt", header = TRUE, sep = " ")
temp_B = read.table("~/temp_B.txt", header = TRUE, sep = " ")
   

 > temp_A
       depth  latitude longitude       var2
1   -3.79019 -44.09513 178.90263 0.01438153
2   -3.79019 -44.09513 178.99693 0.01442332
3   -3.79019 -44.09513 179.09120 0.01445489
4   -3.79019 -44.09513 179.18547 0.01447016
5   -3.79019 -44.09513 179.27978 0.01447304
    ---    
246 -3.79019 -44.03582  96.87613 0.01549349
247 -3.79019 -44.03582  96.97041 0.01547825
248 -3.79019 -44.03582  97.06470 0.01546115
249 -3.79019 -44.03582  97.15897 0.01543248
250 -3.79019 -44.03582  97.25326 0.01539848
 [ reached 'max' / getOption("max.print") -- omitted 7466023 rows ]

> temp_B
         latitude longitude     depth     var1 var1_stdev
     1: -39.88185   0.28285 -113.1400 2.786355 0.10973449
     2: -39.88185   0.28285  -84.8550 2.938529 0.07388042
     3: -39.88185   0.28285  -70.7125 2.980504 0.08564698
     4: -39.88185   0.28285  -56.5700 3.168203 0.10311580
     5: -39.88185   0.28285  -42.4275 3.368630 0.10937809
    ---                                                  
415747: -17.25385 203.36915  -28.2850 6.682614 0.69207738
415748: -17.25385 203.36915  -16.9710 6.541189 0.65926678
415749: -17.25385 203.36915  -11.3140 6.788966 0.62176087
415750: -17.25385 203.36915   -5.6570 6.420695 0.59754891
415751: -17.25385 203.36915    0.0000 6.206860 0.57254497

I want to add var2 to temp_B by the nearest or equal values of longitude, latitude and depth. I do not necessarily want to keep all the temp_A's rows that do not match with temp_B.

I tried to use this code, but I do not get the result expected:

setDT(temp_A)
setDT(temp_B)
setkey(temp_A, longitude, latitude, depth)
setkey(temp_B, longitude, latitude, depth)

# Should I create time column by which to do a rolling join?
# I do not see the point, so I did not.
#temp_A[, lon := longitude]
#temp_B[, lon := longitude]
#temp_A[, lat := latitude]
#temp_B[, lat := latitude]
#temp_A[, dep := depth]
#temp_B[, dep := depth]


#this should merge the two tables by the columns selected and considering the nearest value
temp_C = temp_B[temp_A, on = c("longitude", "latitude", "depth"), roll = "nearest"][order(latitude)]


 

This is what I get:

> summary(temp_C)
    latitude        longitude             depth              var1           var1_stdev           var2          
 Min.   :-44.10   Min.   :  0.04714   Min.   :-311.98   Min.   : NA       Min.   : NA       Min.   :0.0001259  
 1st Qu.:-36.21   1st Qu.: 56.42858   1st Qu.: -48.08   1st Qu.: NA       1st Qu.: NA       1st Qu.:0.0062410  
 Median :-30.74   Median :109.03868   Median : -31.11   Median : NA       Median : NA       Median :0.0134268  
 Mean   :-30.34   Mean   :105.07238   Mean   : -36.34   Mean   :NaN       Mean   :NaN       Mean   :0.0112185  
 3rd Qu.:-24.44   3rd Qu.:149.39195   3rd Qu.: -19.94   3rd Qu.: NA       3rd Qu.: NA       3rd Qu.:0.0160985  
 Max.   :-16.81   Max.   :203.60488   Max.   :  -3.79   Max.   : NA       Max.   : NA       Max.   :0.0191869  
                                                        NA's   :7466273   NA's   :7466273                      
> 

I obtained that The Var1 and Var1_stdev become just NAs:

temp_C[!is.na(temp_C$var1),]

Empty data.table (0 rows and 6 cols): latitude,longitude,depth,var1,var1_stdev,var2

Do you know why I get Nas? Do you have a better way to do this?

Strobila
  • 317
  • 3
  • 15

0 Answers0