I have a similar situation of this question: Merging two sets of data by data.table roll='nearest' function
but, instead of merging two datasets by only one similar column, I want to merge those dfs by multiple columns according to the nearest values.
I have two data sets: temp_A (279 MB) and temp_B (15 MB)
temp_A = read.table("~/temp_A.txt", header = TRUE, sep = " ")
temp_B = read.table("~/temp_B.txt", header = TRUE, sep = " ")
> temp_A
depth latitude longitude var2
1 -3.79019 -44.09513 178.90263 0.01438153
2 -3.79019 -44.09513 178.99693 0.01442332
3 -3.79019 -44.09513 179.09120 0.01445489
4 -3.79019 -44.09513 179.18547 0.01447016
5 -3.79019 -44.09513 179.27978 0.01447304
---
246 -3.79019 -44.03582 96.87613 0.01549349
247 -3.79019 -44.03582 96.97041 0.01547825
248 -3.79019 -44.03582 97.06470 0.01546115
249 -3.79019 -44.03582 97.15897 0.01543248
250 -3.79019 -44.03582 97.25326 0.01539848
[ reached 'max' / getOption("max.print") -- omitted 7466023 rows ]
> temp_B
latitude longitude depth var1 var1_stdev
1: -39.88185 0.28285 -113.1400 2.786355 0.10973449
2: -39.88185 0.28285 -84.8550 2.938529 0.07388042
3: -39.88185 0.28285 -70.7125 2.980504 0.08564698
4: -39.88185 0.28285 -56.5700 3.168203 0.10311580
5: -39.88185 0.28285 -42.4275 3.368630 0.10937809
---
415747: -17.25385 203.36915 -28.2850 6.682614 0.69207738
415748: -17.25385 203.36915 -16.9710 6.541189 0.65926678
415749: -17.25385 203.36915 -11.3140 6.788966 0.62176087
415750: -17.25385 203.36915 -5.6570 6.420695 0.59754891
415751: -17.25385 203.36915 0.0000 6.206860 0.57254497
I want to add var2
to temp_B
by the nearest or equal values of longitude
, latitude
and depth
. I do not necessarily want to keep all the temp_A
's rows that do not match with temp_B
.
I tried to use this code, but I do not get the result expected:
setDT(temp_A)
setDT(temp_B)
setkey(temp_A, longitude, latitude, depth)
setkey(temp_B, longitude, latitude, depth)
# Should I create time column by which to do a rolling join?
# I do not see the point, so I did not.
#temp_A[, lon := longitude]
#temp_B[, lon := longitude]
#temp_A[, lat := latitude]
#temp_B[, lat := latitude]
#temp_A[, dep := depth]
#temp_B[, dep := depth]
#this should merge the two tables by the columns selected and considering the nearest value
temp_C = temp_B[temp_A, on = c("longitude", "latitude", "depth"), roll = "nearest"][order(latitude)]
This is what I get:
> summary(temp_C)
latitude longitude depth var1 var1_stdev var2
Min. :-44.10 Min. : 0.04714 Min. :-311.98 Min. : NA Min. : NA Min. :0.0001259
1st Qu.:-36.21 1st Qu.: 56.42858 1st Qu.: -48.08 1st Qu.: NA 1st Qu.: NA 1st Qu.:0.0062410
Median :-30.74 Median :109.03868 Median : -31.11 Median : NA Median : NA Median :0.0134268
Mean :-30.34 Mean :105.07238 Mean : -36.34 Mean :NaN Mean :NaN Mean :0.0112185
3rd Qu.:-24.44 3rd Qu.:149.39195 3rd Qu.: -19.94 3rd Qu.: NA 3rd Qu.: NA 3rd Qu.:0.0160985
Max. :-16.81 Max. :203.60488 Max. : -3.79 Max. : NA Max. : NA Max. :0.0191869
NA's :7466273 NA's :7466273
>
I obtained that The Var1 and Var1_stdev become just NAs:
temp_C[!is.na(temp_C$var1),]
Empty data.table (0 rows and 6 cols): latitude,longitude,depth,var1,var1_stdev,var2
Do you know why I get Nas? Do you have a better way to do this?