I have many large (300k - 1M row) dataframes that I'm trying to append values to by looping through the dataframes (df_i
) and for each one, looping through the rows and asking what the value in the second dataframe is (do2
) at matching latitude, longitude, month, and depth. Lat/lon/month will match exactly, the depth is trickier since do2 has 57 columns for values in increasing depth bins:
The core of my by-row loop code is a 3-ands row subset & a column subset:
for (j in 1:nrow(df_i)) {
df_i[j,"DO2"] <- do2[do2$Latitude == df_i[j,"latbin"] &
do2$Longitude == df_i[j,"lonbin"] &
do2$Month == month(df_i[j,"DateTimeUTCmin5"]),
which.min(abs(depthbins - df_i[j, "Depth.m."])) + 3]
}
This works, but it's slow. I know it could be sped up but my parallelisation efforts keep hitting walls, and debug/traceback is much harder in parallel. I tried FBM after reading this but get
value must be unique or of the dimension of x[i, j]
some 200k rows in. I understand data.table indexes are fast, so possibly something like Frank's comment here might work, maybe a multi-row subset in data.table? But presumably that would be the same approach as my existing solution (since I also need to subset/lookup columns), just maybe a bit quicker?
Does anyone know of a smarter approach? I've previously been confounded by apply functions but wouldn't be surprised if there was something useful there?
Thanks in advance.
Reproducible (simplified month, depthbins added as was omitted before):
depthbins <- c(0,5,10,15,20,25,50,75,100,125,150,200,250,300,350,400)
df_i <- data.frame(latbin = c(-77.5, -78, -78.5),
lonbin = c(-178.5, -177.5, -176.5),
month = c(1,2,3),
Depth.m. = c(130,120,110))
do2 <- tibble(Month = c(1,1,1),
Latitude = c(-78,-78,-79),
Longitude = c(-178.5, -177.5, -177.5),
"0" = c(214, 223, 345),
"5" = c(123,234,345),
"10" = c(345,456,567))
Final edit: some tweaks to Marius' code:
do2 %<>% gather(.vars = colnames(do2)[4:length(colnames(do2))],
key = "depbin", value = "DO2")
do2$depbin <- as.numeric(do2$depbin)
depthbins <- sort(unique(do2$depbin))
df_i$depbin = sapply(df_i$Depth.m., function(d) depthbins[which.min(abs(depthbins - d))])
df_i %<>% left_join(do2, by = c("Month" = "Month",
"latbin" = "Latitude",
"lonbin" = "Longitude",
"depbin" = "depbin")) %>%
select(-Month, -latbin, -lonbin, -depbin)