I have 2 dataframes:
df1<-structure(list(id = c(1, 2), date = structure(c(1483636800, 1485192000
), class = c("POSIXct", "POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA,
-2L))
df2<-structure(list(id.1 = structure(1:3, .Label = c("A", "B", "C"
), class = "factor"), sunrise = structure(c(1483617946, 1485198384,
1485205584), class = c("POSIXct", "POSIXt"), tzone = "")), class = "data.frame", row.names = c(NA,
-3L))
df1:
id date
1 1 2017-01-05 12:20:00
2 2 2017-01-23 12:20:00
df2:
id.1 sunrise
1 A 2017-01-05 07:05:46
2 B 2017-01-23 14:06:24
3 C 2017-01-23 16:06:24
I would like to find the closest sunrise date in df2 to each of the dates in df1, and then calculate the time difference (in hours) and put these new values in new column "closest" in df1. To find the time until sunrise, whereby a negative value would indicate a period after sunrise and a positive value a period before sunrise.
My df1 is very large > 100 million rows, so it's important that the solution works fast and efficient. The solution I found cannot accommodate the size of my data set, not even on a single column (will return: "Error: vector memory exhausted (limit reached?)". Attempts to remedy this issue were so far unsuccessful.
sDTA <- data.table(df1)[2]
rDTB <- data.table(df2)
try1<-sDTA[, closest := rDTB[sDTA, on = .(sunrise = date), roll = "nearest", x.sunrise]][]
try1$Time_until_sunrise<difftime(try1$closest,try1$date,units="hours")
I previously encountered exhausted memory issues while trying to use aggregate() functions, but was able to 'repair' this by replacing these with those that used dplyr. Perhaps this could be a solution again.