I merged two dataframes, ~251k rows and ~237k rows, respectively, based on nearest datetime. However, I have an issue with the output.
Here are small, hypothetical examples.
Large dataframe:
dflarge <- data.frame(datetime = c("2021-01-01 12:47:16", "2021-01-01 13:47:16", "2021-01-01 14:47:16", "2021-01-01 15:47:16", "2021-01-01 16:47:16"))
Converting to datetime format:
dflarge$datetime <- as.POSIXct(dflarge$datetime)
library(tibble)
tibble(dflarge)
# A tibble: 5 x 1
datetime
<dttm>
1 2021-01-01 12:47:16
2 2021-01-01 13:47:16
3 2021-01-01 14:47:16
4 2021-01-01 15:47:16
5 2021-01-01 16:47:16
Small dataframe and necessary format conversions:
dfsmall <- data.frame(datetime = c("2021-01-01 15:00:00", "2021-01-01 16:00:00", "2021-01-01 17:00:00"), value = c("0.5", "1.0", "1.5"))
dfsmall$datetime <- as.POSIXct(dfsmall$datetime)
dfsmall$value <- as.numeric(dfsmall$value)
tibble(dfsmall)
# A tibble: 3 x 2
datetime value
<dttm> <dbl>
1 2021-01-01 15:00:00 0.5
2 2021-01-01 16:00:00 1
3 2021-01-01 17:00:00 1.5
Now I perform the merge...
library(data.table)
setDT(dflarge)[, value := setDT(dfsmall)[dflarge, value, on = "datetime", roll = "nearest"]]
tibble(dflarge)
# A tibble: 5 x 2
datetime value
<dttm> <dbl>
1 2021-01-01 12:47:16 0.5
2 2021-01-01 13:47:16 0.5
3 2021-01-01 14:47:16 0.5
4 2021-01-01 15:47:16 1
5 2021-01-01 16:47:16 1.5
The logic of the rolling join explains the result, but as you can see the first two records have also been assigned the value 0.5, which is incorrect!
Removing or modifying these values manually will not suffice, i.e., scrolling through a quarter of a million records to find where the incorrect matches start. The scripts I am compiling are for autonomous database merging and appending.
I basically require a function that only matches dfsmall$value to a dflarge$datetime that lies within, e.g., 1 hour of the corresponding dfsmall$datetime (not 2 or more hours apart), and subsequently replaces the irrelevant values with NA or NaN.
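Something roughly along these lines is what I am imagining (a rough, untested sketch; match_time and tol are placeholder names I am introducing here, and the idea is simply to keep the matched timestamp from dfsmall so the distance of each match can be checked afterwards):

library(data.table)

setDT(dflarge)
setDT(dfsmall)

# keep a copy of dfsmall's own timestamp so the distance of each match can be checked
dfsmall[, match_time := datetime]

tol <- 60 * 60  # tolerance in seconds (1 hour); matches further away should become NA

# same nearest rolling join as above, but also carrying match_time across
dflarge[, c("value", "match_time") :=
          dfsmall[dflarge, .(value, match_time), on = "datetime", roll = "nearest"]]

# blank out values whose nearest match is further away than the tolerance,
# then drop the helper column
dflarge[abs(as.numeric(datetime) - as.numeric(match_time)) > tol, value := NA_real_]
dflarge[, match_time := NULL]

tibble(dflarge)

Is this a sensible way to do it, or is there a cleaner/faster approach for dataframes of this size?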