Dataframes of unequal length result in merge to introduce invalid values due to nearest datetime

Question

I merged two dataframes, ~251k rows and ~237k rows, respectively, based on nearest datetime. However, I have an issue with the output.

Here are small, hypothetical examples.

Large dataframe:

dflarge <- data.frame(datetime = c("2021-01-01 12:47:16", "2021-01-01 13:47:16", "2021-01-01 14:47:16", "2021-01-01 15:47:16", "2021-01-01 16:47:16"))

Converting to datetime format:

dflarge$datetime <- as.POSIXct(dflarge$datetime)

tibble(dflarge)

# A tibble: 5 x 1
  datetime           
  <dttm>             
1 2021-01-01 12:47:16
2 2021-01-01 13:47:16
3 2021-01-01 14:47:16
4 2021-01-01 15:47:16
5 2021-01-01 16:47:16

Small dataframe and necessary format conversions:

dfsmall <- data.frame(datetime = c("2021-01-01 15:00:00", "2021-01-01 16:00:00", "2021-01-01 17:00:00"), value = c("0.5", "1.0", "1.5"))

dfsmall$datetime <- as.POSIXct(dfsmall$datetime)

dfsmall$value <- as.numeric(dfsmall$value)

tibble(dfsmall)

# A tibble: 3 x 2
  datetime              value
  <dttm>              <dbl>
1 2021-01-01 15:00:00   0.5
2 2021-01-01 16:00:00   1  
3 2021-01-01 17:00:00   1.5

Now I perform the merge...

library(data.table)

setDT(dflarge)[, value := setDT(dfsmall)[dflarge, value, on = "datetime", roll = "nearest"]]

tibble(dflarge)

# A tibble: 5 x 2
  datetime              value
  <dttm>              <dbl>
1 2021-01-01 12:47:16   0.5
2 2021-01-01 13:47:16   0.5
3 2021-01-01 14:47:16   0.5
4 2021-01-01 15:47:16   1  
5 2021-01-01 16:47:16   1.5

Despite the logic behind the result, as you can see the first two records have also had the value 0.5 assigned to it but this is incorrect!

To remove or modify these values manually will not suffice i.e., scrolling through 1/4 million records and finding where the duplications start. The scripts I am compiling is for autonomous database merging and appending.

I basically require a function that only matches the dfshort$value to a dflarge$datetime that is around e.g., the nearest 1 hour of dfshort$datetime, not >= 2 hours apart AND, subsequently replace irrelevant values with NA or NAN.

s_baldur · Accepted Answer · 2021-03-17T13:24:25.687

1

Burrowing ideas from a solution to a similar question:

setDT(dflarge)
setDT(dfsmall)
dflarge[, joindatetime := datetime + 3600] # we add 3600 secs (one hour)
dflarge[, value := dfsmall[dflarge, value, on = .(datetime = joindatetime), roll = 7200]
        ][, joindatetime := NULL]
#               datetime value
# 1: 2021-01-01 12:47:16  <NA>
# 2: 2021-01-01 13:47:16  <NA>
# 3: 2021-01-01 14:47:16   0.5
# 4: 2021-01-01 15:47:16   1.0
# 5: 2021-01-01 16:47:16   1.5

edited Mar 17 '21 at 13:24

answered Mar 17 '21 at 13:19

s_baldur

29,441
4
36
69

Thanks @sindri_baldur, such genius is priceless! Re., searching for solutions, the issue is what keywords to search for and thus finding solutions remain an issue for less advanced users on this platform. – Rion Lerm Mar 18 '21 at 06:55

Dataframes of unequal length result in merge to introduce invalid values due to nearest datetime

1 Answers1