
Suppose I have two data frames, df1 and df2.

df1 <- data.frame(value = 1:5, timestamp = as.POSIXct(c("2020-03-02 12:20:00", "2020-03-02 12:20:01", "2020-03-02 12:20:03", "2020-03-02 12:20:05", "2020-03-02 12:20:08")))

df2 <- data.frame(value = 6:10, timestamp = as.POSIXct(c("2020-03-02 12:20:01", "2020-03-02 12:20:02", "2020-03-02 12:20:03", "2020-03-02 12:20:04", "2020-03-02 12:20:05")))

df1

  value           timestamp
1     1 2020-03-02 12:20:00
2     2 2020-03-02 12:20:01
3     3 2020-03-02 12:20:03
4     4 2020-03-02 12:20:05
5     5 2020-03-02 12:20:08

df2

  value           timestamp
1     6 2020-03-02 12:20:01
2     7 2020-03-02 12:20:02
3     8 2020-03-02 12:20:03
4     9 2020-03-02 12:20:04
5    10 2020-03-02 12:20:05

Now, I want to keep df1 and left join df2 onto it by timestamp. Since the timestamps do not match exactly, what I want to do is:

  1. If there is an exact match, left join the value from df2.
  2. If there is no exact match, match against the latest earlier timestamp and left join that value.
  3. If there is no match at all (no earlier timestamp), return NA.

Therefore, my expected output would look like this:

data.frame(df1, value.df2 = c(NA, 6, 8, 10, 10))

  value           timestamp value.df2
1     1 2020-03-02 12:20:00        NA
2     2 2020-03-02 12:20:01         6
3     3 2020-03-02 12:20:03         8
4     4 2020-03-02 12:20:05        10
5     5 2020-03-02 12:20:08        10

I hope I can do this with the tidyverse or data.table.

Bob
  • `data.table` has a `roll` argument which may help: https://stackoverflow.com/questions/23342647/how-to-match-by-nearest-date-from-two-data-frames?answertab=votes#tab-top – user20650 Jan 23 '21 at 12:46

2 Answers


Here are several alternatives. I find the SQL solution the most descriptive. The base solution is short and has no dependencies. The data.table approach is likely fast and the code is compact, but you need to read the documentation carefully to confirm it is doing what you want, since, unlike the first two solutions, that is not obvious from the code. The dplyr/fuzzyjoin solution may be of interest if you are using the tidyverse.

1) sqldf Perform a left join that attaches to each row of df1 (aliased a) every row of df2 (aliased b) whose timestamp is less than or equal to it, and then keep only the b row having the maximum timestamp among those joined to each a row. Note that SQLite guarantees that when max is used on a particular column, any other columns selected from that table come from that same row.

For large data, add the argument dbname = tempfile() to the sqldf call and it will perform the join out of memory, so that R's memory limitations don't apply (a sketch follows the output below). It would also be possible to add an index to the data to speed it up.

library(sqldf)

sqldf("select max(b.timestamp), a.*, b.value as 'value.df2'
  from df1 a
  left join df2 b on b.timestamp <= a.timestamp
  group by a.timestamp
  order by a.timestamp"
)[-1]

giving:

  value           timestamp value.df2
1     1 2020-03-02 12:20:00        NA
2     2 2020-03-02 12:20:01         6
3     3 2020-03-02 12:20:03         8
4     4 2020-03-02 12:20:05        10
5     5 2020-03-02 12:20:08        10
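
The out-of-memory variant mentioned above looks like this; it is the same query, with the only change being the extra dbname argument so the intermediate database lives in a temporary file rather than in memory:

library(sqldf)

sqldf("select max(b.timestamp), a.*, b.value as 'value.df2'
  from df1 a
  left join df2 b on b.timestamp <= a.timestamp
  group by a.timestamp
  order by a.timestamp",
  dbname = tempfile()
)[-1]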

Note that it can be used within a magrittr pipeline by placing the sqldf call inside braces and referring to the left-hand side as [.] within the SQL statement:

library(magrittr)
library(sqldf)

df1 %>%
  { sqldf("select max(b.timestamp), a.*, b.value as 'value.df2'
      from [.] a
      left join df2 b on b.timestamp <= a.timestamp
      group by a.timestamp
      order by a.timestamp")[-1]
  }

2) base For each timestamp in df1, find the df2 timestamps that are less than or equal to it and take the value at the last one, or NA if there are none.

# last df2 value whose timestamp is <= tt, or NA if there is none
Match <- function(tt) with(df2, tail(c(NA, value[timestamp <= tt]), 1))
transform(df1, value.df2 = sapply(timestamp, Match))
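
A vectorized sketch of the same idea uses findInterval (an alternative not shown in the original answer; it assumes df2$timestamp is sorted, which it is here):

# index of the last df2 timestamp <= each df1 timestamp (0 if there is none)
idx <- findInterval(df1$timestamp, df2$timestamp)
idx[idx == 0] <- NA  # 0 means no earlier timestamp, so return NA
transform(df1, value.df2 = df2$value[idx])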

3) data.table This package supports rolling joins:

library(data.table)

as.data.table(df2)[df1, on = .(timestamp), roll = TRUE]
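
This returns df2's value column under the name value and df1's as i.value. A sketch of one way (my own addition, not part of the original answer) to get columns named like the expected output, using data.table's x. and i. prefixes in j:

library(data.table)

as.data.table(df2)[df1,
                   .(value = i.value, timestamp, value.df2 = x.value),
                   on = .(timestamp), roll = TRUE]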

4) dplyr/fuzzyjoin fuzzy_left_join joins to each row of df1 all rows of df2 whose timestamp is less than or equal to it. Then, for each df1 row, we take the last joined row and fix up the names.

library(dplyr)
library(fuzzyjoin)

df1 %>%
  fuzzy_left_join(df2, by = "timestamp", match_fun = `>=`) %>%
  group_by(timestamp.x) %>%
  slice(n()) %>%
  ungroup %>%
  select(timestamp = timestamp.x, value = value.x, value.df2 = value.y)
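
As an aside (newer than this thread, so not part of the original answer): since dplyr 1.1.0, rolling joins are available natively through join_by(closest(...)), with no fuzzyjoin dependency:

library(dplyr)  # requires dplyr >= 1.1.0

left_join(df1, df2,
          by = join_by(closest(timestamp >= timestamp)),
          suffix = c("", ".df2"))

Here closest(timestamp >= timestamp) matches each df1 row to the df2 row with the largest timestamp still less than or equal to it, which is exactly the rolling behaviour described in the question.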

  
G. Grothendieck
  • Thanks for the help. Since I'm using the tidyverse, the dplyr/fuzzyjoin solution works well for me. – Bob Jan 25 '21 at 01:53
  • Since the two data frames are quite large (one million rows), the tidyverse version runs into memory problems. I think I have to learn data.table, though its documentation is not as clear as the tidyverse's. – Bob Jan 25 '21 at 03:07
  • The `fuzzyjoin` version doesn't seem to scale well if your `data.frame`s are between 10^5 and 10^6 rows. – Matias Andina Aug 03 '22 at 15:24

Using the tidyverse package, it can be done this simple way:

df1 <- data.frame(value = 1:5, timestamp = as.POSIXct(c("2020-03-02 12:20:00", "2020-03-02 12:20:01", "2020-03-02 12:20:03", "2020-03-02 12:20:05", "2020-03-02 12:20:08")))
df2 <- data.frame(value = 6:10, timestamp = as.POSIXct(c("2020-03-02 12:20:01", "2020-03-02 12:20:02", "2020-03-02 12:20:03", "2020-03-02 12:20:04", "2020-03-02 12:20:05")))

library(tidyverse)
left_join(df1, df2, by = 'timestamp')

#  value.x           timestamp value.y
#1       1 2020-03-02 12:20:00      NA
#2       2 2020-03-02 12:20:01       6
#3       3 2020-03-02 12:20:03       8
#4       4 2020-03-02 12:20:05      10
#5       5 2020-03-02 12:20:08      NA
Daniel James
  • Thanks, but your output is different from my expected output (row 5). – Bob Jan 25 '21 at 01:44
  • Can you tell me how the elements in the `4th row, 3rd column` and the `5th row, 3rd column` are not the same? `10`, `10` – Daniel James Jan 25 '21 at 02:01
  • The timestamps for df1's 4th and 5th rows are 12:20:05 and 12:20:08. My intention is to find the df2 row whose timestamp is less than or equal to df1's timestamp and join df2's value to df1. Since the timestamp of df2's 5th row is 12:20:05 and its value is 10, that value 10 is copied to both df1's 4th and 5th rows. – Bob Jan 25 '21 at 02:39