I have 2 data frames:
- One of them (named t1) consists of application id, user id, and application date
- the second one (named t2) consists of client id, scoring date, and score.
I would like to join the table t2 to table t1 using client id and the timestamps where we take the most recent value from t2 where the t2 date is not bigger than the t1 date field.
I could manage to merge the tables using SQL syntax. Unfortunately, I am having many thousands of rows in each table and hundreds of variables to join. THE current SQL method works but it is too slow. Is there a faster way how to do this task?
Here is the example with the R code:
t1 <- data.frame(
application_id = c(1:5),
application_time = c('2020-09-20',
'2020-09-22',
'2020-09-22',
'2020-09-23',
'2020-09-25'),
client_id = c(1, 2, 3, 1, 1)
)
t2 <- data.frame(
client_id = c(1, 1, 1, 1, 2, 3),
date_ = c('2020-09-19',
'2020-09-20',
'2020-09-25',
'2020-09-26',
'2020-09-26',
'2020-09-20'),
score = c(1:6)
)
## Merging the tables using SQL
library(sqldf)
df <- sqldf(
"SELECT
t1.*,
(SELECT t2.score FROM t2
WHERE t1.client_id = t2.client_id
AND t1.application_time >= t2.date_
ORDER BY t2.date_ DESC LIMIT 1) AS scorexxx
FROM t1"
)
The result:
application_id application_time client_id scorexxx
1 1 2020-09-20 1 2
2 2 2020-09-22 2 NA
3 3 2020-09-22 3 6
4 4 2020-09-23 1 2
5 5 2020-09-25 1 3