
I have two data frames:

  • One of them (named t1) consists of application id, client id, and application date.
  • The second one (named t2) consists of client id, scoring date, and score.

I would like to join table t2 to table t1 on client id and the timestamps, taking for each row of t1 the most recent score from t2 whose date is not greater than the t1 application date.

I managed to merge the tables using SQL syntax. Unfortunately, I have many thousands of rows in each table and hundreds of variables to join. The current SQL method works, but it is too slow. Is there a faster way to do this task?

Here is an example in R:

## Applications: one row per application
t1 <- data.frame(
  application_id = 1:5,
  application_time = c('2020-09-20', 
                       '2020-09-22',
                       '2020-09-22',
                       '2020-09-23',
                       '2020-09-25'),
  client_id = c(1, 2, 3, 1, 1)
)

## Scores: one row per client per scoring date
t2 <- data.frame(
  client_id = c(1, 1, 1, 1, 2, 3),
  date_ = c('2020-09-19',
            '2020-09-20',
            '2020-09-25',
            '2020-09-26',
            '2020-09-26',
            '2020-09-20'),
  score = 1:6
)

## Merging the tables using SQL: for each application, a correlated subquery
## picks the latest score dated on or before the application date
library(sqldf)
df <- sqldf(
  "SELECT 
    t1.*,
    (SELECT t2.score FROM t2
      WHERE t1.client_id = t2.client_id
        AND t1.application_time >= t2.date_
      ORDER BY t2.date_ DESC LIMIT 1) AS scorexxx
  FROM t1"
)

The result:

  application_id application_time client_id scorexxx
1              1       2020-09-20         1        2
2              2       2020-09-22         2       NA
3              3       2020-09-22         3        6
4              4       2020-09-23         1        2
5              5       2020-09-25         1        3
  • Perhaps have a look at nearest: https://stackoverflow.com/questions/54013468/merging-two-sets-of-data-by-data-table-roll-nearest-function – Waldi Sep 28 '20 at 08:48
  • Can you try if this `dplyr` solution is of any help? `inner_join(t1, t2, by = 'client_id') %>% mutate(diff = application_time - date_) %>% filter(diff >= 0) %>% group_by(application_id) %>% slice(which.min(diff))` Is it any faster? (A runnable version is sketched after these comments.) – Ronak Shah Sep 28 '20 at 10:04
  • You could look into the package `data.table` – Bertil Baron Sep 28 '20 at 10:04
  • Thank you @Waldi & Bertil Baron. data.table merged the data almost instantaneously. Ronak Shah, I didn't get to try your solution on the big data, as data.table made the problem vanish, but it definitely seems more feasible than SQL. – LordRudolf Sep 28 '20 at 12:29
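
For reference, here is a runnable version of the `dplyr` idea from the comment above. This is a minimal sketch, with `t1_d`, `t2_d`, and `df_dplyr` as illustrative names; the character dates must first be converted to Date class for the subtraction to work:

library(dplyr)

## Copies with Date-class columns, so the original frames stay unchanged
t1_d <- t1 %>% mutate(application_time = as.Date(application_time))
t2_d <- t2 %>% mutate(date_ = as.Date(date_))

df_dplyr <- inner_join(t1_d, t2_d, by = 'client_id') %>%
  mutate(diff = application_time - date_) %>%  # days between application and score
  filter(diff >= 0) %>%                        # keep only scores on or before the application
  group_by(application_id) %>%
  slice(which.min(diff)) %>%                   # most recent eligible score per application
  ungroup()

Note that the `inner_join` drops applications with no eligible score (application 2 here), whereas the SQL version returns NA for them.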

1 Answer

I hope you find the solution helpful. Note: please check the comments for helpful material too.

Thank you for providing the creation queries in your question. The CTE saves a temporary view of a JOIN of your tables, matching rows on the same client_id and on the date comparison. After that, the MAX discards the unneeded rows created in the CTE, keeping only the latest eligible date per application. The score is left out of that aggregation, since grouping on it would leave more than one row per application; a final JOIN brings the score information back from t2.

df1 <- sqldf(
  "WITH cte AS (
     SELECT t1.*, t2.date_
     FROM t1
     LEFT JOIN t2
       ON t1.client_id = t2.client_id
      AND t1.application_time >= t2.date_
   )
   SELECT A.*, B.score
   FROM (
     SELECT cte.application_id, cte.application_time, cte.client_id,
            MAX(cte.date_) AS my_Date
     FROM cte
     GROUP BY cte.application_id, cte.application_time, cte.client_id
   ) A
   LEFT JOIN t2 AS B
     ON A.client_id = B.client_id
    AND A.my_Date = B.date_"
)
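
As a quick sanity check, the CTE result can be compared with the correlated-subquery result `df` from the question (a hypothetical snippet; the GROUP BY does not guarantee row order, hence the sort):

## Align row order, then compare the two score columns
df1_sorted <- df1[order(df1$application_id), ]
all.equal(df$scorexxx, df1_sorted$score)  # expected: TRUE
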
  • Thanks for the recommendation. The previously mentioned `data.table` provides the best merge speed: `t1[, join_time := application_time]; t2[, join_time := date_]; setkey(t1, client_id, join_time); setkey(t2, client_id, join_time); t1[t2, roll = -Inf]` (a runnable sketch follows below). The query logic you provided will help me with fast MySQL joins. – LordRudolf Oct 01 '20 at 06:35
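
For completeness, here is a minimal runnable sketch of the `data.table` rolling join referenced in the comment above; `t1_dt`, `t2_dt`, `result`, and the helper column `join_time` are illustrative names. One caution: for "take the most recent t2 score not after the t1 date", the conventional direction is `t2[t1, roll = Inf]` (last observation carried forward onto each application), which reproduces the SQL result from the question:

library(data.table)

## Copies as data.tables, with Date-class join columns
t1_dt <- as.data.table(t1)[, join_time := as.Date(application_time)]
t2_dt <- as.data.table(t2)[, join_time := as.Date(date_)]

setkey(t1_dt, client_id, join_time)
setkey(t2_dt, client_id, join_time)

## One row per application: the latest score dated on or before the
## application date; applications with no prior score get NA
result <- t2_dt[t1_dt, roll = Inf]

`roll = Inf` (equivalently `roll = TRUE`) rolls each client's last score forward in time, which is exactly the "not greater than" condition from the question.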