0

I have two databases, one with daily metrics df and one with benchmarked metrics (with a date attached) benchmark_df. I would like to find the percentage of the daily metrics compared to the max of the benchmark metrics - but only for benchmark dates that are less than or equal to the previous dates.

So a straight forward answer would be using left_join() and mutate(perc_benchmark = metric_a / benchmark_metric_a) where benchmark_metric_a is the the max value less than or equal to the metric_a date... but I'm not sure if I can do this with a left_join() or if there is a workaround.

So in the example data frames below. Player B would have a perc_benchmark of 50% (0.5/1) from 8-1 to 8-6, 8-7 would be 25% (0.5/2), and then 8-8 would be 75% (1.5/2), etc.

df <- tibble(player =  rep(LETTERS[1:2], times = 21),
             date = as.Date(rep(seq(as.Date('2022-08-01'), by = 'day', length.out= 21), each = 2)),
             # metric_a values are only patterned for testing purposes, they would not be in the actual dataframe
             metric_a = c(rep(0.5, times = 14),rep(c(0.5, 1.5), times = 7),rep(c(0.5, 3), times = 7)))

benchmark_df <- tibble(player = rep(LETTERS[1:2], times = 3),
                       date = as.Date(rep(c('2022-08-01', '2022-08-07', '2022-08-16'), each = 2)),
                       benchmark_metric_a = c(1,1,0.5,2,1,3))
seansteele
  • 619
  • 3
  • 10
  • You can do [non-equi joins](https://www.r-bloggers.com/2021/02/the-unequalled-joy-of-non-equi-joins/#non-equi-joins-with-sqldf) in R, with [**`sqldf`**](https://cran.r-project.org/package=sqldf) or [**`data.table`**](https://cran.r-project.org/package=data.table). Also, you you mind providing a **sample of your desired output**? – Greg Aug 30 '22 at 19:18

1 Answers1

1

The dplyr package is developing support for non-equi *_join()s.

...using left_join()...where benchmark_metric_a is the the max value less than or equal to the metric_a date...

In the meantime, here's a solution that uses a non-equi join with the data.table package, whose performance is superior at scale.

Solution

First prepare your datasets as data.tables.

library(data.table)


# ...
# Code to generate 'df' and 'benchmark_df'.
# ...


# Convert datasets into 'data.table's.
setDT(df)
setDT(benchmark_df)

Then apply the following data.table "chain", which is analogous to the dplyr workflow:

# Perform a LEFT JOIN of 'df' to 'benchmark_df'...
result <- benchmark_df[df, .(  
    # ...using a subset of columns...
    player,
    bench_date = x.date,
    benchmark_metric_a,
    metric_date = i.date,
    metric_a
  ),
  # ...where players match and benchmark dates are earlier (or concurrent). 
  on = .(player, date <= date)
  
# Calculate the max benchmark...
][, .(
    benchmark_metric_a = max(benchmark_metric_a),
    # ...while preserving the metric...
    metric_a = first(metric_a)
  ),
  # ...for each player as of each date.
  by = .(player, metric_date)
  
# Calculate the percentage.
][, `:=`(
  perc_benchmark = metric_a / benchmark_metric_a
  
# Select the desired columns.
)][, .(
  player,
  date = metric_date,
  metric_a,
  perc_benchmark
  
# Restore original ordering from 'df': sort by 'date' and then 'player'.
)][order(
  date,
  player
)]

Result

Given a df and benchmark_df like your examples

library(dplyr)

df <- tibble(
  player =  rep(LETTERS[1:2], times = 21),
  date = as.Date(rep(seq(as.Date('2022-08-01'), by = 'day', length.out= 21), each = 2)),
  # metric_a values are only patterned for testing purposes, they would not be in the actual dataframe
  metric_a = c(rep(0.5, times = 14),rep(c(0.5, 1.5), times = 7),rep(c(0.5, 3), times = 7))
)

benchmark_df <- tibble(
  player = rep(LETTERS[1:2], times = 3),
  date = as.Date(rep(c('2022-08-01', '2022-08-07', '2022-08-16'), each = 2)),
  benchmark_metric_a = c(1,1,0.5,2,1,3)
)

this chain should yield the following data.table for result, where the df data is now augmented with the perc_benchmark column:

    player       date metric_a perc_benchmark
 1:      A 2022-08-01      0.5           0.50
 2:      B 2022-08-01      0.5           0.50
 3:      A 2022-08-02      0.5           0.50
 4:      B 2022-08-02      0.5           0.50
 5:      A 2022-08-03      0.5           0.50
 6:      B 2022-08-03      0.5           0.50
 7:      A 2022-08-04      0.5           0.50
 8:      B 2022-08-04      0.5           0.50
 9:      A 2022-08-05      0.5           0.50
10:      B 2022-08-05      0.5           0.50
11:      A 2022-08-06      0.5           0.50
12:      B 2022-08-06      0.5           0.50
13:      A 2022-08-07      0.5           0.50
14:      B 2022-08-07      0.5           0.25
15:      A 2022-08-08      0.5           0.50
16:      B 2022-08-08      1.5           0.75
17:      A 2022-08-09      0.5           0.50
18:      B 2022-08-09      1.5           0.75
19:      A 2022-08-10      0.5           0.50
20:      B 2022-08-10      1.5           0.75
21:      A 2022-08-11      0.5           0.50
22:      B 2022-08-11      1.5           0.75
23:      A 2022-08-12      0.5           0.50
24:      B 2022-08-12      1.5           0.75
25:      A 2022-08-13      0.5           0.50
26:      B 2022-08-13      1.5           0.75
27:      A 2022-08-14      0.5           0.50
28:      B 2022-08-14      1.5           0.75
29:      A 2022-08-15      0.5           0.50
30:      B 2022-08-15      3.0           1.50
31:      A 2022-08-16      0.5           0.50
32:      B 2022-08-16      3.0           1.00
33:      A 2022-08-17      0.5           0.50
34:      B 2022-08-17      3.0           1.00
35:      A 2022-08-18      0.5           0.50
36:      B 2022-08-18      3.0           1.00
37:      A 2022-08-19      0.5           0.50
38:      B 2022-08-19      3.0           1.00
39:      A 2022-08-20      0.5           0.50
40:      B 2022-08-20      3.0           1.00
41:      A 2022-08-21      0.5           0.50
42:      B 2022-08-21      3.0           1.00
    player       date metric_a perc_benchmark

Note

If you want to convert this back into tibble form, then simply use as_tibble() on the result.

Greg
  • 3,054
  • 6
  • 27
  • 1
    Working my way through your solution and it looks like the max call is grabbing the max from both players where as I want to grab the max per player. Probably wasn't clear in my question. This is my first time using data.table to this extent so I'm not sure how to add that into your existing code. – seansteele Aug 31 '22 at 00:00
  • 1
    I looked at your edits and it appears you had it initially with the `on = .(player, date >=date)` – seansteele Aug 31 '22 at 00:03
  • @seansteele That's probably the way to go: `on = .(player, date >=date)`. – Greg Aug 31 '22 at 01:22
  • @seansteele I have edited the solution accordingly. If the revised solution achieves your desired result, kindly upvote. :) – Greg Aug 31 '22 at 01:42