
I'm looking for help optimizing my sqldf code, which generates aggregated historical stats based on non-equi joins, i.e. the data should only be aggregated up to the current row.

It's important that any solution works for many different groups, e.g. filtering the aggregations by tourney_name as in the sqldf example.

Get data:

library(dplyr); library(sqldf)

# read one season of ATP match results per year and collect the data frames in a list
data_list <- list()
for (i in 2000:2018) {
  data_list[[as.character(i)]] <-
    readr::read_csv(paste0('https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_', i, '.csv')) %>%
    as.data.frame()
}

# combine all years into one table and keep only the columns needed here
data <- data.table::rbindlist(data_list)
data <- select(data, tourney_name, tourney_date, match_num, winner_id, winner_name, loser_id, loser_name)

system.time(
    data2 <- sqldf("select a.*, 
                   count(b.winner_id) as winner_overall_wins
                   from data a 
                   left join data b 
                   on (a.winner_id = b.winner_id and a.tourney_date > b.tourney_date)
                   group by a.tourney_name, a.tourney_date, a.match_num, a.winner_id
                   order by tourney_date desc, tourney_name, match_num desc",
                   stringsAsFactors = FALSE)
) # takes 16 sec, would like to look for a vectorized solution

head(data2)

Approaches I've tried to speed up the code:

  1. For loop - too slow.

  2. Dplyr full join/filter - blew up memory past 60 GB (a sketch of this attempt follows this list).

  3. Data.table/cumsum - couldn't get the code to work properly. I'd prefer a non-data.table approach but am willing to learn one for a generalizable solution.
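Roughly what the join/filter attempt in 2. looked like (a sketch rather than my exact code; the "_prev" suffix is only for illustration). The intermediate self-join holds one row per pair of matches per winner, which is what exhausts memory on the full data:

# left join every match to all matches won by the same winner,
# then count the earlier-dated ones per match
data2_dplyr <- data %>%
  left_join(data, by = "winner_id", suffix = c("", "_prev")) %>%
  group_by(tourney_name, tourney_date, match_num, winner_id) %>%
  summarise(winner_overall_wins = sum(tourney_date_prev < tourney_date)) %>%
  ungroup()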

Thank you!

BigTimeStats
    This question is way too long (for me). Generally speaking, if your question is longer than a page you're asking for too much. Simplify your problem to its core essence. – eddi May 03 '18 at 21:53
  • Just wanted to show all my attempts... I'll try to simplify. Let me know if it's better now. – BigTimeStats May 03 '18 at 22:41
  • It's a start. It would be much better if you had a small(!) toy example instead with minimal number of columns and rows to illustrate what it is that you're trying to do. My guess is you want something along the lines of `a[b, on = .(p1_id, tourney_date > tourney_date), ...]`, but I don't feel like actually running all of your code to check, sorry. And do check out this guide on [asking great questions](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – eddi May 04 '18 at 15:29
  • Thanks for the help... I tried the join and aggregation in 1 swoop and it didn't work. Do you recommend doing the join, then the aggregation after? `data3a[data3b, list(p1_overall_wins2 = sum(p1_won), p1_overall_losses2 = sum(p1_won == 0), p1_overall_ace2 = mean(p1_ace, na.rm = TRUE)), by = list(tourney_name, tourney_date, match_num, p1_id), on = .(p1_id = p1_id2, tourney_date > tourney_date2), allow.cartesian = TRUE] ` – BigTimeStats May 04 '18 at 15:49
  • Also focused heavily on making my example reproducible and wanted to check on a large dataset, hence the long example. I'll try to boil it down to more basic concepts, see if I can figure it out, and then ask a simpler question. Thanks! – BigTimeStats May 04 '18 at 15:51
  • IME the only type of aggregation that reliably works during a join is `by = .EACHI`, everything else should probably be done after the join. – eddi May 04 '18 at 15:53
  • Made some additional edits to make the example even simpler – BigTimeStats May 09 '18 at 19:16

1 Answer


The OP asked to optimize a number of cascading sqldf statements (before the OP's edit). Unfortunately, the OP did not explain in words which aggregations were implemented, so a substantial amount of reverse engineering was required.

Anyhow, here is what I would do using data.table to achieve the same results. Execution time is down from 16 sec for OP's sqldf code to less than 0.2 sec for the data.table versions.

data.table versions of the edited example

The OP has edited the question to reduce the number of sqldf statements. Now, only one aggregate is computed.

The new column winner_overall_wins in data2 is the count of all matches the winner has won before the current tourney started. This number is attached to all of the winner's matches in that tourney. (Note this is a different aggregation than counting the matches won before the current match.)
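To make that distinction concrete, here is a hypothetical mini example (player id and dates invented): player 7 wins one match in tourney A and two in tourney B, then plays tourney C.

# hypothetical data, only to illustrate the tourney-level counting
mini <- data.frame(
  tourney_name = c("A", "B", "B", "C"),
  tourney_date = c(20180101L, 20180201L, 20180201L, 20180301L),
  winner_id    = 7L
)
# expected winner_overall_wins per row: 0, 1, 1, 3
# (both rows of tourney B get 1, because only the win from tourney A
#  happened before tourney B started)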

Since version 1.9.8 (on CRAN 25 Nov 2016), data.table has been capable of non-equi joins. In addition, fread() can be told to read only selected columns, which further speeds up I/O.

library(data.table)  # v1.11.2

urls <- sprintf(
  "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_%i.csv", 
  2000:2018)
selected_cols <- c("tourney_name", "tourney_date", "match_num", 
                   "winner_id", "winner_name", 
                   "loser_id", "loser_name") 

# read only selected columns from files & combine into one data object
matches <- rbindlist(lapply(urls, fread, select = selected_cols))

# non-equi join to compute aggregate, second join to append, order result 
system.time({
  result_nej <- matches[
    unique(matches[matches, on = .(winner_id, tourney_date < tourney_date), 
                   .(winner_overall_wins = .N), by = .EACHI]),
    on = .(winner_id, tourney_date)][
      order(-tourney_date, tourney_name, -match_num)]
})

The two data.table joins and the subsequent ordering took an elapsed time of around 0.15 sec on my system vs 16 to 19 sec for various runs of OP's sqldf code.

The history of a particular player can be retrieved by

p_name <- "Federer"; result_nej[winner_name %like% p_name | loser_id %like% p_name]
                     tourney_name tourney_date match_num winner_id   winner_name loser_id         loser_name winner_overall_wins
   1:             Australian Open     20180115       701    103819 Roger Federer   105227        Marin Cilic                1128
   2:             Australian Open     20180115       602    103819 Roger Federer   111202        Hyeon Chung                1128
   3:             Australian Open     20180115       504    103819 Roger Federer   104607      Tomas Berdych                1128
   4:             Australian Open     20180115       408    103819 Roger Federer   105916   Marton Fucsovics                1128
   5:             Australian Open     20180115       316    103819 Roger Federer   104755    Richard Gasquet                1128
  ---                                                                                                                           
1131:                   Marseille     20000207         3    103819 Roger Federer   102179      Antony Dupuis                   4
1132: Davis Cup WG R1: SUI vs AUS     20000204         2    103819 Roger Federer   102882 Mark Philippoussis                   3
1133:             Australian Open     20000117        90    103819 Roger Federer   102466        Jan Kroslak                   1
1134:             Australian Open     20000117        52    103819 Roger Federer   102021      Michael Chang                   1
1135:                    Adelaide     20000103         2    103819 Roger Federer   102533   Jens Knippschild                   0

There is an alternative and faster solution using cumsum() and shift():

system.time({
  # cumulative operations require ordered data
  setorder(matches, tourney_date, tourney_name, match_num)
  # add tourney id for convenience and conciseness
  matches[, t_id := rleid(tourney_date, tourney_name)]
  # aggregate by player and tourney
  p_t_hist <- matches[, .(winner_won = .N), by = .(winner_id, t_id)]
  # compute cumulative sum for each player and 
  # lag to show only matches of previous tourneys
  tmp <- p_t_hist[order(t_id), 
                  .(t_id, winner_overall_wins = shift(cumsum(winner_won))), 
                  by = winner_id]
  # append new column & order result
  result_css <- matches[tmp, on = .(t_id, winner_id)][order(-t_id)]
})
p_name <- "Federer"; result_css[winner_name %like% p_name | loser_id %like% p_name]

On my system, elapsed time is about 0.05 sec, which is roughly 3 times faster than the non-equi join variant and orders of magnitude faster than the OP's approach.
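One detail when comparing the two results: shift() fills a player's first tourney with NA, while the non-equi join variant reports 0 there (see the Adelaide row above). If zeros are preferred, shift()'s fill argument handles it; a minimal tweak of the step above:

# same cumulative computation, but a player's first tourney gets 0 instead of NA
tmp <- p_t_hist[order(t_id), 
                .(t_id, winner_overall_wins = shift(cumsum(winner_won), fill = 0L)), 
                by = winner_id]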

Uwe
  • Thanks for the work! I made the edits to simplify my question based on other people's feedback. I was able to utilize your data.table solution for my purposes and it's working very well. – BigTimeStats May 11 '18 at 21:00
  • For your data.table solution, does data.table allow joining on a parameter such as `on = .(winner_id, tourney_date < tourney_date, tourney_date > tourney_date - 365)` [in order to just get summarized wins for the last year, for example]? Getting an error that `tourney_date + 30` is not found – BigTimeStats May 15 '18 at 02:58
  • 1
    Please try to add a computed column, e.g., `tourney_date_m365 := tourney_date - 365`, to the second data.table and use the new column in the join, e.g. `on = .(winner_id, tourney_date < tourney_date, tourney_date > tourney_date_m365)` – Uwe May 15 '18 at 05:54
  • Was starting down that route as well... This is great, thanks for all your help! – BigTimeStats May 15 '18 at 15:55
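
Following up on the computed-column suggestion in the comments, a rough sketch of the rolling 365-day window (the names tourney_idate, tourney_idate_m365, current and wins_last_year are invented here, and an IDate conversion is added because the raw tourney_date is an integer in YYYYMMDD form, so subtracting 365 from it directly would not mean 365 days):

# sketch only: count each winner's wins in the 365 days before the current tourney
matches[, tourney_idate := as.IDate(as.character(tourney_date), format = "%Y%m%d")]

# the "second data.table" of the join carries the computed lower bound of the window
current <- matches[, .(winner_id, tourney_idate,
                       tourney_idate_m365 = as.IDate(tourney_idate - 365L))]

wins_last_year <- matches[current,
                          on = .(winner_id,
                                 tourney_idate < tourney_idate,
                                 tourney_idate >= tourney_idate_m365),
                          .(winner_last365_wins = .N), by = .EACHI]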