The OP has asked to optimize a number of cascading `sqldf` statements (before the OP's edit). Unfortunately, the OP has not explained in words which aggregations he has implemented, so a substantial amount of reverse engineering was required.

Anyhow, here is what I would do using `data.table` to achieve the same results. Execution time is down from 16 sec for the OP's `sqldf` code to less than 0.2 sec for the `data.table` versions.
### `data.table` versions of the edited example
The OP has edited the question to reduce the number of `sqldf` statements. Now, only one aggregate is computed.

The new column `winner_overall_wins` in `data2` is the count of all matches the winner has won before the actual tourney started. This number is attached to all matches of the actual tourney which were won by the winner. (Note that this is a different aggregation than the count of matches won before the actual *match*.)
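To illustrate the difference between the two aggregations, here is a minimal sketch on toy data (all values made up, the helper column names are mine, and rows are assumed to be in chronological order):

```r
library(data.table)

# toy data (hypothetical): one player who won two matches in each of two tourneys
toy <- data.table(winner_id    = 1L,
                  tourney_date = c(20000101L, 20000101L, 20000201L, 20000201L),
                  match_num    = 1:4)

# wins before the actual match vs. wins before the actual tourney
toy[, wins_before_match   := seq_len(.N) - 1L, by = winner_id]
toy[, wins_before_tourney := min(wins_before_match), by = .(winner_id, tourney_date)]
toy
#    winner_id tourney_date match_num wins_before_match wins_before_tourney
# 1:         1     20000101         1                 0                   0
# 2:         1     20000101         2                 1                   0
# 3:         1     20000201         3                 2                   2
# 4:         1     20000201         4                 3                   2
```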
Since version 1.9.8 (on CRAN 25 Nov 2016), `data.table` has been capable of non-equi joins. In addition, `fread()` can be advised to read only selected columns, which further speeds up I/O.
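As a minimal sketch of the non-equi join syntax (toy data again, hypothetical values), counting for each row how many rows share the same `id` but have a strictly smaller `day`:

```r
library(data.table)

dt <- data.table(id = c(1L, 1L, 1L, 2L), day = c(1L, 2L, 3L, 1L))

# self-join: for each row of the inner dt (i), count the rows of the
# outer dt (x) with equal id and smaller day; .N is 0 when nothing matches
dt[dt, on = .(id, day < day), .(earlier = .N), by = .EACHI]
#    id day earlier
# 1:  1   1       0
# 2:  1   2       1
# 3:  1   3       2
# 4:  2   1       0
```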
```r
library(data.table)   # v1.11.2

urls <- sprintf(
  "https://raw.githubusercontent.com/JeffSackmann/tennis_atp/master/atp_matches_%i.csv",
  2000:2018)

selected_cols <- c("tourney_name", "tourney_date", "match_num",
                   "winner_id", "winner_name",
                   "loser_id", "loser_name")

# read only selected columns from files & combine into one data object
matches <- rbindlist(lapply(urls, fread, select = selected_cols))

# non-equi join to compute aggregate, second join to append, order result
system.time({
  result_nej <- matches[
    unique(matches[matches, on = .(winner_id, tourney_date < tourney_date),
                   .(winner_overall_wins = .N), by = .EACHI]),
    on = .(winner_id, tourney_date)][
      order(-tourney_date, tourney_name, -match_num)]
})
```
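The inner non-equi join counts, for every match row, the winner's wins before the actual tourney; `unique()` reduces this to one row per winner and tourney, and the subsequent equi join appends the count to all match rows of the respective tourney.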
The two `data.table` joins and the subsequent ordering took an elapsed time of around 0.15 sec on my system, vs. 16 to 19 sec for various runs of the OP's `sqldf` code.
The history of a particular player can be retrieved by

```r
p_name <- "Federer"; result_nej[winner_name %like% p_name | loser_name %like% p_name]
```
```
                     tourney_name tourney_date match_num winner_id   winner_name loser_id         loser_name winner_overall_wins
   1:             Australian Open     20180115       701    103819 Roger Federer   105227        Marin Cilic                1128
   2:             Australian Open     20180115       602    103819 Roger Federer   111202        Hyeon Chung                1128
   3:             Australian Open     20180115       504    103819 Roger Federer   104607      Tomas Berdych                1128
   4:             Australian Open     20180115       408    103819 Roger Federer   105916   Marton Fucsovics                1128
   5:             Australian Open     20180115       316    103819 Roger Federer   104755    Richard Gasquet                1128
  ---
1131:                   Marseille     20000207         3    103819 Roger Federer   102179      Antony Dupuis                   4
1132: Davis Cup WG R1: SUI vs AUS     20000204         2    103819 Roger Federer   102882 Mark Philippoussis                   3
1133:             Australian Open     20000117        90    103819 Roger Federer   102466        Jan Kroslak                   1
1134:             Australian Open     20000117        52    103819 Roger Federer   102021      Michael Chang                   1
1135:                    Adelaide     20000103         2    103819 Roger Federer   102533   Jens Knippschild                   0
```
There is an alternative and faster solution using `cumsum()` and `shift()`:
```r
system.time({
  # cumulative operations require ordered data
  setorder(matches, tourney_date, tourney_name, match_num)
  # add tourney id for convenience and conciseness
  matches[, t_id := rleid(tourney_date, tourney_name)]
  # aggregate by player and tourney
  p_t_hist <- matches[, .(winner_won = .N), by = .(winner_id, t_id)]
  # compute cumulative sum for each player and
  # lag to show only matches of previous tourneys
  tmp <- p_t_hist[order(t_id),
                  .(t_id, winner_overall_wins = shift(cumsum(winner_won))),
                  by = winner_id]
  # append new column & order result
  result_css <- matches[tmp, on = .(t_id, winner_id)][order(-t_id)]
})
```
```r
p_name <- "Federer"; result_css[winner_name %like% p_name | loser_name %like% p_name]
```
On my system, elapsed time is about 0.05 sec, which is 3 times faster than the non-equi join variant and orders of magnitude faster than the OP's approach.
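To verify that both variants agree, one can compare the two results. Below is a sketch, assuming `result_nej` and `result_css` are both still in memory; note that `shift()` yields `NA` for a player's first tourney where the non-equi join reports 0, so the `NA`s are replaced before comparing:

```r
# drop the helper column t_id, fill the NAs produced by shift() with 0,
# and align the row order of both results before comparing
chk <- copy(result_css)[, t_id := NULL]
chk[is.na(winner_overall_wins), winner_overall_wins := 0L]
setorder(chk,        -tourney_date, tourney_name, -match_num)
setorder(result_nej, -tourney_date, tourney_name, -match_num)
all.equal(result_nej, chk, check.attributes = FALSE)
```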