This is my first question on stackoverflow, so feel free to criticize the question.
For every row in a data set, I would like to sum the rows that:
- have identical 'team', 'season' and 'simulation_ID'.
- have 'match_ID' smaller than (and not equal to) the current 'match_ID'.
such that I find the accumulated number of points up to that match, for that team, season and simulation_ID, i.e. cumsum(simulation$team_points).
I have issues to implement the second condition without using an extremely slow for-loop.
The data looks like this:
match_ID | season | simulation_ID | home_team | team | match_result | team_points |
---|---|---|---|---|---|---|
2084 | 2020-2021 | 1 | TRUE | Liverpool | Away win | 0 |
2084 | 2020-2021 | 2 | TRUE | Liverpool | Draw | 1 |
2084 | 2020-2021 | 3 | TRUE | Liverpool | Away win | 0 |
2084 | 2020-2021 | 4 | TRUE | Liverpool | Away win | 0 |
2084 | 2020-2021 | 5 | TRUE | Liverpool | Home win | 3 |
2084 | 2020-2021 | 1 | FALSE | Burnley | Home win | 0 |
2084 | 2020-2021 | 2 | FALSE | Burnley | Draw | 1 |
My current solution is:
simulation$accumulated_points <- 0
for (row in 1:nrow(simulation)) {
simulation$accumulated_points[row] <-
sum(simulation$team_points[simulation$season==simulation$season[row] &
simulation$match_ID<simulation$match_ID[row] &
simulation$simulation_ID==simulation$simulation_ID[row] &
simulation$team==simulation$team[row]], na.rm = TRUE)
}
This works, but it is obviously too slow to use on large data sets. I cannot figure out how to speed it up. What is a good solution here?