
This is my first question on stackoverflow, so feel free to criticize the question.

For every row in a data set, I would like to sum the rows that:

  • have identical 'team', 'season' and 'simulation_ID'.
  • have 'match_ID' smaller than (and not equal to) the current 'match_ID'.

such that I find the accumulated number of points up to that match, for that team, season and simulation_ID, i.e. cumsum(simulation$team_points).

I am having trouble implementing the second condition without an extremely slow for-loop.

The data looks like this:

match_ID  season     simulation_ID  home_team  team       match_result  team_points
2084      2020-2021  1              TRUE       Liverpool  Away win      0
2084      2020-2021  2              TRUE       Liverpool  Draw          1
2084      2020-2021  3              TRUE       Liverpool  Away win      0
2084      2020-2021  4              TRUE       Liverpool  Away win      0
2084      2020-2021  5              TRUE       Liverpool  Home win      3
2084      2020-2021  1              FALSE      Burnley    Home win      0
2084      2020-2021  2              FALSE      Burnley    Draw          1

My current solution is:

simulation$accumulated_points <- 0

for (row in 1:nrow(simulation)) {
  simulation$accumulated_points[row] <-
    sum(simulation$team_points[simulation$season == simulation$season[row] &
                               simulation$match_ID < simulation$match_ID[row] &
                               simulation$simulation_ID == simulation$simulation_ID[row] &
                               simulation$team == simulation$team[row]],
        na.rm = TRUE)
}

This works, but it is obviously too slow to use on large data sets. I cannot figure out how to speed it up. What is a good solution here?

Tor97
  • iterating over rows in a for loop is extremely inefficient. you should use `apply()` – shs Jan 24 '21 at 16:28
  • 1
    `apply` is not necessarily faster than an equivalent `for`-loop, see https://stackoverflow.com/a/42440872/4755970 – hdkrgr Jan 24 '21 at 16:30

1 Answer


For loops over rows are slow in interpreted languages like R and are best avoided where possible. Instead, use "vectorized operations", which apply a function to a whole vector rather than to each element separately. Native functions in R and popular packages often rely on optimized C++ code and linear algebra libraries under the hood, so these operations run much faster than an equivalent loop in R; your CPU can usually process many vector elements at once rather than going one by one. You can find more information about vectorization in this question.
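As a minimal illustration of the difference (toy vectors, not your data), compare the two styles of adding two vectors:

```r
x <- 1:100000
y <- rnorm(100000)

# loop version: one element at a time
out_loop <- numeric(length(x))
for (i in seq_along(x)) {
  out_loop[i] <- x[i] + y[i]
}

# vectorized version: one call over the whole vector
out_vec <- x + y

identical(out_loop, out_vec)  # TRUE, but the vectorized call is far faster
```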

In your specific example, you could for example use dplyr to transform your data:

library(dplyr)

df %>%
  # perform the same operation within each group
  group_by(team, season, simulation_ID) %>%
  # within each group, order the rows by match_ID (ascending)
  arrange(match_ID, .by_group = TRUE) %>%
  # take the team_points vector in each group and compute its cumulative sum,
  # writing the result into a new column named "points"
  mutate(points = cumsum(team_points))

The code above essentially decomposes the team_points column into one vector per group, then applies a single, highly optimized operation to each of them.
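Alternatively, on very large data sets a data.table version of the same idea tends to be even faster. A minimal sketch, assuming the columns shown in the question (and subtracting the current row's points so the total covers only matches with a strictly smaller match_ID, as the question asks):

```r
library(data.table)

# sample rows mirroring the question's data
dt <- data.table(
  match_ID      = 2084,
  season        = "2020-2021",
  simulation_ID = c(1, 2, 3, 4, 5, 1, 2),
  team          = c(rep("Liverpool", 5), rep("Burnley", 2)),
  team_points   = c(0, 1, 0, 0, 3, 0, 1)
)

setorder(dt, match_ID)  # ensure ascending match order

# cumulative points per (team, season, simulation_ID),
# minus the current match so only earlier matches count
dt[, accumulated_points := cumsum(team_points) - team_points,
   by = .(team, season, simulation_ID)]
```

With only one match_ID in the sample every accumulated_points value is 0; over a full season the column grows match by match.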

hdkrgr
  • 1
    Exactly what I needed. I was aware that I am supposed to use a vectorization here. The missing link was the "arrange" part. Greatly appreciated, thank you! – Tor97 Jan 24 '21 at 16:43