
This is my first question on stackoverflow, so feel free to criticize the question.

For every row in a data set, I would like to sum the rows that:

  • have identical 'team', 'season' and 'simulation_ID'.
  • have 'match_ID' smaller than (and not equal to) the current 'match_ID'.

such that I find the accumulated number of points up to that match, for that team, season and simulation_ID, i.e. cumsum(simulation$team_points).

I am having trouble implementing the second condition without an extremely slow for-loop.

The data looks like this:

match_ID  season     simulation_ID  home_team  team       match_result  team_points
2084      2020-2021  1              TRUE       Liverpool  Away win      0
2084      2020-2021  2              TRUE       Liverpool  Draw          1
2084      2020-2021  3              TRUE       Liverpool  Away win      0
2084      2020-2021  4              TRUE       Liverpool  Away win      0
2084      2020-2021  5              TRUE       Liverpool  Home win      3
2084      2020-2021  1              FALSE      Burnley    Home win      0
2084      2020-2021  2              FALSE      Burnley    Draw          1

My current solution is:

simulation$accumulated_points <- 0

for (row in 1:nrow(simulation)) {
  simulation$accumulated_points[row] <-
    sum(simulation$team_points[simulation$season == simulation$season[row] &
                               simulation$match_ID < simulation$match_ID[row] &
                               simulation$simulation_ID == simulation$simulation_ID[row] &
                               simulation$team == simulation$team[row]],
        na.rm = TRUE)
}

This works, but it is obviously too slow to use on large data sets. I cannot figure out how to speed it up. What is a good solution here?

Tor97
  • iterating over rows in a for loop is extremely inefficient. you should use `apply()` – shs Jan 24 '21 at 16:28
  • 1
    `apply` is not necessarily faster than an equivalent `for`-loop, see https://stackoverflow.com/a/42440872/4755970 – hdkrgr Jan 24 '21 at 16:30

1 Answer


For loops over rows are slow in interpreted languages like R and are best avoided where possible. Instead, use "vectorized operations", which apply a function to a whole vector rather than to each element separately. Native functions in R and popular packages often rely on optimized C++ code and linear algebra libraries under the hood, so these operations run much faster than an equivalent loop in R; your CPU can usually process many vector elements at once rather than going one by one. You can find more information about vectorization in this question.
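As a minimal illustration of the difference (toy vectors, not your data), compare the two styles of adding two vectors:

```r
x <- 1:100000
y <- rnorm(100000)

# loop version: one element at a time
out_loop <- numeric(length(x))
for (i in seq_along(x)) {
  out_loop[i] <- x[i] + y[i]
}

# vectorized version: one call over the whole vector
out_vec <- x + y

identical(out_loop, out_vec)  # TRUE, but the vectorized call is far faster
```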

In your specific example, you could for example use dplyr to transform your data:

library(dplyr)

df %>%
  # perform the same operation within each group
  group_by(team, season, simulation_ID) %>%
  # within each group, order the rows by match_ID (ascending)
  arrange(match_ID, .by_group = TRUE) %>%
  # take the team_points vector in each group and compute its cumulative sum,
  # writing the result into a new column named "points"
  mutate(points = cumsum(team_points))

The code above essentially decomposes the team_points column into one vector per group, then applies a single, highly optimized operation to each of them.
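Alternatively, on very large data sets a data.table version of the same idea tends to be even faster. A minimal sketch, assuming the columns shown in the question (and subtracting the current row's points so the total covers only matches with a strictly smaller match_ID, as the question asks):

```r
library(data.table)

# sample rows mirroring the question's data
dt <- data.table(
  match_ID      = 2084,
  season        = "2020-2021",
  simulation_ID = c(1, 2, 3, 4, 5, 1, 2),
  team          = c(rep("Liverpool", 5), rep("Burnley", 2)),
  team_points   = c(0, 1, 0, 0, 3, 0, 1)
)

setorder(dt, match_ID)  # ensure ascending match order

# cumulative points per (team, season, simulation_ID),
# minus the current match so only earlier matches count
dt[, accumulated_points := cumsum(team_points) - team_points,
   by = .(team, season, simulation_ID)]
```

With only one match_ID in the sample every accumulated_points value is 0; over a full season the column grows match by match.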

hdkrgr
  • 1
    Exactly what I needed. I was aware that I am supposed to use a vectorization here. The missing link was the "arrange" part. Greatly appreciated, thank you! – Tor97 Jan 24 '21 at 16:43