
The main issue here is how to avoid loops when applying functions to subsets of panels.

I want to end up with data like these:

id  year   w    pdvw
1   1930   2    10
1   1940   3    15.5
1   1950   5    23.5
1   1960   7.5  27.5
1   1970   11   NA
1   1980   9    NA
2   1930   NA   NA
2   1940   NA   NA
2   1950   1    10
2   1960   3    17
2   1970   6    NA
2   1980   8    NA

The actual data are much more numerous and slightly more complex. I am trying to produce the last column (pdvw) from the other columns: pdvw is the sum of the current and next two (in time) entries of w (e.g. pdvw[1] = 2 + 3 + 5). I can easily write

for (t in seq(from = 1930, to = 1960, by = 10)) {
  for (i in c(1, 2)) {
    if (!is.na(w[id == i & year == t])) {
      # sum w over the current year and the following two decades
      pdvw[id == i & year == t] <- sum(w[id == i & t <= year & year <= t + 21])
    }
  }
}

My application is not very large (20 values for year and 150 values for id), but I have been told to avoid such loops when possible, so I want to see if there is a better way. I am not so concerned with avoiding the loop over years, because 20 iterations is negligible, but I do want to learn to be a better coder. I thought something with `by` might help, but I am not sure exactly what.

My solution above makes use (possibly dangerously) of the fact that missing values for w are never preceded by non-missing values (a mere coincidence of history, hence the "possibly dangerously"). I included the missing values because any solution must handle the fact that the pdvw calculation can only begin once data become available for a given panel.
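
For reference, here are the example data as a data frame (the name `dat` is my own; the loop above refers to the columns as bare vectors):

dat <- data.frame(
  id   = rep(1:2, each = 6),                        # two panels
  year = rep(seq(1930, 1980, by = 10), times = 2),  # 1930, 1940, ..., 1980
  w    = c(2, 3, 5, 7.5, 11, 9, NA, NA, 1, 3, 6, 8),
  pdvw = c(10, 15.5, 23.5, 27.5, NA, NA, NA, NA, 10, 17, NA, NA)  # desired result
)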

randy
    Possibly useful [**Q&A**](http://stackoverflow.com/questions/16840101/compute-rolling-sum-by-id-variables-with-missing-timepoints/16925973#16925973) – Henrik Dec 18 '14 at 17:04

2 Answers


You don't need a `for` loop; you can use `filter`:

transform(dat, pdvw2 = ave(w, id, FUN = function(x) c(filter(x, c(1, 1, 1))[-1], NA)))

where `dat` is the name of your data frame.

The result:

   id year    w pdvw pdvw2
1   1 1930  2.0 10.0  10.0
2   1 1940  3.0 15.5  15.5
3   1 1950  5.0 23.5  23.5
4   1 1960  7.5 27.5  27.5
5   1 1970 11.0   NA    NA
6   1 1980  9.0   NA    NA
7   2 1930   NA   NA    NA
8   2 1940   NA   NA    NA
9   2 1950  1.0 10.0  10.0
10  2 1960  3.0 17.0  17.0
11  2 1970  6.0   NA    NA
12  2 1980  8.0   NA    NA
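
To see what the one-liner does, inspect its pieces on the `w` values of `id` 1 (note that `filter` here is `stats::filter`, R's moving-average filter, not `dplyr::filter`):

x <- c(2, 3, 5, 7.5, 11, 9)                          # w for id == 1
as.numeric(stats::filter(x, c(1, 1, 1)))             # centred moving sum: x[i-1] + x[i] + x[i+1]
# NA 10.0 15.5 23.5 27.5 NA
as.numeric(c(stats::filter(x, c(1, 1, 1))[-1], NA))  # drop the leading NA and pad: x[i] + x[i+1] + x[i+2]
# 10.0 15.5 23.5 27.5 NA NA

`ave(w, id, FUN = ...)` then applies this shift-and-sum within each `id` separately, so values never cross panel boundaries.
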
Sven Hohenstein
  • @randy The current approach is applied to different `id`s independently. – Sven Hohenstein Dec 18 '14 at 19:39
  • Thanks for the solution. It does produce the desired output. The main problem I have with it is that it requires some hands-on fine-tuning (the `[-1],NA`) to get the vector components in the right place, which could be a source of errors. I think `lead()` might be better, but I am a novice. – randy Dec 18 '14 at 19:58
  • @randy The `filter` approach is quite efficient. Of course, you have to choose your favourite way to generate the results. – Sven Hohenstein Dec 18 '14 at 20:58

One option with dplyr is:

library(dplyr)
df %>%                                       # take the data frame
  group_by(id) %>%                           # group the data by "id"
  mutate(pdvw2 = w + lead(w) + lead(w, 2))   # sum of w and its next two values; lead() takes the following value(s)
#Source: local data frame [12 x 5]
#Groups: id
#
#   id year    w pdvw pdvw2
#1   1 1930  2.0 10.0  10.0
#2   1 1940  3.0 15.5  15.5
#3   1 1950  5.0 23.5  23.5
#4   1 1960  7.5 27.5  27.5
#5   1 1970 11.0   NA    NA
#6   1 1980  9.0   NA    NA
#7   2 1930   NA   NA    NA
#8   2 1940   NA   NA    NA
#9   2 1950  1.0 10.0  10.0
#10  2 1960  3.0 17.0  17.0
#11  2 1970  6.0   NA    NA
#12  2 1980  8.0   NA    NA

(where `df` is the name of your data frame)

talat
  • For anyone else who reads this, this use of `dplyr::mutate()` is the same as using `mutate()` from package plyr and (in this application) `transform()` from base. – randy Dec 18 '14 at 19:50
  • If you used `mutate` on its own, yes. But I also use `group_by` to apply the function to each group of `id`. You would notice the difference if your data in `w` had fewer NA entries. Check it out (a small illustration follows below). @randy – talat Dec 18 '14 at 20:03
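
To illustrate talat's point about `group_by`, here is a small sketch with made-up data that has no NAs (the names `df2`, `bad`, and `ok` are just for illustration):

library(dplyr)

# hypothetical data without NAs, so the grouping effect is visible
df2 <- data.frame(id = rep(1:2, each = 3),
                  w  = c(2, 3, 5, 1, 3, 6))

# without grouping, lead() reaches across the id boundary:
df2 %>% mutate(bad = w + lead(w) + lead(w, 2))
# row 2 (id 1) becomes 3 + 5 + 1 = 9, mixing w from id 1 and id 2

# with grouping, each id is treated as its own series:
df2 %>% group_by(id) %>% mutate(ok = w + lead(w) + lead(w, 2))
# row 2 (id 1) becomes NA, because id 1 has only one following value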