To solve your specific question, this is a tidyverse solution:
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
mutate(
fyear3=rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]]),
fyear5=rowSums(list(sapply(1:5, function(x) lag(data, x)))[[1]])
) %>%
ungroup()
# A tibble: 18 × 6
ID C fyear data fyear3 fyear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 NA NA
4 1 a 2003 3 102 NA
5 1 a 2004 6 75 NA
6 1 a 2005 11 31 111
7 3 b 2000 5 NA NA
8 3 b 2001 3 NA NA
9 3 b 2002 7 NA NA
10 5 b 2003 6 NA NA
11 5 b 2004 9 NA NA
12 4 c 2000 31 NA NA
13 4 c 2001 5 NA NA
14 4 c 2002 6 NA NA
15 4 c 2003 7 42 NA
16 4 c 2004 44 18 NA
17 4 c 2005 33 57 93
18 4 c 2006 2 84 95
The first mutate is a little hairy, so lets break one of the assignments down...
Find the nth lagged values of the data
column, for n=1, 2 and 3.
sapply(1:3, function(x) lag(data, x))
Changes in CEO and Company are handled by the group_by()
earlier in the pipe.
Create a list of these lagged values.
list(sapply(1:3, function(x) lag(data, x)))[[1]]
Row by row, calculate the sums of the lagged values
fyear3=rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]])
Now generalise the problem. Write a function takes as its inputs a dataset (so it works in a pipe), the new column, the column containing the values for which a lagged sum is required, and an integer defining the maximum lag.
lagSum <- function(data, newCol, valueCol, maxLag) {
data %>%
mutate(
{{newCol}} := rowSums(
list(
sapply(
1:maxLag,
function(x) lag({{valueCol}}, x)
)
)[[1]]
)
) %>%
ungroup()
}
The embracing ({{
and }}
) and use of :=
is required to handle tidyverse's non-standard evaluation (NSE).
Now use the function.
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, data, 3) %>%
lagSum(sumFYear5, data, 5)
# A tibble: 18 × 6
ID C fyear data sumFYear3 sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 NA NA
4 1 a 2003 3 102 NA
5 1 a 2004 6 75 NA
6 1 a 2005 11 31 111
7 3 b 2000 5 NA 92
8 3 b 2001 3 NA 47
9 3 b 2002 7 NA 28
10 5 b 2003 6 NA 32
11 5 b 2004 9 NA 32
12 4 c 2000 31 NA 30
13 4 c 2001 5 NA 56
14 4 c 2002 6 NA 58
15 4 c 2003 7 42 57
16 4 c 2004 44 18 58
17 4 c 2005 33 57 93
18 4 c 2006 2 84 95
EDIT
I misunderstood what you meant by "lag" and didn't read your description properly. My apologies.
I think your 86
in row 6 of your data5
column should be 92
. if not, please explain why not.
Getting the answers you want should be a simple matter of adapting the function I wrote. For example:
lagSum <- function(data, newCol, valueCol, maxLag) {
data %>%
mutate(
{{newCol}} := {{valueCol}} + rowSums(
list(
sapply(
1:maxLag,
function(x) lag({{valueCol}}, x)
)
)[[1]]
)
) %>%
mutate() %>%
ungroup()
}
Gives
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, data, 2)
# A tibble: 18 × 5
ID C fyear value sumFYear3
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA
2 1 a 2001 50 NA
3 1 a 2002 22 102
4 1 a 2003 3 75
5 1 a 2004 6 31
6 1 a 2005 11 20
7 3 b 2000 5 NA
8 3 b 2001 3 NA
9 3 b 2002 7 15
10 5 b 2003 6 NA
11 5 b 2004 9 NA
12 4 c 2000 31 NA
13 4 c 2001 5 NA
14 4 c 2002 6 42
15 4 c 2003 7 18
16 4 c 2004 44 57
17 4 c 2005 33 84
18 4 c 2006 2 79
and
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear5, data, 4)
# A tibble: 18 × 5
ID C fyear data sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA
2 1 a 2001 50 NA
3 1 a 2002 22 NA
4 1 a 2003 3 NA
5 1 a 2004 6 111
6 1 a 2005 11 92
7 3 b 2000 5 NA
8 3 b 2001 3 NA
9 3 b 2002 7 NA
10 5 b 2003 6 NA
11 5 b 2004 9 NA
12 4 c 2000 31 NA
13 4 c 2001 5 NA
14 4 c 2002 6 NA
15 4 c 2003 7 NA
16 4 c 2004 44 93
17 4 c 2005 33 95
18 4 c 2006 2 92
as expected, but
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, data, 2) %>%
lagSum(sumFYear5, data, 4)
# A tibble: 18 × 6
ID C fyear data sumFYear3 sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 102 NA
4 1 a 2003 3 75 NA
5 1 a 2004 6 31 111
6 1 a 2005 11 20 92
7 3 b 2000 5 NA 47
8 3 b 2001 3 NA 28
9 3 b 2002 7 15 32
10 5 b 2003 6 NA 32
11 5 b 2004 9 NA 30
12 4 c 2000 31 NA 56
13 4 c 2001 5 NA 58
14 4 c 2002 6 42 57
15 4 c 2003 7 18 58
16 4 c 2004 44 57 93
17 4 c 2005 33 84 95
18 4 c 2006 2 79 92
Not as expected. At the moment, I cannot explain why. I managed to get the correct answers for both 3 and 5 year lags in the same pipe with:
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, data, 2) %>%
left_join(
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear5, data, 4)
)
But that shouldn't be necessary. I will think about this some more and may post a question of my own if I can't find an explanation.
Alternatively, this question gives a solution using the zoo
package.