1

I have a very large database that looks like this. For cntext, the data appartains to different companies with their related CEOs (ID) and the different years each CEO was in charge

ID <- c(1,1,1,1,1,1,3,3,3,5,5,4,4,4,4,4,4,4)
C <- c('a','a','a','a','a','a','b','b','b','b','b','c','c','c','c','c','c','c')
fyear <- c(2000, 2001, 2002,2003,2004,2005,2000, 2001,2002,2003,2004,2000, 2001, 2002,2003,2004,2005,2006)
data <- c(30,50,22,3,6,11,5,3,7,6,9,31,5,6,7,44,33,2)
df1 <- data.frame(ID,C,fyear, data)

ID  C   fyear  data
1   a   2000    30  
1   a   2001    50  
1   a   2002    22  
1   a   2003    3   
1   a   2004    6   
1   a   2005    11  
3   b   2000    5   
3   b   2001    3   
3   b   2002    7   
5   b   2003    6   
5   b   2004    9   
4   c   2000    31  
4   c   2001    5   
4   c   2002    6   
4   c   2003    7   
4   c   2004    44  
4   c   2005    33  
4   c   2006    2       

I need to build a code that allows me to sum up the previous 5 and 3 data related to each ID for every year. So t-3 and t-5 for each year. The result is something like this.

ID  C   fyear  data data3data5
1   a   2000    30  NA  NA
1   a   2001    50  NA  NA
1   a   2002    22  102 NA
1   a   2003    3   75  NA
1   a   2004    6   31  111
1   a   2005    11  20  86
3   b   2000    5   NA  NA
3   b   2001    3   NA  NA
3   b   2002    7   15  NA
5   b   2003    6   NA  NA
5   b   2004    9   NA  NA
4   c   2000    31  NA  NA
4   c   2001    5   NA  NA
4   c   2002    6   42  NA
4   c   2003    7   18  NA
4   c   2004    44  57  93
4   c   2005    33  84  95
4   c   2006    2   79  92

I have different columns of data for which I need to perform this operation, so if somebody also knows how I can do that and create a data3 and data5 column also for the other columns of data that I have that would be amazing. But even just being able to do the summation that I need is great! Thanks a lot. I hav looked around but don't seem to find any similar cses that satisfy my need

Jon Spring
  • 55,165
  • 4
  • 35
  • 53
  • Should `company` be `C`? And there's a one-to-one correspondence between `ID` and `C`/`company` which doesn't help... – Limey Nov 11 '22 at 15:25
  • Company b is related to two different IDs (3 and 5) so there's no one-to-one correspondence which is what makes it difficult for me to build a proper function or line of code – Erika Dal Cortivo Nov 14 '22 at 09:33

3 Answers3

1

We can use rollsumr to perform the rolling sums.

library(dplyr, exclude = c("filter", "lag"))
library(zoo)

df1 %>%
  group_by(ID, C) %>%
  mutate(data3 = rollsumr(data, 3, fill = NA),
         data5 = rollsumr(data, 5, fill = NA)) %>%
  ungroup
## # A tibble: 18 x 6
##       ID C     fyear  data data3 data5
##    <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
##  1     1 a      2000    30    NA    NA
##  2     1 a      2001    50    NA    NA
##  3     1 a      2002    22   102    NA
##  4     1 a      2003     3    75    NA
##  5     1 a      2004     6    31   111
...snip...

To apply that to multiple columns, e.g. to apply it to fyear and to data use across:

df1 %>%
  group_by(ID, C) %>%
  mutate(across(c("fyear", "data"), 
             list(`3` = ~ rollsumr(., 3, fill = NA), 
                  `5` = ~ rollsumr(., 5, fill = NA)), 
             .names = "{.col}{.fn}")) %>%
  ungroup
## # A tibble: 18 x 8
##       ID C     fyear  data fyear3 fyear5 data3 data5
##    <dbl> <chr> <dbl> <dbl>  <dbl>  <dbl> <dbl> <dbl>
##  1     1 a      2000    30     NA     NA    NA    NA
##  2     1 a      2001    50     NA     NA    NA    NA
##  3     1 a      2002    22   6003     NA   102    NA
##  4     1 a      2003     3   6006     NA    75    NA
##  5     1 a      2004     6   6009  10010    31   111
...snip...
G. Grothendieck
  • 254,981
  • 17
  • 203
  • 341
0

To solve your specific question, this is a tidyverse solution:

df1 %>% 
  arrange(C, ID, fyear) %>% 
  group_by(C, ID) %>% 
  mutate(
    fyear3=rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]]),
    fyear5=rowSums(list(sapply(1:5, function(x) lag(data, x)))[[1]])
  ) %>%
  ungroup()
# A tibble: 18 × 6
      ID C     fyear  data fyear3 fyear5
   <dbl> <chr> <dbl> <dbl>  <dbl>  <dbl>
 1     1 a      2000    30     NA     NA
 2     1 a      2001    50     NA     NA
 3     1 a      2002    22     NA     NA
 4     1 a      2003     3    102     NA
 5     1 a      2004     6     75     NA
 6     1 a      2005    11     31    111
 7     3 b      2000     5     NA     NA
 8     3 b      2001     3     NA     NA
 9     3 b      2002     7     NA     NA
10     5 b      2003     6     NA     NA
11     5 b      2004     9     NA     NA
12     4 c      2000    31     NA     NA
13     4 c      2001     5     NA     NA
14     4 c      2002     6     NA     NA
15     4 c      2003     7     42     NA
16     4 c      2004    44     18     NA
17     4 c      2005    33     57     93
18     4 c      2006     2     84     95

The first mutate is a little hairy, so lets break one of the assignments down...

Find the nth lagged values of the data column, for n=1, 2 and 3.

sapply(1:3, function(x) lag(data, x))

Changes in CEO and Company are handled by the group_by() earlier in the pipe.

Create a list of these lagged values.

list(sapply(1:3, function(x) lag(data, x)))[[1]]

Row by row, calculate the sums of the lagged values

fyear3=rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]])

Now generalise the problem. Write a function takes as its inputs a dataset (so it works in a pipe), the new column, the column containing the values for which a lagged sum is required, and an integer defining the maximum lag.

lagSum <- function(data, newCol, valueCol, maxLag) {
  data %>% 
    mutate(
      {{newCol}} := rowSums(
                      list(
                        sapply(
                          1:maxLag, 
                          function(x) lag({{valueCol}}, x)
                        )
                      )[[1]]
                    )
    ) %>% 
    ungroup()
}

The embracing ({{ and }}) and use of := is required to handle tidyverse's non-standard evaluation (NSE).

Now use the function.

df1 %>% 
  arrange(C, ID, fyear) %>% 
  group_by(C, ID) %>% 
  lagSum(sumFYear3, data, 3) %>% 
  lagSum(sumFYear5, data, 5)
# A tibble: 18 × 6
      ID C     fyear  data sumFYear3 sumFYear5
   <dbl> <chr> <dbl> <dbl>     <dbl>     <dbl>
 1     1 a      2000    30        NA        NA
 2     1 a      2001    50        NA        NA
 3     1 a      2002    22        NA        NA
 4     1 a      2003     3       102        NA
 5     1 a      2004     6        75        NA
 6     1 a      2005    11        31       111
 7     3 b      2000     5        NA        92
 8     3 b      2001     3        NA        47
 9     3 b      2002     7        NA        28
10     5 b      2003     6        NA        32
11     5 b      2004     9        NA        32
12     4 c      2000    31        NA        30
13     4 c      2001     5        NA        56
14     4 c      2002     6        NA        58
15     4 c      2003     7        42        57
16     4 c      2004    44        18        58
17     4 c      2005    33        57        93
18     4 c      2006     2        84        95

EDIT

I misunderstood what you meant by "lag" and didn't read your description properly. My apologies.

I think your 86 in row 6 of your data5 column should be 92. if not, please explain why not.

Getting the answers you want should be a simple matter of adapting the function I wrote. For example:

lagSum <- function(data, newCol, valueCol, maxLag) {
  data %>% 
    mutate(
      {{newCol}} := {{valueCol}} + rowSums(
        list(
          sapply(
            1:maxLag, 
            function(x) lag({{valueCol}}, x)
          )
        )[[1]]
      )
    ) %>% 
    mutate() %>%
    ungroup() 
}

Gives

df1 %>% 
  arrange(C, ID, fyear) %>% 
  group_by(C, ID) %>% 
  lagSum(sumFYear3, data, 2)
# A tibble: 18 × 5
      ID C     fyear value sumFYear3
   <dbl> <chr> <dbl> <dbl>     <dbl>
 1     1 a      2000    30        NA
 2     1 a      2001    50        NA
 3     1 a      2002    22       102
 4     1 a      2003     3        75
 5     1 a      2004     6        31
 6     1 a      2005    11        20
 7     3 b      2000     5        NA
 8     3 b      2001     3        NA
 9     3 b      2002     7        15
10     5 b      2003     6        NA
11     5 b      2004     9        NA
12     4 c      2000    31        NA
13     4 c      2001     5        NA
14     4 c      2002     6        42
15     4 c      2003     7        18
16     4 c      2004    44        57
17     4 c      2005    33        84
18     4 c      2006     2        79

and

df1 %>% 
  arrange(C, ID, fyear) %>% 
  group_by(C, ID) %>% 
  lagSum(sumFYear5, data, 4)
# A tibble: 18 × 5
      ID C     fyear  data sumFYear5
   <dbl> <chr> <dbl> <dbl>     <dbl>
 1     1 a      2000    30        NA
 2     1 a      2001    50        NA
 3     1 a      2002    22        NA
 4     1 a      2003     3        NA
 5     1 a      2004     6       111
 6     1 a      2005    11        92
 7     3 b      2000     5        NA
 8     3 b      2001     3        NA
 9     3 b      2002     7        NA
10     5 b      2003     6        NA
11     5 b      2004     9        NA
12     4 c      2000    31        NA
13     4 c      2001     5        NA
14     4 c      2002     6        NA
15     4 c      2003     7        NA
16     4 c      2004    44        93
17     4 c      2005    33        95
18     4 c      2006     2        92

as expected, but

df1 %>% 
  arrange(C, ID, fyear) %>% 
  group_by(C, ID) %>% 
  lagSum(sumFYear3, data, 2) %>% 
  lagSum(sumFYear5, data, 4)
# A tibble: 18 × 6
      ID C     fyear  data sumFYear3 sumFYear5
   <dbl> <chr> <dbl> <dbl>     <dbl>     <dbl>
 1     1 a      2000    30        NA        NA
 2     1 a      2001    50        NA        NA
 3     1 a      2002    22       102        NA
 4     1 a      2003     3        75        NA
 5     1 a      2004     6        31       111
 6     1 a      2005    11        20        92
 7     3 b      2000     5        NA        47
 8     3 b      2001     3        NA        28
 9     3 b      2002     7        15        32
10     5 b      2003     6        NA        32
11     5 b      2004     9        NA        30
12     4 c      2000    31        NA        56
13     4 c      2001     5        NA        58
14     4 c      2002     6        42        57
15     4 c      2003     7        18        58
16     4 c      2004    44        57        93
17     4 c      2005    33        84        95
18     4 c      2006     2        79        92

Not as expected. At the moment, I cannot explain why. I managed to get the correct answers for both 3 and 5 year lags in the same pipe with:

df1 %>% 
  arrange(C, ID, fyear) %>% 
  group_by(C, ID) %>% 
  lagSum(sumFYear3, data, 2) %>% 
  left_join(
    df1 %>% 
      arrange(C, ID, fyear) %>% 
      group_by(C, ID) %>% 
      lagSum(sumFYear5, data, 4)
  )

But that shouldn't be necessary. I will think about this some more and may post a question of my own if I can't find an explanation.

Alternatively, this question gives a solution using the zoo package.

Limey
  • 10,234
  • 2
  • 12
  • 32
  • Hi, thank you a lot for your answer! I have a few questions, I see that the answers (yours and my example) don't coincide. Maybe it's because I did not properly explain what I was looking for. I'm looking to get the sum of the most recent 3 and 5 year data for each ID in each company for each year. So the sumFY5 that you get is definitely not exactly what I'm looking for, as you can see from the example. Is there a way to fix it to get the result as close as possible to what I provided? Thanks a lot. Also in sumFY3 some results are missing. – Erika Dal Cortivo Nov 14 '22 at 09:26
  • Also, now that I look closer, it may be more clear if I specify that, for each year, I want to have the sum of the data for that year + the previous 2 or 4 years – Erika Dal Cortivo Nov 14 '22 at 13:47
  • Yes, i calculated the numbers by hand so that 92 is rigth. Thank you a lot for this answer. This is in line with what I was trying to achieve. I'll look into it and try to understand so I can replicate it in the future. Thanks a lot! – Erika Dal Cortivo Nov 14 '22 at 16:13
0

We can use frollsum within data.table

library(data.table)

d <- 2:5

setDT(df1)[
  ,
  c(paste0("data", d)) := lapply(d, frollsum, x = data),
  .(ID, C)
]

which yields

> df1
    ID C fyear data data2 data3 data4 data5
 1:  1 a  2000   30    NA    NA    NA    NA
 2:  1 a  2001   50    80    NA    NA    NA
 3:  1 a  2002   22    72   102    NA    NA
 4:  1 a  2003    3    25    75   105    NA
 5:  1 a  2004    6     9    31    81   111
 6:  1 a  2005   11    17    20    42    92
 7:  3 b  2000    5    NA    NA    NA    NA
 8:  3 b  2001    3     8    NA    NA    NA
 9:  3 b  2002    7    10    15    NA    NA
10:  5 b  2003    6    NA    NA    NA    NA
11:  5 b  2004    9    15    NA    NA    NA
12:  4 c  2000   31    NA    NA    NA    NA
13:  4 c  2001    5    36    NA    NA    NA
14:  4 c  2002    6    11    42    NA    NA
15:  4 c  2003    7    13    18    49    NA
16:  4 c  2004   44    51    57    62    93
17:  4 c  2005   33    77    84    90    95
18:  4 c  2006    2    35    79    86    92
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
  • Thank you a lot for your answer. The only issue I have wth this is that I only need column and and I need to replicate this calculation with many other rows so this code is not as efficient – Erika Dal Cortivo Nov 14 '22 at 16:16
  • @ErikaDalCortivo If you need `data3` and `data5` only, then you can begin with `d <- c(3,5)` instead – ThomasIsCoding Nov 14 '22 at 20:17