
Given the structure of my data below, I would like to calculate the forward five-year average annual growth rate of the column markup (where the first year's growth rate uses lead(markup)) for each individual (individuals are identified by the column gvkey), and add that average to the data frame as a new column. However, some individuals have fewer than five years of observations, and for every individual, the last four years of observations have fewer than five years of observations ahead of them. In those cases, the average annual growth rate should adjust to the number of observations ahead (with a maximum of five).

dput(example)
structure(list(gvkey = c(1001L, 1001L, 1001L, 1003L, 1003L, 1003L, 
1003L, 1003L, 1003L, 1003L, 1004L, 1004L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 1004L, 
1004L, 1004L, 1004L, 1004L, 1004L, 1004L), fyear = c(1983L, 1984L, 
1985L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 1980L, 
1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1988L, 1989L, 
1990L, 1991L, 1992L, 1993L, 1994L, 1995L, 1996L, 1997L, 1998L, 
1999L), markup = c(3.02456418383518, 2.91714600416106, 2.97620103473762, 
0.628645648836935, 0.538264738598443, 0.74536402337831, 0.89905329776662, 
0.571759161863088, 0.510497237569061, 0.621391904401246, 0.320146680750145, 
0.277978758953348, 0.31442332968701, 0.319433516915814, 0.324865816687745, 
0.335264348013352, 0.328048313395744, 0.326632245360565, 0.340874293859881, 
0.320374201245953, 0.27456562124358, 0.276693369097675, 0.245072145096866, 
0.241026046834387, 0.242841330851661, 0.249635000371186, 0.257903948772679, 
0.262641379065405, 0.261534064206543, 0.22953354130982)), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -30L), groups = structure(list(
    gvkey = c(1001L, 1003L, 1004L), .rows = structure(list(1:3, 
        4:10, 11:30), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -3L), .drop = TRUE))

This is as far as I got:

example %>%
  filter(fyear %in% 1980:2019) %>%
  group_by(gvkey) %>%
  mutate(markupchange = ((lead(markup) - markup) / markup +
                         (lead(markup, 2) - lead(markup)) / lead(markup) +
                         (lead(markup, 3) - lead(markup, 2)) / lead(markup, 2) +
                         (lead(markup, 4) - lead(markup, 3)) / lead(markup, 3) +
                         (lead(markup, 5) - lead(markup, 4)) / lead(markup, 4)) / 5)

What I can't figure out is how to shorten the length of the averaging window for those cases with fewer than 5 observations ahead.

As an output I would like to get back the same data frame with the extra column for the average annual growth rate of the markup. The value in row 1 of the added column should be -0,00628231878798876 and in the second row 0,020547945. Many thanks for any tips.

2 Answers


Since your system is complaining about the size of your data, I'm going to recommend a non-join method (see the answer history for the first cut, which used a non-equi join).

  1. It is a bespoke function applied per group, so it internally determines which data fall in each forward window and takes their mean;
  2. If necessary (based on memory complaints), I will suggest a data.table variant. I'll also provide the dplyr version, which works with this data but may not work with your larger data, in which case data.table (with its in-place referential semantics) will be preferred.

First, a quick helper-function:

fun <- function(mkup, fyr, span = 5) {
  sapply(fyr, function(yr) {
    # markup values from this year through `span` years ahead
    val <- mkup[between(fyr, yr, yr + span)]
    # year-over-year growth rates within the window; the trailing NA
    # pads diff() back to length(val) and is dropped by na.rm
    mean(c(diff(val), NA) / val, na.rm = TRUE)
  })
}

noting that between() can come from either dplyr or data.table; they are equivalent in this use.
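To make the helper's behavior concrete, here is a quick sanity check on toy data of my own (not from the original answer), where markup grows by exactly 10% per year:

```r
library(dplyr)  # for between()

# The helper from above, repeated so this chunk runs standalone
fun <- function(mkup, fyr, span = 5) {
  sapply(fyr, function(yr) {
    val <- mkup[between(fyr, yr, yr + span)]
    mean(c(diff(val), NA) / val, na.rm = TRUE)
  })
}

# The forward average growth rate is 0.1 for every year that still has
# observations ahead of it; the last year has none, hence NaN
fun(c(100, 110, 121), c(2001, 2002, 2003))
# [1] 0.1 0.1 NaN
```

Note that the last row of each group comes out as NaN rather than NA, since there are no forward growth rates left to average.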

### data.table
library(data.table)
EX <- as.data.table(example)
EX[, avg5 := fun(markup, fyear), by = gvkey]
### (same result as below)

### dplyr
example %>%
  group_by(gvkey) %>%
  mutate(avg5 = fun(markup, fyear)) %>%
  ungroup()
# # A tibble: 30 × 4
#    gvkey fyear markup      avg5
#    <int> <int>  <dbl>     <dbl>
#  1  1001  1983  3.02   -0.00764
#  2  1001  1984  2.92    0.0202 
#  3  1001  1985  2.98  NaN      
#  4  1003  1983  0.629  -0.00480
#  5  1003  1984  0.538   0.0674 
#  6  1003  1985  0.745  -0.0119 
#  7  1003  1986  0.899  -0.0847 
#  8  1003  1987  0.572   0.0550 
#  9  1003  1988  0.510   0.217  
# 10  1003  1989  0.621 NaN      
# # … with 20 more rows
# # ℹ Use `print(n = ...)` to see more rows

r2evans
  • hi, thanks! what do you mean by non-equi join method? (sorry, I'm not into the jargon at all) Also, what I need is the 5-year forward average annual growth rate, so in row 3 the value of the added column should be NA. – Ayoze Alfageme Jan 21 '23 at 18:24
  • 1
    the value in row 1 of the added column should be -0,033112583 and in the second row 0,020547945 – Ayoze Alfageme Jan 21 '23 at 18:26
  • For most uses, the premise of a [merge/join](https://stackoverflow.com/q/1299871/3358272) is based on equality, but the join conditions can be true inequalities (`>`, `<=`, ...) with potentially explosive (number of rows) results. In this case, when we join `example` on itself, we do it by `"gvkey"` and `fyear` between `fyear` and `fy5`; this results in a much larger (close to 5x the number of rows) dataset which we must then summarize back to the original rows (ergo the need to add `rn` in the beginning). – r2evans Jan 21 '23 at 18:26
  • Okay, I see you want percentage growth. Please edit your original question to add an expected value for at least some of your rows. – r2evans Jan 21 '23 at 18:28
  • edited again for a correction in the results – Ayoze Alfageme Jan 21 '23 at 18:43
  • See my edit @AyozeAlfageme, I think it's closer. I wonder if your expected value of `0,00628231878798876` (from the question, not from your comment above) is different due to rounding and calculating manually, which I can reproduce with `mean(c((2.92-3.02)/3.02, (2.98-2.92)/2.92))` though the real values in the frame have more digits and therefore a different gain. – r2evans Jan 21 '23 at 18:58
  • but what I want is the Average Annual growth rate of the next 5 years. So GR1= growth rate next years, GR2= growth rate two years ahead, etc... so the average annual growth rate would be = (GR1 +GR2 +GR3 +GR4+GR5)/5 – Ayoze Alfageme Jan 21 '23 at 19:12
  • 1
    Please demonstrate how what I calculated here is incorrect. For example, `gvkey==1001` (top 3 rows) gives us: `filter(example, gvkey==1003) %>% summarize(GR1=(markup[2]-markup[1])/markup[1], GR2=(markup[3]-markup[2])/markup[2], GR3=(markup[3]-markup[2])/markup[2], GR4=(markup[4]-markup[3])/markup[3], GR5=(markup[5]-markup[4])/markup[4])`. This produces GRs of `-0.144 0.385 0.206 -0.364 -0.107` (all numbers here are rounded for print, real-float underneath). Averaging those is `-0.00480` (again, rounded from real-floats), which is what my answer produces. What am I missing? – r2evans Jan 21 '23 at 19:33
  • (I started saying `1001` but calculated on `1003`, sorry for the mistype above ... it's all on 1003.) – r2evans Jan 21 '23 at 20:02
  • yes, I realized. All good. – Ayoze Alfageme Jan 21 '23 at 23:06
  • I am now using the code for a much larger data frame but I only get NaN after row 30. Is the code capped at 30 rows or something like that? – Ayoze Alfageme Jan 21 '23 at 23:12
  • I am sorry for asking those questions above, but I don't quite understand the full code, which makes it quite unwieldy for me to change it. I don't understand why it does not work for a larger database... – Ayoze Alfageme Jan 21 '23 at 23:47
  • *"after row 30"* doesn't give me a lot, is that the first 30 rows of the whole frame, or 30 rows per `gvkey`? See my edit, it explains a bit more. – r2evans Jan 22 '23 at 16:39
  • 30 rows of the whole frame – Ayoze Alfageme Jan 22 '23 at 17:17
  • When I upload a larger file (+200,000 rows), R gives me the message that a vector of that size cannot be allocated. I guess my computer is not powerful enough. Is there any way to make the computation more efficient? – Ayoze Alfageme Jan 22 '23 at 17:36
  • 1
    I didn't think 200,000 would be a problem, since at most it expands to 1Mi rows (which is generally fine in R with reasonable amount of RAM on the system). However, I've made a (significantly-different) edit to the answer, see if that makes sense. – r2evans Jan 22 '23 at 18:34
  • It does make sense and my computer goes through it quite fast. Thanks! – Ayoze Alfageme Jan 22 '23 at 19:04

Here's a solution using frollmean(), which might be more computationally efficient.

First we need a helper that creates rolling partial window sizes for use in frollmean(). We'll use it to bound the window sizes by each group's size.

# Get rolling window sizes, including partial
window_seq <- function(k, n, partial = TRUE){
  if (length(k) != 1L) stop("k must be of length 1.")
  if (length(n) != 1L) stop("n must be of length 1.")
  if (n > .Machine[["integer.max"]]){
    stop("n must not be greater than .Machine$integer.max")
  }
  n <- as.integer(n)
  k <- as.integer(k)
  k <- min(k, n) # Bound k to <= n
  k <- max(k, 0L) # Bound k to >= 0
  pk <- max(k - 1L, 0L) # Partial k, bounded to >= 0
  p_seq <- seq_len(pk) # Partial window sequence
  out <- rep_len(k, n)
  # Replace partial part with partial sequence
  if (partial){
    out[p_seq] <- p_seq
  } else {
    out[p_seq] <- NA_integer_
  }
  out
}
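To illustrate what the helper produces (my own toy calls, not from the answer): the window sizes ramp up over the first k - 1 positions and then stay at k, or are NA there when partial windows are disallowed.

```r
# The helper from above, repeated (behavior-identically) so this chunk
# runs standalone
window_seq <- function(k, n, partial = TRUE){
  if (length(k) != 1L) stop("k must be of length 1.")
  if (length(n) != 1L) stop("n must be of length 1.")
  if (n > .Machine[["integer.max"]]){
    stop("n must not be greater than .Machine$integer.max")
  }
  n <- as.integer(n)
  k <- min(max(as.integer(k), 0L), n)  # bound k to [0, n]
  pk <- max(k - 1L, 0L)                # partial part of the sequence
  p_seq <- seq_len(pk)
  out <- rep_len(k, n)
  out[p_seq] <- if (partial) p_seq else NA_integer_
  out
}

window_seq(5, 8)                   # 1 2 3 4 5 5 5 5
window_seq(5, 3)                   # 1 2 3
window_seq(5, 8, partial = FALSE)  # NA NA NA NA 5 5 5 5
```

Because the code below sorts each group in descending row order before calling frollmean(), the short windows at the start here line up with the last rows of each group, which is exactly where fewer than five forward observations exist.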

data.table

library(data.table)

# data.table
example2 <- copy(example)
setDT(example2)

example2[, pchange := (markup/shift(markup)) - 1,
         by = gvkey]
example2[, row_id := rowid(gvkey)]
setorderv(example2, c("gvkey", "row_id"), order = c(1L, -1L))
example2[, avg5 := frollmean(shift(pchange), n = window_seq(5, .N),
                             adaptive = TRUE, na.rm = TRUE),
         by = gvkey]
setorderv(example2, c("gvkey", "row_id"))
print(example2, n = 10)
#>     gvkey fyear    markup      pchange row_id         avg5
#>  1:  1001  1983 3.0245642           NA      1 -0.007635573
#>  2:  1001  1984 2.9171460 -0.035515259      2  0.020244112
#>  3:  1001  1985 2.9762010  0.020244112      3          NaN
#>  4:  1003  1983 0.6286456           NA      1 -0.004802629
#>  5:  1003  1984 0.5382647 -0.143770836      2  0.067397285
#> ---                                                       
#> 26:  1004  1995 0.2496350  0.027975755     16 -0.018769986
#> 27:  1004  1996 0.2579039  0.033124155     17 -0.036068033
#> 28:  1004  1997 0.2626414  0.018368972     18 -0.063286535
#> 29:  1004  1998 0.2615341 -0.004216072     19 -0.122356998
#> 30:  1004  1999 0.2295335 -0.122356998     20          NaN

dplyr

library(dplyr)
example %>%
  group_by(gvkey) %>%
  mutate(pchange = (markup/lag(markup)) - 1) %>%
  mutate(row_id = row_number()) %>%
  arrange(desc(row_id), .by_group = TRUE) %>%
  mutate(avg5 = frollmean(lag(pchange), n = window_seq(5, n()),
                          adaptive = TRUE, na.rm = TRUE)) %>%
  arrange(row_id, .by_group = TRUE)
#> # A tibble: 30 x 6
#> # Groups:   gvkey [3]
#>    gvkey fyear markup pchange row_id      avg5
#>    <int> <int>  <dbl>   <dbl>  <int>     <dbl>
#>  1  1001  1983  3.02  NA           1  -0.00764
#>  2  1001  1984  2.92  -0.0355      2   0.0202 
#>  3  1001  1985  2.98   0.0202      3 NaN      
#>  4  1003  1983  0.629 NA           1  -0.00480
#>  5  1003  1984  0.538 -0.144       2   0.0674 
#>  6  1003  1985  0.745  0.385       3  -0.0119 
#>  7  1003  1986  0.899  0.206       4  -0.0847 
#>  8  1003  1987  0.572 -0.364       5   0.0550 
#>  9  1003  1988  0.510 -0.107       6   0.217  
#> 10  1003  1989  0.621  0.217       7 NaN      
#> # i 20 more rows

For actual "growth rates", one can use the geometric mean of the percent changes, which gives the expected percent change per unit of time.

# Growth rates
example %>%
  group_by(gvkey) %>%
  mutate(pchange = (markup/lag(markup))) %>%
  mutate(row_id = row_number()) %>%
  arrange(desc(row_id), .by_group = TRUE) %>%
  mutate(growth_rate5 = exp(frollmean(log(lag(pchange)), n = window_seq(5, n()),
                                      adaptive = TRUE, na.rm = TRUE)) - 1) %>%
  arrange(row_id, .by_group = TRUE)
#> # A tibble: 30 x 6
#> # Groups:   gvkey [3]
#>    gvkey fyear markup pchange row_id growth_rate5
#>    <int> <int>  <dbl>   <dbl>  <int>        <dbl>
#>  1  1001  1983  3.02   NA          1     -0.00803
#>  2  1001  1984  2.92    0.964      2      0.0202 
#>  3  1001  1985  2.98    1.02       3    NaN      
#>  4  1003  1983  0.629  NA          1     -0.0408 
#>  5  1003  1984  0.538   0.856      2      0.0291 
#>  6  1003  1985  0.745   1.38       3     -0.0445 
#>  7  1003  1986  0.899   1.21       4     -0.116  
#>  8  1003  1987  0.572   0.636      5      0.0425 
#>  9  1003  1988  0.510   0.893      6      0.217  
#> 10  1003  1989  0.621   1.22       7    NaN      
#> # i 20 more rows
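A toy illustration of why the geometric mean is the right notion of "average growth rate" (my example, not part of the answer): a +25% year followed by a -25% year leaves you below where you started, which the arithmetic mean of the percent changes hides.

```r
ratios <- c(1.25, 0.75)  # +25% then -25%

# Arithmetic mean of the percent changes says "no change"...
mean(ratios - 1)
# [1] 0

# ...but the geometric mean recovers the true per-year rate,
# sqrt(1.25 * 0.75) - 1: you actually lose about 3.2% per year
exp(mean(log(ratios))) - 1
# [1] -0.03175416
```

The exp(mean(log(...))) - 1 form is exactly what the growth_rate5 calculation above applies over each rolling window.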
NicChr